make_schema_view regex error

working-estimate commented 4 years ago

In make_schema_view(), the field regexes are applied like this:

def match_unischema_fields(schema, field_regex):
    if field_regex:
        unischema_fields = []
        for pattern in field_regex:  
            unischema_fields.extend(
                [field for field_name, field in schema.fields.items() if re.match(pattern, field_name)])

The end result is that if the parquet file contains the columns x_1, x_2, ... x_10, x_11 and we pass schema_fields=x_1, x_2, x_3 when calling make_batch_reader(), the function returns a unischema with the fields x_1, x_10, x_11, which is clearly wrong, because re.match() only anchors at the start of the name and so treats each pattern as a prefix. The schema should come back with the fields x_1, x_2, x_3.
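To make the prefix behavior concrete, here is a minimal reproduction that does not need petastorm at all. The `fields` dict and the `match_prefix` helper are hypothetical stand-ins for `schema.fields` and the loop quoted above, not the library's actual code:

```python
import re

# Hypothetical stand-in for schema.fields: column name -> field object.
fields = {name: object() for name in ["x_1", "x_2", "x_3", "x_10", "x_11"]}


def match_prefix(fields, patterns):
    """Mimic the quoted loop: re.match() anchors only at the start of the name."""
    matched = []
    for pattern in patterns:
        matched.extend(name for name in fields if re.match(pattern, name))
    return matched


# A single pattern "x_1" already pulls in every column that starts with "x_1".
selected = match_prefix(fields, ["x_1"])
print(selected)  # ['x_1', 'x_10', 'x_11']
```

Because `re.match()` succeeds as soon as the pattern matches at the beginning of the string, `"x_1"` also matches `"x_10"` and `"x_11"`.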

selitvin commented 4 years ago

I can see how this is confusing. Meanwhile, consider using `"^x_1$"` to guarantee a full-string match.
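A rough sketch of that workaround, assuming the column names are known up front. The `exact_patterns` helper is hypothetical (it is not part of petastorm); it just wraps each name in `^...$` so that `re.match()` can only succeed on the whole name:

```python
import re

columns = ["x_1", "x_2", "x_3", "x_10", "x_11"]
wanted = ["x_1", "x_2", "x_3"]


def exact_patterns(names):
    """Hypothetical helper: turn literal column names into anchored regexes."""
    # re.escape() guards against names that contain regex metacharacters.
    return ["^" + re.escape(name) + "$" for name in names]


patterns = exact_patterns(wanted)

# Same matching rule as the quoted loop (re.match), but with anchored patterns.
selected = [c for c in columns if any(re.match(p, c) for p in patterns)]
print(selected)  # ['x_1', 'x_2', 'x_3']
```

The resulting pattern list is what would be passed as schema_fields in place of the bare column names.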

selitvin commented 4 years ago

I was considering introducing #446. It may give better behavior for the regex matching. What do you think?

working-estimate commented 4 years ago

> I was considering introducing #446. It may give better behavior for the regex matching. What do you think?

LGTM

uber / petastorm

make_schema_view regex error #442