uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Apache License 2.0

make_schema_view regex error #442

Open working-estimate opened 4 years ago

working-estimate commented 4 years ago

In make_schema_view() the regex is:

def match_unischema_fields(schema, field_regex):
    if field_regex:
        unischema_fields = []
        for pattern in field_regex:  
            unischema_fields.extend(
                [field for field_name, field in schema.fields.items() if re.match(pattern, field_name)])

The end result is that if the Parquet file contains the columns x_1, x_2, ..., x_10, x_11 and we pass schema_fields=x_1, x_2, x_3 when calling make_batch_reader(), the function returns a unischema that also contains x_10 and x_11, which is clearly wrong. re.match() only anchors the pattern at the start of the field name, so the pattern x_1 matches every field whose name begins with x_1. The returned schema should contain only the fields x_1, x_2, x_3.
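The prefix-matching behavior can be reproduced without Petastorm. The following is a minimal sketch, assuming a plain dict as a stand-in for schema.fields; the helper name match_fields_prefix is hypothetical and only mirrors the loop quoted above.

```python
import re

# Hypothetical stand-in for schema.fields: field name -> field object.
fields = {name: object() for name in ["x_1", "x_2", "x_3", "x_10", "x_11"]}


def match_fields_prefix(fields, patterns):
    """Mirror the quoted loop: re.match anchors only at the start of the name."""
    matched = []
    for pattern in patterns:
        matched.extend(name for name in fields if re.match(pattern, name))
    return matched


selected = match_fields_prefix(fields, ["x_1", "x_2", "x_3"])
print(selected)  # x_10 and x_11 are picked up by the pattern "x_1"
```

Because "x_1" is a prefix of "x_10" and "x_11", both extra names end up in the selection even though they were never requested.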

selitvin commented 4 years ago

I can see how this is confusing. Meanwhile, consider using `"^x_1$"` to guarantee a full-string match.
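As a rough illustration of this workaround, the sketch below anchors each requested name at both ends before matching. The helper anchored() is hypothetical (not part of Petastorm), and the matching is done against a plain list of names rather than a real unischema.

```python
import re

field_names = ["x_1", "x_2", "x_3", "x_10", "x_11"]


def anchored(name):
    """Hypothetical helper: escape the name and anchor it at both ends."""
    return "^" + re.escape(name) + "$"


patterns = [anchored(name) for name in ["x_1", "x_2", "x_3"]]

# Same matching call as in the quoted loop, but with anchored patterns.
selected = [name for pattern in patterns
            for name in field_names if re.match(pattern, name)]
print(patterns)
print(selected)
```

With the trailing `$`, the pattern for x_1 no longer matches x_10 or x_11, so only the three requested names are selected.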

selitvin commented 4 years ago

I am considering introducing #446. It may give better behavior for the regex matching. What do you think?

working-estimate commented 4 years ago

> I am considering introducing #446. It may give better behavior for the regex matching. What do you think?

LGTM