uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Apache License 2.0

Ignore unsupported fields in parquet dataset #685

Status: Closed (darkjh closed this issue 3 years ago)

darkjh commented 3 years ago

Hi,

We're using petastorm to feed TensorFlow. Our Parquet schema looks like this:

root
 |-- some_str: string (nullable = true)
 |-- some_map: map (nullable = true)
 |    |-- key: string
 |    |-- value: double (valueContainsNull = true)

So some columns have an unsupported type, such as map. However, we still get an error even when we specify schema_fields to select only the supported column:

from petastorm import make_batch_reader

with make_batch_reader('file://...', num_epochs=1, schema_fields=["some_str"]) as reader:
    ...

Is there a way to ignore unsupported columns/fields? Maybe instead of looking at all columns here https://github.com/uber/petastorm/blob/master/petastorm/unischema.py#L333 , we could iterate over only the schema_fields if provided by the user?
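
To make the suggestion concrete, here is a rough sketch of the idea: validate only the columns the user asked for, so an unsupported column that was never requested cannot cause an error. This is not petastorm's code. The function name select_convertible_fields, the SUPPORTED_TYPES set, and the plain type-name strings are all hypothetical, and the sketch matches exact column names only.

```python
# Hypothetical sketch; none of these names are part of petastorm.
# Column types are represented as plain strings for illustration only.
SUPPORTED_TYPES = {"string", "double", "float", "int32", "int64", "boolean"}


def select_convertible_fields(parquet_columns, schema_fields=None):
    """Return {column: type} for the columns that should be converted.

    parquet_columns: dict mapping column name to a type-name string.
    schema_fields: optional list of exact column names requested by the user.
    """
    if schema_fields is None:
        # No selection given: every column is a candidate (the current behavior).
        candidates = dict(parquet_columns)
    else:
        missing = [name for name in schema_fields if name not in parquet_columns]
        if missing:
            raise KeyError(f"Requested columns not found: {missing}")
        # Only the requested columns are inspected; the rest are skipped.
        candidates = {name: parquet_columns[name] for name in schema_fields}

    unsupported = {n: t for n, t in candidates.items() if t not in SUPPORTED_TYPES}
    if unsupported:
        raise ValueError(f"Unsupported column types: {unsupported}")
    return candidates


columns = {"some_str": "string", "some_map": "map<string,double>"}
selected = select_convertible_fields(columns, schema_fields=["some_str"])
print(selected)
```

With schema_fields=["some_str"], the map column is never inspected, so no error is raised. Calling the function without schema_fields would still fail on some_map, which seems like reasonable behavior to keep.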

selitvin commented 3 years ago

I have a potential fix for this issue in #686.

Can you try installing petastorm from that branch and see if it helps your issue?

pip3 install git+https://github.com/selitvin/petastorm@allow_unsupported_types

darkjh commented 3 years ago

Seems to work as expected. Thanks for the quick fix! I'll close the issue once #686 is merged.