Ignore unsupported fields in parquet dataset

darkjh commented 3 years ago

Hi,

We're using petastorm to feed TensorFlow. Our parquet schema looks like this:

root
 |-- some_str: string (nullable = true)
 |-- some_map: map (nullable = true)
 |    |-- key: string
 |    |-- value: double (valueContainsNull = true)

So there are some columns with unsupported types, like map. However, we still get an error even if we specify schema_fields to select only the supported column:

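from petastorm import make_batch_reader

# Request only the supported column; the map column is left out on purpose.
with make_batch_reader('file://...', num_epochs=1, schema_fields=["some_str"]) as reader:
    ...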

Is there a way to ignore unsupported columns/fields? Maybe instead of looking at all columns here https://github.com/uber/petastorm/blob/master/petastorm/unischema.py#L333 , we could iterate over only the schema_fields, if provided by the user?
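
A rough sketch of that idea, for illustration only. It does not use petastorm's code: `SUPPORTED_TYPES`, `select_supported_fields`, and the plain dict standing in for the parquet schema are all hypothetical names, and the sketch assumes schema_fields holds exact column names. The point is that the type check runs only over the requested columns, so an unrequested map column never triggers an error.

```python
# Hypothetical sketch: none of these names are petastorm APIs.
# Assumed set of type names that the reader can convert (illustrative only).
SUPPORTED_TYPES = {"string", "double", "float", "int32", "int64", "bool"}


def select_supported_fields(column_types, schema_fields=None):
    """Return the columns to load, type-checking only the requested ones.

    column_types: dict of column name -> type name (stand-in for a parquet schema).
    schema_fields: optional list of exact column names requested by the user.
    """
    if schema_fields is None:
        # No selection given: every column is a candidate, as before.
        candidates = list(column_types)
    else:
        missing = [name for name in schema_fields if name not in column_types]
        if missing:
            raise KeyError(f"Unknown columns requested: {missing}")
        candidates = list(schema_fields)

    # Only the candidates are checked, so unrequested columns are ignored.
    unsupported = [n for n in candidates if column_types[n] not in SUPPORTED_TYPES]
    if unsupported:
        raise ValueError(f"Unsupported column types for: {unsupported}")
    return candidates


columns = {"some_str": "string", "some_map": "map"}
print(select_supported_fields(columns, schema_fields=["some_str"]))  # ['some_str']
```

Without schema_fields, the same call would still raise on some_map, which matches the current behaviour of checking every column.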

selitvin commented 3 years ago

I have a potential fix for this issue here: #686

Can you try installing petastorm from that branch and see if it helps your issue?

pip3 install git+https://github.com/selitvin/petastorm@allow_unsupported_types

darkjh commented 3 years ago

Seems to work as expected. Thanks for the quick fix! I'll close the issue once #686 is merged.

uber/petastorm, issue #685