uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Apache License 2.0

Dynamic shape of labels #774


ohindialign commented 1 year ago

Hey,

I'm using petastorm for object detection; each image might have a different number of objects in it. When I use make_reader and specify a shape of (-1, 5) for the labels inside a TransformSpec, everything works fine, but when I use make_batch_reader I get an error about the shape. (I tried (None, 5) too but still got an error.)

Is there a way to specify a dynamic size for some field? And why is there a difference between make_reader and make_batch_reader?
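For concreteness, a minimal sketch of the make_reader setup described above, assuming a dataset at a placeholder URL and a `labels` field (both are illustrative, not from the original report):

```python
import numpy as np
from petastorm import make_reader
from petastorm.transform import TransformSpec

# Declare the 'labels' field with a dynamic first dimension, as described above.
# The field name and dtype are assumptions for illustration.
transform = TransformSpec(
    edit_fields=[('labels', np.float32, (-1, 5), False)],
)

# 'file:///tmp/petastorm_dataset' is a placeholder URL.
with make_reader('file:///tmp/petastorm_dataset', transform_spec=transform) as reader:
    for row in reader:
        print(row.labels.shape)  # (num_objects, 5); first dimension varies per image
```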

Besides this, I'm getting a lot of FutureWarnings about the pyarrow version (working inside a Databricks environment). Do you know how I can avoid all these warnings?

Hope you will be able to help me. If any information is missing, let me know. petastorm version: 0.11.4

FutureWarning examples:

```
/databricks/python/lib/python3.9/site-packages/petastorm/py_dict_reader_worker.py:180: FutureWarning: 'ParquetDataset.partitions' attribute is deprecated as of pyarrow 5.0.0 and will be removed in a future version. Specify 'use_legacy_dataset=False' while constructing the ParquetDataset, and then use the '.partitioning' attribute instead.
  parquet_file = ParquetFile(self._dataset.fs.open(piece.path))

/databricks/python/lib/python3.9/site-packages/petastorm/fs_utils.py:88: FutureWarning: pyarrow.localfs is deprecated as of 2.0.0, please use pyarrow.fs.LocalFileSystem instead.
```

Thanks a lot!

selitvin commented 1 year ago

make_reader and make_batch_reader are quite different. Please read more here: https://github.com/uber/petastorm/#non-petastorm-parquet-stores and repost if you need further clarification.

The future warnings are a known issue. Hope to address it soon.
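In the meantime, one possible stopgap (plain Python warning filters, not a petastorm API) is to silence these specific warnings in your own code:

```python
import warnings

# Suppress the known pyarrow deprecation FutureWarnings raised from
# petastorm modules; FutureWarnings from other modules remain visible.
warnings.filterwarnings('ignore', category=FutureWarning, module='petastorm')
```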

ohindialign commented 1 year ago

I went over the documentation; my main question is why a dynamic field size is possible with make_reader but not with make_batch_reader. Is there any efficiency difference between them? Is make_batch_reader faster when my row groups are already saved at the batch size?

selitvin commented 1 year ago

make_batch_reader reads a row group and returns it with minimal processing, i.e., it is oriented toward batch data reading. make_reader returns the data from a row group in a row-by-row fashion.
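A minimal sketch of that difference, assuming placeholder dataset URLs and a `labels` column:

```python
from petastorm import make_reader, make_batch_reader

# Placeholder URLs: make_reader expects a Petastorm dataset,
# make_batch_reader works on any Parquet store.
with make_reader('file:///tmp/petastorm_dataset') as reader:
    row = next(iter(reader))      # one namedtuple per row

with make_batch_reader('file:///tmp/parquet_store') as reader:
    batch = next(iter(reader))    # one namedtuple of column arrays
    print(len(batch.labels))      # number of rows in this batch
```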

If your data has non-uniform sizes, as you describe, and you use make_batch_reader, you must use a TransformSpec to make all fields uniform (so that all rows in a batch can be collated). So it looks to me like you are headed in the right direction by trying to define a TransformSpec that does this with make_batch_reader; see the sketch below. Can you perhaps share a code snippet (preferably a runnable one) that demonstrates the problem?
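For illustration, a minimal sketch of such a TransformSpec, assuming a `labels` field, an assumed upper bound on objects per image, and zero-padding (the bound, field name, and padding strategy are all placeholders):

```python
import numpy as np
from petastorm import make_batch_reader
from petastorm.transform import TransformSpec

MAX_OBJECTS = 32  # assumed upper bound on objects per image

def pad_labels(pd_batch):
    # With make_batch_reader, the TransformSpec function receives a
    # pandas DataFrame; pad each row's variable-length labels to a
    # fixed (MAX_OBJECTS, 5) array so rows collate into a batch.
    def pad(labels):
        labels = np.asarray(labels, dtype=np.float32).reshape(-1, 5)
        out = np.zeros((MAX_OBJECTS, 5), dtype=np.float32)
        n = min(len(labels), MAX_OBJECTS)
        out[:n] = labels[:n]
        return out
    pd_batch['labels'] = pd_batch['labels'].map(pad)
    return pd_batch

transform = TransformSpec(
    pad_labels,
    edit_fields=[('labels', np.float32, (MAX_OBJECTS, 5), False)],
)

# 'file:///tmp/parquet_store' is a placeholder URL.
with make_batch_reader('file:///tmp/parquet_store', transform_spec=transform) as reader:
    for batch in reader:
        ...  # every labels entry now has a uniform shape
```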