uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Apache License 2.0

Dynamic shape of labels #774


ohindialign commented 1 year ago

Hey,

I'm using petastorm for object detection; each image might have a different number of objects in it. When I use make_reader and specify a shape of (-1, 5) for the labels inside a TransformSpec, everything works fine, but when I use make_batch_reader I get an error about the shape. (I tried (None, 5) too but still got an error.)

Is there a way to specify a dynamic size for some field? And why is there a difference between make_reader and make_batch_reader?
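For concreteness, a minimal sketch of the make_reader setup described above, assuming a dataset at a placeholder URL and a `labels` field (both are illustrative, not from the original report):

```python
import numpy as np
from petastorm import make_reader
from petastorm.transform import TransformSpec

# Declare the 'labels' field with a dynamic first dimension, as described above.
# The field name and dtype are assumptions for illustration.
transform = TransformSpec(
    edit_fields=[('labels', np.float32, (-1, 5), False)],
)

# 'file:///tmp/petastorm_dataset' is a placeholder URL.
with make_reader('file:///tmp/petastorm_dataset', transform_spec=transform) as reader:
    for row in reader:
        print(row.labels.shape)  # (num_objects, 5); first dimension varies per image
```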

Besides this, I'm getting a lot of FutureWarnings about the pyarrow version (working inside a Databricks environment). Do you know how I can avoid all these warnings?

Hope you will be able to help me. If any information is missing, let me know. petastorm version: 0.11.4

FutureWarning examples:

```
/databricks/python/lib/python3.9/site-packages/petastorm/py_dict_reader_worker.py:180: FutureWarning: 'ParquetDataset.partitions' attribute is deprecated as of pyarrow 5.0.0 and will be removed in a future version. Specify 'use_legacy_dataset=False' while constructing the ParquetDataset, and then use the '.partitioning' attribute instead.
  parquet_file = ParquetFile(self._dataset.fs.open(piece.path))

/databricks/python/lib/python3.9/site-packages/petastorm/fs_utils.py:88: FutureWarning: pyarrow.localfs is deprecated as of 2.0.0, please use pyarrow.fs.LocalFileSystem instead.
```

Thanks a lot!

selitvin commented 1 year ago

make_reader and make_batch_reader are quite different. Please read more here: https://github.com/uber/petastorm/#non-petastorm-parquet-stores and repost if you need further clarification.

The future warnings are a known issue. Hope to address it soon.
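In the meantime, one possible stopgap (plain Python warning filters, not a petastorm API) is to silence these specific warnings in your own code:

```python
import warnings

# Suppress the known pyarrow deprecation FutureWarnings raised from
# petastorm modules; FutureWarnings from other modules remain visible.
warnings.filterwarnings('ignore', category=FutureWarning, module='petastorm')
```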

ohindialign commented 1 year ago

I went over the documentation; my main question is why a dynamic field size is possible with make_reader but not with make_batch_reader. Is there any efficiency difference between them? Is make_batch_reader faster when my row groups are already saved at the batch size?

selitvin commented 1 year ago

make_batch_reader reads a row group and returns it with minimal processing, i.e., it is oriented toward batch data reading. make_reader returns the data from a row group in a row-by-row fashion.
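A minimal sketch of that difference, assuming placeholder dataset URLs and a `labels` column:

```python
from petastorm import make_reader, make_batch_reader

# Placeholder URLs: make_reader expects a Petastorm dataset,
# make_batch_reader works on any Parquet store.
with make_reader('file:///tmp/petastorm_dataset') as reader:
    row = next(iter(reader))      # one namedtuple per row

with make_batch_reader('file:///tmp/parquet_store') as reader:
    batch = next(iter(reader))    # one namedtuple of column arrays
    print(len(batch.labels))      # number of rows in this batch
```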

If your data has non-uniform sizes, as you describe, and you use make_batch_reader, you must use a TransformSpec to make all fields uniform (so that all rows in a batch can be collated). So it looks to me like you are headed in the right direction by trying to define a TransformSpec that does this with make_batch_reader; see the sketch below. Can you perhaps share a code snippet (preferably a runnable one) that demonstrates the problem?
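For illustration, a minimal sketch of such a TransformSpec, assuming a `labels` field, an assumed upper bound on objects per image, and zero-padding (the bound, field name, and padding strategy are all placeholders):

```python
import numpy as np
from petastorm import make_batch_reader
from petastorm.transform import TransformSpec

MAX_OBJECTS = 32  # assumed upper bound on objects per image

def pad_labels(pd_batch):
    # With make_batch_reader, the TransformSpec function receives a
    # pandas DataFrame; pad each row's variable-length labels to a
    # fixed (MAX_OBJECTS, 5) array so rows collate into a batch.
    def pad(labels):
        labels = np.asarray(labels, dtype=np.float32).reshape(-1, 5)
        out = np.zeros((MAX_OBJECTS, 5), dtype=np.float32)
        n = min(len(labels), MAX_OBJECTS)
        out[:n] = labels[:n]
        return out
    pd_batch['labels'] = pd_batch['labels'].map(pad)
    return pd_batch

transform = TransformSpec(
    pad_labels,
    edit_fields=[('labels', np.float32, (MAX_OBJECTS, 5), False)],
)

# 'file:///tmp/parquet_store' is a placeholder URL.
with make_batch_reader('file:///tmp/parquet_store', transform_spec=transform) as reader:
    for batch in reader:
        ...  # every labels entry now has a uniform shape
```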