uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.

Access a specific row in the dataframe #699

Open 2006pmach opened 3 years ago

2006pmach commented 3 years ago

Hi guys

I was googling for a while but couldn't find an answer to my question, so I will post it here. I am looking for a storage format for a deep learning task with lots of videos. Petastorm looks interesting because it allows encoding all the frames directly as JPEG images. However, I am unsure how to use it with a custom DataLoader in PyTorch. Instead of using a predefined ordering or iterating over all frames, I would like to query a sparse set of frames from a randomly selected video for each training sample in a batch, similar to a dictionary lookup. All I could find is the Python API with the Reader object, which doesn't seem to support what I am looking for. So my question is: is there an efficient way to query and decode specific rows of a petastorm dataset for deep learning with PyTorch? Thanks in advance.
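To make it concrete, the access pattern I have in mind looks roughly like a map-style PyTorch Dataset. `load_frame` here is a hypothetical keyed lookup, not an existing petastorm call:

```python
import torch
from torch.utils.data import Dataset

class SparseFrameDataset(Dataset):
    """Fetches arbitrary (video_id, frame_id) pairs on demand."""

    def __init__(self, samples):
        # samples: list of (video_id, frame_id) pairs picked for this epoch
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        video_id, frame_id = self.samples[idx]
        # load_frame is hypothetical: the dictionary-style lookup I am after.
        frame = load_frame(video_id, frame_id)
        return torch.from_numpy(frame)
```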

selitvin commented 3 years ago

How long are your video clips? Do you want to read video segments from a clip for your training? How many frames long are these segments? How many clips do you have in total? What is the typical resolution of your videos?

I think there are many ways to approach your problem. I am not sure parquet is the best fit in your case, but the devil is in the details.
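One thing that exists today: `make_reader` accepts a `predicate`, so you can filter down to specific rows. A minimal sketch, assuming your dataset has `video_id` and `frame_id` columns; note this filters rows while scanning the dataset, it is not O(1) random access:

```python
from petastorm import make_reader
from petastorm.predicates import in_lambda

# Hypothetical set of (video_id, frame_id) keys wanted for this epoch.
wanted = {(17, 120), (17, 180), (42, 3)}

predicate = in_lambda(['video_id', 'frame_id'],
                      lambda video_id, frame_id: (video_id, frame_id) in wanted)

# Rows failing the predicate are dropped while the row groups are scanned.
with make_reader('file:///tmp/frames_dataset', predicate=predicate) as reader:
    for row in reader:
        process(row.frame)  # process() is a hypothetical downstream step
```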

2006pmach commented 3 years ago

Each video clip typically contains a few thousand frames, and I have around one to two thousand clips per dataset. During training I want to read single frames from a clip, specified by frame number. Depending on the model, I might need consecutive frames or frames that are several frames apart to form a training episode; these are then stacked into a batch. Resolution varies but can be up to 720x1080.

I am using HDF5 at the moment. It works, but I was hoping to find something faster...
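Roughly, my current HDF5 layout looks like this: one binary blob per frame, decoded on access (the file and dataset names here are just illustrative):

```python
import io
import h5py
import numpy as np
from PIL import Image

frame_id = 120  # e.g., the frame we want from one clip

# One variable-length byte array per frame, grouped by video (illustrative layout).
with h5py.File('frames.h5', 'r') as f:
    blob = f['video_0017/frames'][frame_id]            # raw encoded bytes for one frame
    frame = np.asarray(Image.open(io.BytesIO(blob.tobytes())))  # decode on access
```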

selitvin commented 3 years ago

In my opinion, parquet (and hence petastorm) might work, but you must be aware of the following challenges that you would have to solve:

Parquet is not really built with images as a datatype in mind. This may create friction in your case that you won't be happy with.
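That said, petastorm does provide image codecs, so frames can be stored JPEG-compressed inside parquet. A rough sketch of the write side; the field names, frame shape, and Spark plumbing are placeholders you would adapt:

```python
import numpy as np
from pyspark.sql.types import IntegerType
from petastorm.codecs import CompressedImageCodec, ScalarCodec
from petastorm.etl.dataset_metadata import materialize_dataset
from petastorm.unischema import Unischema, UnischemaField, dict_to_spark_row

# Hypothetical schema: one row per frame, JPEG-compressed in the parquet file.
FrameSchema = Unischema('FrameSchema', [
    UnischemaField('video_id', np.int32, (), ScalarCodec(IntegerType()), False),
    UnischemaField('frame_id', np.int32, (), ScalarCodec(IntegerType()), False),
    UnischemaField('frame', np.uint8, (720, 1080, 3), CompressedImageCodec('jpeg'), False),
])

def write_frames(spark, rows, output_url='file:///tmp/frames_dataset'):
    # rows: iterable of dicts matching FrameSchema.
    # materialize_dataset adds the petastorm metadata that make_reader needs.
    with materialize_dataset(spark, output_url, FrameSchema, row_group_size_mb=128):
        rows_rdd = spark.sparkContext.parallelize(rows) \
            .map(lambda r: dict_to_spark_row(FrameSchema, r))
        spark.createDataFrame(rows_rdd, FrameSchema.as_spark_schema()) \
            .write.mode('overwrite').parquet(output_url)
```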

I haven't worked much with HDF5 myself. On paper, it looks like a good fit for your scenario. What performance challenges have you encountered?

2006pmach commented 3 years ago

Thanks for the detailed comments. So it really seems that parquet is not suitable for me then. Hm.

The challenge I faced with HDF5 is big fluctuations in image access time; sometimes a read takes 10x longer than usual, and I'm not sure what is causing that. I stored the images in binary format to keep the overall file size small, which I guess makes the indexing slower... I noticed similar issues with LMDB. So I thought maybe parquet could help, but it doesn't seem to be a better candidate. Thanks anyway!
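For reference, part of why I don't expect parquet to do better for single-frame reads: with plain pyarrow, fetching even one row still decodes the whole row group containing it, so per-row latency depends on row-group size rather than on the single row requested (the path and column names below are made up):

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile('/tmp/frames.parquet')  # hypothetical path

# Parquet reads are row-group granular: to get one frame we still
# decompress and decode every row of its row group for these columns.
table = pf.read_row_group(0, columns=['video_id', 'frame_id', 'frame'])
row = table.slice(0, 1).to_pydict()  # keep just the one row we wanted
```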