uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0

training from different sources #755

Open weidezhang opened 2 years ago

weidezhang commented 2 years ago

Hi, in our training we have data from multiple sources (such as simulated and real data), and the labels vary between them (for example, the simulation data has more kinds of labels). Both are used for training.

In that case, should we generate multiple Petastorm datasets and let Horovod load all of them? Or is it recommended to combine everything into one Petastorm dataset whose schema contains all possible kinds of labels?

Thanks,

Weide

selitvin commented 2 years ago

I think both could work, and each has its own pros and cons. It is hard to give a recommendation without knowing much more about your setup.

weidezhang commented 2 years ago

Hi Selitvin, in our current scenario we have 5 different datasets, each containing about 500k images. In 2 of the datasets we have two labels (in image format, representing disparity and segmentation), while in the other 3 we have only 1 label, representing disparity. The inputs are also images (say, PNG format).

The training task wants to sample evenly from each source.

Let me know if you need any more info.

weidezhang commented 2 years ago

@selitvin do you have any suggestions? Or can you give me some guidelines on when splitting into multiple Petastorm datasets makes sense and when combining into one Petastorm dataset makes sense? Thank you.

selitvin commented 2 years ago

The schema of these two data sources is different. Petastorm assumes the schema to be the same within the same dataset. In order to make them the same dataset you would need to ETL them into the same schema. Is this an option? Is it desirable?

You could have two reader objects pointing to two different datasets. You would be combining/sampling data manually then manipulating these two readers. Would that work?
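A minimal sketch of what combining two (or more) readers by hand could look like. It is plain Python and only assumes that each reader is an iterable of rows, which is how Petastorm readers are consumed. The function name `sample_evenly` and the `seed` parameter are hypothetical, introduced here for illustration; they are not part of Petastorm.

```python
import random


def sample_evenly(readers, seed=None):
    """Yield (source_index, row) pairs, choosing a source uniformly at random.

    `readers` is any sequence of iterables (for example, Petastorm readers).
    A source that runs out of rows is dropped and sampling continues over
    the remaining ones, so every row is yielded exactly once.
    """
    rng = random.Random(seed)
    # Pair each iterator with the index of the source it came from.
    active = [(index, iter(reader)) for index, reader in enumerate(readers)]
    while active:
        slot = rng.randrange(len(active))
        source_index, iterator = active[slot]
        try:
            row = next(iterator)
        except StopIteration:
            # This source is exhausted: drop it and keep sampling the rest.
            active.pop(slot)
            continue
        yield source_index, row
```

Because each row comes back tagged with its source index, the training loop can tell which label fields to expect (for example, disparity only versus disparity plus segmentation) even though the datasets have different schemas. Note that once a source is exhausted the mix is no longer even; an alternative under the same assumptions would be to stop at the first exhausted source or to open the readers so that they repeat.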

weidezhang commented 2 years ago

Hi @selitvin, the image resolutions in these two datasets are different, so they need 2 different schemas. So it looks like I have to make them two different datasets.
"You would be combining/sampling data manually then manipulating these two readers. Would that work?" <=== So you suggest I implement a meta loader that samples from each reader manually on the worker node? Could it be integrated into Petastorm?

selitvin commented 2 years ago

If the only difference in schema is the image resolution, then perhaps you could use variable dimensions, i.e. shape=(None, None, 3)? A kind of sampling loader already exists in petastorm: WeightedSamplingReader. I'm not sure it would work out of the box, but it may be worth taking a look.
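A rough sketch of how WeightedSamplingReader might be wired up for even sampling. The helper names `even_probabilities` and `open_even_sampler`, and whatever dataset URLs are passed in, are hypothetical. It assumes `WeightedSamplingReader(readers, probabilities)` from `petastorm.weighted_sampling_reader` and `make_reader(url, num_epochs=None)`. As far as I understand, the readers need to share a schema, so this would only apply after unifying the schemas (for example with variable image dimensions); treat that as an assumption to verify.

```python
def even_probabilities(num_sources):
    """Return one equal sampling weight per source (5 sources -> 0.2 each)."""
    if num_sources < 1:
        raise ValueError("need at least one source")
    return [1.0 / num_sources] * num_sources


def open_even_sampler(dataset_urls):
    """Open one reader per dataset URL and mix them with equal probability."""
    # Imported lazily so this module still loads where petastorm is absent.
    from petastorm import make_reader
    from petastorm.weighted_sampling_reader import WeightedSamplingReader

    # num_epochs=None asks each reader to cycle over its dataset indefinitely,
    # so no single source runs dry during training.
    readers = [make_reader(url, num_epochs=None) for url in dataset_urls]
    return WeightedSamplingReader(readers, even_probabilities(len(readers)))
```

The returned object is iterated like a regular reader, with each row drawn from one of the underlying readers according to the given probabilities.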