Open weidezhang opened 2 years ago
I think both could work, and each has its own pros and cons. It is hard to give a recommendation without knowing much more about your setup.
Hi Selitvin, in our current scenario we have 5 different datasets, each containing about 500k images. In 2 of the datasets we have two labels (in image format, representing disparity and segmentation), while in the other 3 we have only 1 label, representing disparity. The inputs are also images (say, PNG format).
The training task needs to sample evenly from each source.
Let me know if you need any more info.
@selitvin do you have any suggestions? Or can you give me some guidelines on when splitting into multiple petastorm datasets makes sense and when combining into one petastorm dataset makes sense? Thank you.
The schema of these two data sources is different. Petastorm assumes the schema to be the same within the same dataset. In order to make them the same dataset you would need to ETL them into the same schema. Is this an option? Is it desirable?
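To make the ETL option concrete, here is a minimal sketch of mapping records from different sources onto one union schema, with `None` standing in for labels a source does not have. The field names (`source`, `image`, `disparity`, `segmentation`) and the helper `to_common_record` are hypothetical and chosen only for illustration. In an actual Petastorm schema, a field that can be missing would presumably need to be declared with `nullable=True` on its `UnischemaField`.

```python
from typing import Any, Dict, Optional

# Hypothetical union of the fields found across all sources.
COMMON_FIELDS = ("source", "image", "disparity", "segmentation")


def to_common_record(source: str, record: Dict[str, Any]) -> Dict[str, Optional[Any]]:
    """Map a source-specific record onto the union schema.

    Labels that the source does not provide are filled with None.
    """
    unknown = set(record) - set(COMMON_FIELDS)
    if unknown:
        raise KeyError(f"unexpected fields: {sorted(unknown)}")
    row = {name: record.get(name) for name in COMMON_FIELDS}
    row["source"] = source  # remember where the sample came from
    return row


# A source with both labels and a source with disparity only.
full = to_common_record(
    "simulation",
    {"image": b"<png bytes>", "disparity": b"<png bytes>", "segmentation": b"<png bytes>"},
)
partial = to_common_record(
    "real", {"image": b"<png bytes>", "disparity": b"<png bytes>"}
)
```

The training code then has to check for `None` before using the segmentation label, which is the main cost of the single-dataset approach.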
You could have two reader objects pointing to two different datasets. You would be combining/sampling data manually then manipulating these two readers. Would that work?
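A rough sketch of what manual sampling over several readers could look like. It only assumes that each reader is a Python iterable, which Petastorm readers are. The function `sample_from_sources` is a hypothetical helper, not part of Petastorm, and plain lists stand in for the readers so the snippet runs on its own.

```python
import random
from typing import Any, Dict, Iterable, Iterator, Optional, Sequence, Tuple


def sample_from_sources(
    sources: Dict[str, Iterable[Any]],
    weights: Optional[Sequence[float]] = None,
    seed: Optional[int] = None,
) -> Iterator[Tuple[str, Any]]:
    """Yield (source_name, item) pairs, picking a source at random each step.

    With weights=None every source is equally likely. Iteration stops as
    soon as the chosen source is exhausted.
    """
    rng = random.Random(seed)
    names = list(sources)
    iterators = [iter(sources[name]) for name in names]
    probs = list(weights) if weights is not None else [1.0] * len(names)
    if len(probs) != len(names):
        raise ValueError("need exactly one weight per source")
    while True:
        idx = rng.choices(range(len(names)), weights=probs, k=1)[0]
        try:
            item = next(iterators[idx])
        except StopIteration:
            return  # one source ran dry, so end the mixed stream
        yield names[idx], item


# Stand-ins for two readers; real code would pass reader objects instead.
stand_in_readers = {
    "disparity_only": [{"id": i} for i in range(100)],
    "disparity_and_segmentation": [{"id": i} for i in range(100)],
}
mixed = sample_from_sources(stand_in_readers, seed=0)
first_name, first_row = next(mixed)
```

Because each item comes back tagged with its source name, the training loop can handle rows with different schemas separately. With real readers, opening them so that they repeat indefinitely (for example `num_epochs=None` in `make_reader`) would keep the stream from ending when the smaller dataset runs out.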
Hi @selitvin, the image resolution in these two datasets is different, so they need 2 different schemas. So it looks like I have to make them two different datasets.
"You would be combining/sampling data manually then manipulating these two readers. Would that work?" <=== So you suggest I implement a meta loader that samples from each reader manually in the worker node? Can it be integrated as part of petastorm?
If the only difference in schema is the image resolution, then perhaps you could use variable dimensions, i.e. shape=(None, None, 3)? A kind of sampling loader already exists in petastorm: WeightedSamplingReader. I'm not sure it would work out of the box, but it may be worth taking a look.
Hi, in our training we have data from multiple sources (like simulation and realistic) and the labels vary (for example, simulation has more labels of different kinds). Both are used for training.
In that case, should we generate multiple petastorm datasets and let horovod load them? Or is it recommended to combine all datasets into 1 petastorm dataset with a schema containing all possible kinds of labels?
Thanks,
Weide