uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0

question about data frame partition #654

Closed: weidezhang closed this issue 3 years ago

weidezhang commented 3 years ago

Hi,

I'm not sure if a GitHub issue is the right place to ask a question. When creating Petastorm Parquet data files, what is the suggested number of partitions? Is it better for the number of data frame partitions to match the number of distributed parallel training instances? Would that improve distributed training performance?

Thanks,

selitvin commented 3 years ago

Yep, this is a good place to ask the question.

Petastorm shards at the row-group level (there can be multiple row groups in the same partition, and even in the same Parquet file). If you want to split the data into orthogonal shards for distributed training, you should make sure the number of row groups is greater than the number of shards. You can control the number of rows in a row group explicitly by setting parquet.block.size in the Hadoop config. Partitioning the data can also have an effect on the row groups in your dataset, since it can further subdivide your row groups. Hope this helps.
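
For concreteness, here is a minimal sketch of both knobs: `materialize_dataset` takes a row-group size argument (it applies `parquet.block.size` to the Hadoop config while writing), and `make_reader` takes `cur_shard` / `shard_count` to read orthogonal shards. The schema, path, and sizes below are illustrative, not from this thread.

```python
import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

from petastorm import make_reader
from petastorm.codecs import ScalarCodec
from petastorm.etl.dataset_metadata import materialize_dataset
from petastorm.unischema import Unischema, UnischemaField, dict_to_spark_row

# Hypothetical single-field schema, just for illustration.
MySchema = Unischema('MySchema', [
    UnischemaField('value', np.int32, (), ScalarCodec(IntegerType()), False),
])

output_url = 'file:///tmp/my_dataset'  # illustrative location
spark = SparkSession.builder.master('local[2]').getOrCreate()
sc = spark.sparkContext

# materialize_dataset sets the row-group size (via parquet.block.size on the
# Hadoop config) for the duration of the write. Smaller row groups mean more
# of them, which gives the sharding logic more units to distribute.
with materialize_dataset(spark, output_url, MySchema, row_group_size_mb=64):
    rows_rdd = sc.parallelize(range(100000)) \
        .map(lambda i: dict_to_spark_row(MySchema, {'value': i}))
    spark.createDataFrame(rows_rdd, MySchema.as_spark_schema()) \
        .coalesce(4) \
        .write.mode('overwrite').parquet(output_url)

# Each of N training workers reads a disjoint shard by passing its rank as
# cur_shard and N as shard_count. Shards are assigned at row-group
# granularity, so shard_count should not exceed the number of row groups.
with make_reader(output_url, cur_shard=0, shard_count=4) as reader:
    for row in reader:
        pass  # feed `row` into the training loop
```

You can sanity-check the resulting row-group count with pyarrow (for example, `pyarrow.parquet.ParquetFile(path).num_row_groups` per file) to confirm it exceeds your shard count.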

weidezhang commented 3 years ago

Thanks a lot!