uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0

question about data frame partition #654

Closed: weidezhang closed this issue 3 years ago

weidezhang commented 3 years ago

Hi,

I'm not sure if a GitHub issue is the right place to ask a question. When creating Petastorm Parquet data files, what is the suggested number of partitions? Is it better for the number of data frame partitions to match the number of distributed parallel training instances? Would that improve distributed training performance?

Thanks,

selitvin commented 3 years ago

Yep, this is a good place to ask the question.

Petastorm shards at the row-group level (there can be multiple row groups in the same partition, and even in the same Parquet file). If you want to split the data into orthogonal shards for distributed training, you should make sure the number of row groups is greater than the number of shards. You can control the number of rows in a row group explicitly by setting parquet.block.size in the Hadoop config. Partitioning the data can also have an effect on the row groups in your dataset, since it can further subdivide your row groups. Hope this helps.
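
For concreteness, here is a minimal sketch of both knobs: `materialize_dataset` takes a row-group size argument (it applies `parquet.block.size` to the Hadoop config while writing), and `make_reader` takes `cur_shard` / `shard_count` to read orthogonal shards. The schema, path, and sizes below are illustrative, not from this thread.

```python
import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

from petastorm import make_reader
from petastorm.codecs import ScalarCodec
from petastorm.etl.dataset_metadata import materialize_dataset
from petastorm.unischema import Unischema, UnischemaField, dict_to_spark_row

# Hypothetical single-field schema, just for illustration.
MySchema = Unischema('MySchema', [
    UnischemaField('value', np.int32, (), ScalarCodec(IntegerType()), False),
])

output_url = 'file:///tmp/my_dataset'  # illustrative location
spark = SparkSession.builder.master('local[2]').getOrCreate()
sc = spark.sparkContext

# materialize_dataset sets the row-group size (via parquet.block.size on the
# Hadoop config) for the duration of the write. Smaller row groups mean more
# of them, which gives the sharding logic more units to distribute.
with materialize_dataset(spark, output_url, MySchema, row_group_size_mb=64):
    rows_rdd = sc.parallelize(range(100000)) \
        .map(lambda i: dict_to_spark_row(MySchema, {'value': i}))
    spark.createDataFrame(rows_rdd, MySchema.as_spark_schema()) \
        .coalesce(4) \
        .write.mode('overwrite').parquet(output_url)

# Each of N training workers reads a disjoint shard by passing its rank as
# cur_shard and N as shard_count. Shards are assigned at row-group
# granularity, so shard_count should not exceed the number of row groups.
with make_reader(output_url, cur_shard=0, shard_count=4) as reader:
    for row in reader:
        pass  # feed `row` into the training loop
```

You can sanity-check the resulting row-group count with pyarrow (for example, `pyarrow.parquet.ParquetFile(path).num_row_groups` per file) to confirm it exceeds your shard count.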

weidezhang commented 3 years ago

Thanks a lot!