uber / petastorm

Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.
Apache License 2.0

Sharding for distributed training #377

Closed un-knight closed 5 years ago

un-knight commented 5 years ago

The Uber petastorm blog says that petastorm supports sharding for distributed training, but I can't find the relevant API in the petastorm API documentation. Will this feature be supported in the future?

Besides, I want to know whether petastorm will support pytorch sampler functions and parallel transformations in the DataLoader?

selitvin commented 5 years ago

You can shard your data using the cur_shard and shard_count arguments of the make_reader function.

For example:

from petastorm import make_reader

with make_reader(url, cur_shard=3, shard_count=10, ...) as reader:
    for row in reader:
        # process rows belonging to this shard
        ...

you would get the content of shard #3 out of 10 shards. Typically, you would pass the rank of your trainer as cur_shard.

Note that the atomic sharding unit is a Parquet row-group, so the number of row-groups in the dataset must be greater than the value of the shard_count argument.
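To make the "pass your trainer rank" pattern concrete, here is a minimal sketch assuming Horovod is used to launch the distributed training job; the dataset URL is a placeholder:

import horovod.torch as hvd
from petastorm import make_reader

hvd.init()

# Each worker reads only its own shard: the worker rank selects the shard,
# and the world size gives the total number of shards.
with make_reader('hdfs:///path/to/dataset',
                 cur_shard=hvd.rank(),
                 shard_count=hvd.size()) as reader:
    for row in reader:
        pass  # feed rows into this worker's training loop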

We currently don't support pytorch sampling. The reason is that the pytorch sampling mechanism assumes the data source supports random row access, while the Parquet format assumes chunk-wise data access.

Parallel transformations are supported by specifying a transform_spec argument to the make_reader / make_batch_reader calls. You can find an example here: https://github.com/uber/petastorm#id7
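For illustration, here is a minimal sketch of the transform_spec mechanism, assuming a dataset that has an 'image' field; the field name, URL, and the transform itself are placeholders:

from petastorm import make_reader
from petastorm.transform import TransformSpec

def _transform_row(row):
    # With make_reader, the transform function receives a single decoded row (a dict).
    # Flip the image vertically; dtype and shape are unchanged, so no
    # edit_fields declaration is needed.
    row['image'] = row['image'][::-1, ...]
    return row

with make_reader('hdfs:///path/to/dataset',
                 transform_spec=TransformSpec(_transform_row)) as reader:
    for row in reader:
        pass  # rows arrive already transformed, computed on worker threads/processes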

un-knight commented 5 years ago

Thanks @selitvin ! One more question: as the Uber petastorm blog says, there are random sampling operations achieved by randomly selecting row-groups and by in-memory shuffling.


But I can't find the relevant API in the documentation either. I modified the pytorch mnist example to print out each image index and found that the indices appear in the same order in every epoch.

selitvin commented 5 years ago

Unfortunately, the in-memory data shuffling is implemented out-of-the-box only for Tensorflow users. We have PR #342, which did not materialize into a state ready for landing. We will work on proper shuffling support for pytorch; I would estimate it to be ready within a couple of weeks.

un-knight commented 5 years ago

@selitvin Thanks very much! I will wait for the pytorch shuffling support and give it a test.

selitvin commented 5 years ago

Can you please try PR #382? It adds shuffling functionality to our pytorch interface.
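As a usage sketch, assuming the shuffling knob that this work exposes on the pytorch DataLoader (shuffling_queue_capacity in later petastorm releases); the URL, batch size, and capacity are placeholders:

from petastorm import make_reader
from petastorm.pytorch import DataLoader

# Batches are drawn from an in-memory shuffling buffer instead of
# arriving in on-disk row-group order.
with DataLoader(make_reader('hdfs:///path/to/mnist/train'),
                batch_size=64,
                shuffling_queue_capacity=4096) as loader:
    for batch in loader:
        pass  # each batch is a dict of tensors drawn from the shuffle buffer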

selitvin commented 5 years ago

Curious: in your case, would you be using petastorm to read an existing parquet store with make_batch_reader, or are you creating a petastorm parquet store and using make_reader?

selitvin commented 5 years ago

Is there anything else I can help with on this issue? Closing it for now. Please reopen if more interaction on this topic is needed.