You can shard your data using the cur_shard and shard_count arguments of the make_reader function.
For example:
```python
with make_reader(url, cur_shard=3, shard_count=10, ...) as reader:
    for row in reader:
        # ....
```
you would get the content of shard #3 out of 10 shards. Typically, you would specify the rank of your trainer as the cur_shard.
Note that the atomic sharding unit is a Parquet row group: the number of Parquet row groups must be greater than the value of the shard_count argument.
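As a rough illustration, here is a minimal sketch of rank-based sharding, assuming a torch.distributed setup that has already been initialized; the dataset URL is a placeholder:

```python
# Minimal sketch: each trainer reads a disjoint shard of row groups.
# Assumes torch.distributed has already been initialized; the URL is a placeholder.
import torch.distributed as dist
from petastorm import make_reader

rank = dist.get_rank()              # this trainer's shard index
world_size = dist.get_world_size()  # total number of shards

with make_reader('hdfs:///path/to/dataset',
                 cur_shard=rank, shard_count=world_size) as reader:
    for row in reader:
        pass  # each trainer only sees row groups belonging to its shard
```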
We currently don't support PyTorch sampling. The reason is that the PyTorch sampling mechanism assumes the data source supports random row access, while the Parquet format assumes chunk-wise data access.
Parallel transformations are supported by specifying a transform_spec argument to the make_reader / make_batch_reader calls. You can find an example here: https://github.com/uber/petastorm#id7
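For reference, a rough sketch of what a transform_spec could look like; the 'image' field name and dataset URL are placeholders, and the exact dict-in/dict-out contract is described in the linked example:

```python
# Rough sketch of a parallel row transformation via TransformSpec.
# The 'image' field and the dataset URL are placeholders; the transform
# function runs in the reader's worker pool, so the work is parallelized.
import numpy as np
from petastorm import make_reader, TransformSpec

def _transform_row(row):
    row['image'] = row['image'].astype(np.float32) / 255.0  # normalize pixels
    return row

with make_reader('file:///tmp/my_dataset',
                 transform_spec=TransformSpec(_transform_row)) as reader:
    for row in reader:
        pass  # rows arrive already transformed
```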
Thanks @selitvin! One more question: the Uber Petastorm blog says that random sampling is achieved by randomly selecting row groups and shuffling in memory.
But I can't find the corresponding API in the documentation either, and when I modified the PyTorch MNIST example to print each image index, I found that the indices came in the same order in every epoch.
Unfortunately, in-memory data shuffling is currently implemented out-of-the-box only for TensorFlow users. We have PR #342, but it did not materialize into a state ready for landing. We will work on proper shuffling support for PyTorch; I would estimate it to be ready within a couple of weeks.
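For TensorFlow readers, the out-of-the-box path looks roughly like the sketch below; the shuffling_queue_capacity / min_after_dequeue values and the dataset URL are assumptions on my part, so please double-check against the tf_utils documentation:

```python
# Rough sketch of in-memory shuffling for TensorFlow (TF 1.x graph mode).
# Queue sizes and the dataset URL are placeholders.
import tensorflow as tf
from petastorm import make_reader
from petastorm.tf_utils import tf_tensors

with make_reader('file:///tmp/mnist_petastorm', num_epochs=1) as reader:
    row_tensors = tf_tensors(reader,
                             shuffling_queue_capacity=4096,
                             min_after_dequeue=1024)
    with tf.Session() as sess:
        sample = sess.run(row_tensors)  # rows are drawn from a shuffle queue
```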
@selitvin Thanks very much! I will wait for PyTorch shuffling support and give it a test.
Can you please try PR #382? It adds shuffling functionality to our PyTorch interface.
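If it helps, here is a minimal sketch of how that shuffling would be used from the PyTorch side, assuming the shuffling_queue_capacity argument from that PR and a placeholder dataset URL:

```python
# Minimal sketch of in-memory shuffling via Petastorm's PyTorch DataLoader.
# The shuffling_queue_capacity argument and the dataset URL are assumptions.
from petastorm import make_reader
from petastorm.pytorch import DataLoader

with make_reader('file:///tmp/mnist_petastorm', num_epochs=1) as reader:
    loader = DataLoader(reader, batch_size=64, shuffling_queue_capacity=4096)
    for batch in loader:
        pass  # batches are drawn from an in-memory shuffling buffer
```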
Curious: in your case, would you be using Petastorm to read an existing Parquet store with make_batch_reader, or are you creating a Petastorm Parquet store and using make_reader?
Is there anything else I can help with on this issue? Closing it for now. Please reopen if more interaction on this topic is needed.
The Uber Petastorm blog says that Petastorm supports sharding for distributed training, but I can't find the corresponding API in the Petastorm API documentation. Will this feature be supported in the future?
Besides, I would like to know whether Petastorm will support PyTorch sampler functions and parallel transformations in the DataLoader?