uber/petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.

Petastorm sharding and setting batch sizes #785

Open Data-drone opened 1 year ago

Data-drone commented 1 year ago

With sharding in Petastorm, i.e.:


with peta_conv_train_df.make_torch_dataloader(transform_spec=transform_func,
                                              num_epochs=1,
                                              batch_size=test_batch_size,
                                              cur_shard=curr_shard,
                                              shard_count=num_shards,
                                              reader_pool_type=pool_type) as reader:
    for batch in reader:
        ...  # training loop consumes this shard's batches

Is batch_size the batch size we want per GPU or for the whole cluster? I.e. in the above, if test_batch_size = 64, does each shard get batches of 64 rows, or 64 / num_shards?
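
For reference, a common way to wire these arguments in a multi-GPU job is to map one shard to each training process, deriving cur_shard and shard_count from the torch.distributed rank and world size. The sketch below assumes that convention and reuses peta_conv_train_df and transform_func from the snippet above; whether the effective global batch is then batch_size or batch_size * shard_count is exactly what this issue is asking:

import torch.distributed as dist

# Assumed wiring (requires dist.init_process_group() to have been called):
# one Petastorm shard per training process, so each GPU reads a disjoint
# slice of the Parquet dataset.
curr_shard = dist.get_rank()        # this process's shard index
num_shards = dist.get_world_size()  # total number of training processes

with peta_conv_train_df.make_torch_dataloader(transform_spec=transform_func,
                                              num_epochs=1,
                                              batch_size=64,
                                              cur_shard=curr_shard,
                                              shard_count=num_shards,
                                              reader_pool_type='thread') as reader:
    for batch in reader:
        # If batch_size is per shard, each rank sees batches of 64 rows and
        # the effective global batch per step is 64 * num_shards.
        pass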