uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.

Varying number of examples passed by DataLoader to Pytorch Lightning network #729

Open · trelium opened 2 years ago

trelium commented 2 years ago

While training a network implemented with PyTorch Lightning, the number of examples consumed in each training and validation loop varies across epochs. In other words, the epochs are uneven in size, and I verified that some examples are passed more than once to the training/validation routine within a given epoch.

I reproduced the behavior in this Python script. Running it as described in the README prints, for each epoch, the number of examples consumed by the network, as a dictionary of frequencies associated with the 10 MNIST classes.
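
For reference, a minimal sketch of the kind of per-epoch counting the script performs (the dataset URL is a placeholder and the `digit` field name is an assumption modeled on Petastorm's MNIST example; see the linked script for the actual code):

```python
from collections import Counter

from petastorm import make_reader
from petastorm.pytorch import DataLoader

# Hypothetical path to an MNIST dataset materialized as Parquet.
DATASET_URL = 'file:///tmp/mnist_train'

for epoch in range(3):
    # num_epochs=1: the reader should be exhausted exactly once per epoch.
    with DataLoader(make_reader(DATASET_URL, num_epochs=1),
                    batch_size=64) as loader:
        counts = Counter()
        for batch in loader:
            # 'digit' is the label field in Petastorm's MNIST example schema;
            # adjust for your own schema.
            counts.update(batch['digit'].tolist())
        print(f'epoch {epoch}: {sum(counts.values())} examples, '
              f'per class: {dict(counts)}')
```

With an even epoch, each class count and the total should be identical across the three printed lines.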

selitvin commented 2 years ago

I'm not sure I fully understand your scenario. Would limiting the number of epochs with the num_epochs argument you pass when creating a reader help? Are we talking about a distributed training scenario where you have multiple ranks and each rank gets a different number of samples?
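
For example, the number of passes can be capped at reader creation (a sketch with a placeholder dataset URL):

```python
from petastorm import make_reader
from petastorm.pytorch import DataLoader

# num_epochs bounds how many passes the reader makes over the dataset;
# num_epochs=None makes the reader repeat indefinitely.
reader = make_reader('file:///tmp/mnist_train', num_epochs=1)
with DataLoader(reader, batch_size=64) as loader:
    for batch in loader:
        pass  # one pass over the data, then the loader is exhausted
```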

trelium commented 2 years ago

I've already tried setting the num_epochs argument of the reader instance, but that didn't help. The behavior is not strictly tied to distributed training; in fact, it is observed even when training on a single GPU.