trelium opened this issue 2 years ago
I'm not sure I fully understand your scenario. Would limiting the number of epochs with the `num_epochs` argument you pass when creating a reader help? Are we talking about a distributed training scenario where you have multiple ranks and each rank gets a different number of samples?
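For example, a rough sketch of what I mean (assuming a Petastorm `make_reader`; the dataset URL below is just a placeholder):

```python
from petastorm import make_reader
from petastorm.pytorch import DataLoader

# num_epochs=1 makes the reader yield the dataset exactly once
with make_reader('file:///tmp/mnist_parquet/train', num_epochs=1) as reader:
    loader = DataLoader(reader, batch_size=64)
    for batch in loader:
        ...  # one pass over the training data
```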
I've already tried setting the `num_epochs` argument of the reader instance, but that didn't help.
The behavior is not strictly linked to distributed training; in fact, it is observed even when training on a single GPU.
While training a network implemented with PyTorch Lightning, the number of examples the network sees in each training and validation loop varies across epochs. In other words, the epochs are uneven in size, and I could verify that some examples are passed to the training/validation routine more than once within a single epoch.
I reproduced the behavior in this Python script. When run as specified in the README file, it prints the number of examples used by the network in each epoch as a dictionary of per-class frequencies for the 10 MNIST classes.
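The per-class tally is collected with logic along these lines (a simplified sketch, not the actual script; the module name, model, and loss computation are placeholders):

```python
from collections import Counter

import pytorch_lightning as pl
import torch
import torch.nn.functional as F


class CountingModule(pl.LightningModule):
    """Tallies how many examples of each MNIST class are seen per epoch."""

    def __init__(self, model):
        super().__init__()
        self.model = model
        self.epoch_counts = Counter()

    def training_step(self, batch, batch_idx):
        images, labels = batch
        # Count the labels seen in this batch
        self.epoch_counts.update(labels.tolist())
        return F.cross_entropy(self.model(images), labels)

    def on_train_epoch_end(self):
        # With a fixed training set, these counts should be identical
        # every epoch; in practice they drift from epoch to epoch.
        print(dict(sorted(self.epoch_counts.items())))
        self.epoch_counts.clear()

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```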