huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

DistributedSampler can't shuffle the dataset #3721

Closed · elk-cloner closed this 4 years ago

elk-cloner commented 4 years ago

🐛 Bug

Information

I'm trying to fine-tune BERT model using run_language_modeling.py.

Language I am using the model on: Persian

The problem arises when using:

The task I am working on is:

But according to this issue, there is a bug in torch.utils.data.distributed.DistributedSampler: the shuffle is never re-seeded between epochs, so every epoch iterates over the batches in the same order. To solve this, as in the official PyTorch example here, we should call train_sampler.set_epoch(epoch) before each new epoch, at this line.
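A minimal sketch of the pattern (a toy dataset and loop stand in for the real training code; the process group is assumed to be initialized, as it is under torch.distributed.launch):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy stand-in for the real training data.
train_dataset = TensorDataset(torch.arange(100))
train_sampler = DistributedSampler(train_dataset)
train_dataloader = DataLoader(train_dataset, batch_size=8, sampler=train_sampler)

for epoch in range(3):
    # DistributedSampler seeds its shuffle from its internal epoch counter;
    # without this call the counter stays at 0 and every epoch replays the
    # same permutation.
    train_sampler.set_epoch(epoch)
    for batch in train_dataloader:
        ...  # forward/backward pass goes here
```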

To reproduce

Steps to reproduce the behavior:

  1. Compare the batches drawn in different epochs, as in the issue mentioned above (see the sketch after this list).
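A single-process way to observe this (a sketch; passing num_replicas=1 and rank=0 explicitly means no process group needs to be initialized):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(100))
sampler = DistributedSampler(dataset, num_replicas=1, rank=0)
loader = DataLoader(dataset, batch_size=10, sampler=sampler)

first_batches = []
for epoch in range(2):
    # sampler.set_epoch(epoch)  # uncommenting this line fixes the repeat
    (batch,) = next(iter(loader))
    first_batches.append(batch.tolist())

# Without set_epoch, both epochs begin with the identical batch.
print(first_batches[0] == first_batches[1])  # prints: True
```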

Expected behavior

Environment info

julien-c commented 4 years ago

I think you are right

vlevit commented 4 years ago

Doesn't the same issue exist in other places too? E.g. in trainer.py: https://github.com/huggingface/transformers/blob/97a375484c618496691982f62518130f294bb9a8/src/transformers/trainer.py#L305-L307
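For reference, the linked lines build the sampler roughly like this (a paraphrase of the shape of the code, not the exact source), and nothing calls set_epoch() on it afterwards:

```python
import torch
from torch.utils.data import RandomSampler, TensorDataset
from torch.utils.data.distributed import DistributedSampler

train_dataset = TensorDataset(torch.arange(10))  # stand-in dataset
local_rank = -1  # -1 means single-process; >= 0 means distributed

# Single-process runs get RandomSampler, which reshuffles on every pass;
# distributed runs get DistributedSampler, which repeats the same shuffle
# unless set_epoch() is called between epochs.
train_sampler = (
    RandomSampler(train_dataset)
    if local_rank == -1
    else DistributedSampler(train_dataset)
)
```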

julien-c commented 4 years ago

I forgot to re-add this in Trainer when merging #3800

It's on my todo-list, but feel free to open a PR if you can do it faster than I can

vlevit commented 4 years ago

Great. Personally, I haven't yet upgraded to the newer version with trainer.py, so I'll leave it to you. Thanks.