lolalebreton opened this issue 1 month ago
Can you try updating your accelerate version to see if we fixed it in the prior releases?
Hi! Thank you for your answer. It gives the same results with accelerate 0.30.1
Hello! I am gently bumping this issue to ask whether you have had a chance to look into it.
Sorry for the delay. Zach is currently out of office but I'm sure he'll look into it when he's back.
System Info
Information
Tasks

- `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)

Reproduction
This is about training with a dataloader that shuffles at every epoch. For reproducibility, when resuming training the dataloader should yield batches in the same order as during the epoch where training was interrupted. However, calling `train_dataloader.set_epoch(epoch)` has no effect: the yielded sequence does not change, whatever epoch value is passed. Instead, when training is interrupted during epoch n, the resumed dataloader yields the sequence of epoch n+1.
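For context, the per-epoch reseeding that `set_epoch` is meant to provide can be sketched with plain Python: the shuffle order should be a pure function of (seed, epoch), which is how PyTorch's `DistributedSampler.set_epoch` works (it seeds the shuffle with `seed + epoch`). The helper name `epoch_order` below is hypothetical, and Python's `random` module is used here as a stand-in for a torch generator.

```python
import random

def epoch_order(seed: int, epoch: int, n: int) -> list:
    # Hypothetical helper: derive the shuffle order for a given epoch.
    # Seeding with seed + epoch makes the order deterministic per epoch,
    # so resuming at epoch n and calling set_epoch(n) should reproduce
    # exactly the order seen before the interruption.
    rng = random.Random(seed + epoch)
    order = list(range(n))
    rng.shuffle(order)
    return order

# Same (seed, epoch) always yields the same permutation.
assert epoch_order(0, 1, 10) == epoch_order(0, 1, 10)
```

Under this contract, the bug reported here would mean the prepared dataloader ignores the epoch argument and advances its internal counter instead.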
Here is a minimal example of outputs for a `DataLoader(list(range(10)), shuffle=True, batch_size=4)`:
Without resuming:
Epoch: 0 [6, 7, 1, 4] [2, 0, 9, 8] [3, 5]
Epoch: 1 [2, 4, 7, 0] [8, 9, 5, 3] [6, 1]
Epoch: 2 [7, 4, 5, 1] [9, 3, 8, 2] [0, 6]
With resuming after two steps:
Epoch: 0 [6, 7, 1, 4] [2, 0, 9, 8] *interruption
*resuming [6, 1]
Epoch: 1 [7, 4, 5, 1] [9, 3, 8, 2] [0, 6]
Code to reproduce
Expected behavior
The dataloader should yield identical sequences with or without resuming.