I can confirm the error happens when I create the new iterator with contra_loader_iter = iter(trn_contra_loader). As I decrease the batch size (more iterations per pass over the dataset), the error occurs later.
Also my environment info is:
transformers version: 4.11.2
This seems to happen during the seed synchronization of your dataloader (between all processes). Do you have a minimal reproducer I could look at?
@sgugger I tried to use some dummy data to reproduce the error, but I failed; it seems to require the exact data I have.
I added a few print statements to the code; is this helpful for you? My code is:
net, optimizer, trn_loader1, trn_loader2 = accelerator.prepare(net, optimizer, trn_loader1, trn_loader2)
loader2_iter = iter(trn_loader2)
for epoch in range(num_epoch):
    for step, batch in enumerate(trn_loader1):
        # train on data loader 1
        if (step + 1) % gradient_accumulation_steps == 0:
            # update for loader 1
            for ind in range(gradient_accumulation_steps):
                print(ind)
                try:
                    batch = next(loader2_iter)
                except StopIteration:
                    print('Prepare for new iterator!!!!!!')
                    loader2_iter = iter(trn_loader2)
                    print('Created new iterator!!!!!!')
                    batch = next(loader2_iter)
                # train on data loader 2
The output I got is:
0
0
Training: 13%|███████████ | 30/234 [01:45<06:44, 1.98s/it]
0
Training: 13%|███████████ | 31/234 [01:47<07:01, 2.08s/it]
1
1
1
2
0
21
2
3
3
2
4
Prepare for new iterator!!!!!!
Created new iterator!!!!!!
Traceback (most recent call last):
File "/home/yujianl/Media_bias_code/src/files/stance_detection/pretrain_ddp.py", line 292, in main
batch = next(contra_loader_iter)
StopIteration
So it seems the error occurs when one of the processes reaches the end of the iterator first, while the others are still training.
I tried to add accelerator.wait_for_everyone() before creating the new iterator, but the program just hangs there without any update:
try:
    batch = next(loader2_iter)
except StopIteration:
    print('Prepare for new iterator!!!!!!')
    accelerator.wait_for_everyone()
    loader2_iter = iter(trn_loader2)
    print('Created new iterator!!!!!!')
    batch = next(loader2_iter)
This gives me:
0
0
1
0
2
1
3
1
0
2
4
2
1
Prepare for new iterator!!!!!!
# nothing printed out
Please let me know if you need more information.
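For reference, the same interleaving can be written so that StopIteration is never raised at the call site, which at least keeps the restart logic in one place on every rank. This is only a sketch of that pattern, reusing the names from the snippet above (the cycle_forever helper is hypothetical, not from this thread), and not a confirmed fix for the seed-synchronization error:

def cycle_forever(loader):
    # Hypothetical helper: restart the loader inside the generator instead of
    # catching StopIteration at the call site. Each rank still calls
    # iter(loader) again whenever its shard is exhausted.
    while True:
        for batch in loader:
            yield batch

loader2_iter = cycle_forever(trn_loader2)
for epoch in range(num_epoch):
    for step, batch in enumerate(trn_loader1):
        ...  # train on data loader 1
        batch2 = next(loader2_iter)  # never raises StopIteration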
Like I said, I need a reproducible example in order to be able to debug this. I can't run the code sample you provided, as it's not complete.
Same error. Have you solved it?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi there, has there been any update on this issue? I met the same error.
I met the same error too.
Any update on this? I also ran into the same error.
Please give us a full reproducible example with the code, library versions, platform, and machine information. Only then will we be able to help.
The error I had was caused by wrong usage of accelerator.is_main_process and accelerator.wait_for_everyone(). I did something like:
if accelerator.is_main_process:
    # save model
    accelerator.wait_for_everyone()
    model = accelerator.unwrap_model(model)
    ...
The issue here is that the other processes would never get to execute accelerator.wait_for_everyone(), and the main process would throw a timeout error after waiting for a while.
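The fix is to make sure every process reaches the synchronization point, and only then let the main process save. A minimal sketch of that arrangement (the checkpoint path is just a placeholder):

import torch

accelerator.wait_for_everyone()                    # every process reaches this
unwrapped_model = accelerator.unwrap_model(model)
if accelerator.is_main_process:                    # property, not a method call
    torch.save(unwrapped_model.state_dict(), "checkpoint.pt")  # placeholder path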
Problem:
I was trying to use accelerator.is_main_process to restrict inference to the main process in a distributed setting, which caused the processes to hang or resulted in the following error:
[rank1]: generator.set_state(rng_state)
[rank1]: RuntimeError: Invalid mt19937 state
NOTE THAT the data loader (dataloader_test) was not prepared with accelerator!
So if I write:
if accelerator.is_main_process:
    with torch.no_grad():
        preds, confidences_image = infer(model, dataloader_test)
        print("preds: ", preds)
accelerator.wait_for_everyone()
In this case, only the main process (rank 0) ran the inference, while the other processes waited indefinitely or raised the Invalid mt19937 state error, because the data loader (dataloader_test) was not prepared with accelerator, unlike the model. This likely caused desynchronization between the processes, leading to the hang or runtime error.
Solution: let all processes run the inference step, rather than limiting it to just the main process:
with torch.no_grad():
    preds, confidences_image = infer(model, dataloader_test)
    print("preds: ", preds)
By allowing all processes to participate in the inference, the program executed correctly, and no processes got stuck. The inference step no longer relied solely on the main process, avoiding desynchronization issues.
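If the predictions also need to end up in one place, a variant that prepares the test loader and gathers across ranks looks roughly like this. It is only a sketch under a few assumptions: dataloader_test is prepared alongside the model, gather_for_metrics is available in your accelerate version, and the model/batch handling (outputs.logits, **batch) matches your setup:

import torch

dataloader_test = accelerator.prepare(dataloader_test)

model.eval()
all_preds = []
with torch.no_grad():
    for batch in dataloader_test:
        outputs = model(**batch)                # adapt to your model's inputs
        preds = outputs.logits.argmax(dim=-1)   # adapt to your model's outputs
        all_preds.append(accelerator.gather_for_metrics(preds))
all_preds = torch.cat(all_preds)

if accelerator.is_main_process:
    print("preds:", all_preds)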
Hope my experience helps :)
I got this error when I'm using accelerate:
My config file is:
It seems to trace down to these few lines in my code. I did something like this to iterate through a data loader (because I need to iterate two different dataloaders):
My trn_contra_loader is prepared by accelerator. Interestingly, when I run this code outside of tmux (I got the error when running in tmux), the process hangs at 30/234 instead of giving me the error. I don't know how to solve this; does anyone have any thoughts?
Many thanks!