huggingface / accelerate

πŸš€ A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Invalid mt19937 state #190

Closed yujianll closed 2 years ago

yujianll commented 3 years ago

I got this error when using accelerate:

[W reducer.cpp:1158] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
(the warning above is printed once by each of the four processes)
Training:  13%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ                                                                       | 30/234 [01:12<05:52,  1.73s/it]
Traceback (most recent call last):                                                                                                
  File "/home/yujianl/Media_bias_code/src/files/stance_detection/pretrain_ddp.py", line 289, in main                              
    batch = next(contra_loader_iter)                                                                                              
StopIteration                                                                                                                     

During handling of the above exception, another exception occurred:                                                               

Traceback (most recent call last):
  File "/home/yujianl/Media_bias_code/src/files/stance_detection/pretrain_ddp.py", line 363, in <module>
    main()
  File "/home/yujianl/Media_bias_code/src/files/stance_detection/pretrain_ddp.py", line 292, in main
    batch = next(contra_loader_iter)
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/site-packages/accelerate/data_loader.py", line 301, in __iter__
    synchronize_rng_states(self.rng_types, self.generator)
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/site-packages/accelerate/utils.py", line 110, in synchronize_rng_sta
tes
    synchronize_rng_state(RNGType(rng_type), generator=generator) 
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/site-packages/accelerate/utils.py", line 105, in synchronize_rng_sta
te
    generator.set_state(rng_state)
RuntimeError: Invalid mt19937 state
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 580655) of binary: /home/yujianl/anac
onda3/envs/media_bias/bin/python 
Traceback (most recent call last):
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/site-packages/torch/distributed/run.py", line 689, in run
    elastic_launch(
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 116, in __cal
l__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 244, in launc
h_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
*****************************************************
  src/files/stance_detection/pretrain_ddp.py FAILED  
=====================================================
Root Cause:
[0]:
  time: 2021-10-20_12:19:56
  rank: 1 (local_rank: 1)
  exitcode: 1 (pid: 580655)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
=====================================================
Other Failures:
  <NO_OTHER_FAILURES>

Traceback (most recent call last):
  File "/home/yujianl/anaconda3/envs/media_bias/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 41, in ma
in
    args.func(args)
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/site-packages/accelerate/commands/launch.py", line 378, in launch_co
mmand
    multi_gpu_launcher(args)
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/site-packages/accelerate/commands/launch.py", line 176, in multi_gpu
_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/yujianl/anaconda3/envs/media_bias/bin/python', '-m', 'torch.distributed.launch', '
--use_env', '--nproc_per_node', '4', 'src/files/stance_detection/pretrain_ddp.py', '--mlm_train_files', '/data/yujian/data/online_
news/pretrain/val_left.json', '/data/yujian/data/online_news/pretrain/val_center.json', '/data/yujian/data/online_news/pretrain/va
l_right.json', '--mlm_val_files', '/data/yujian/data/online_news/pretrain/val_left.json', '/data/yujian/data/online_news/pretrain/
val_center.json', '/data/yujian/data/online_news/pretrain/val_right.json', '--contra_train_file', '/data/yujian/data/online_news/p
retrain/alignment/match_val.json', '--contra_val_file', '/data/yujian/data/online_news/pretrain/alignment/match_val.json', '--per_
gpu_mlm_train_batch_size', '32', '--per_gpu_mlm_eval_batch_size', '32', '--per_gpu_contra_train_batch_size', '16', '--per_gpu_cont
ra_eval_batch_size', '16', '--mlm_learning_rate', '0.0005', '--contra_learning_rate', '0.0005', '--weight_decay', '0.01', '--num_t
rain_epochs', '3', '--logging_steps', '32', '--model_name', 'roberta-base', '--mlm_gradient_accumulation_steps', '16', '--contra_g
radient_accumulation_steps', '16', '--output_path', '/data/yujian/models/stance_detection/news_ent_sent_contra_ideo_story_roberta_
base.pt', '--use_contrast', '--contrast_alpha', '0.5', '--ideo_margin', '0.5', '--story_margin', '1.0', '--n_gpu', '8', '--data_pr
ocess_worker', '2', '--max_grad_norm', '1.0', '--use_gpu', '--do_train', '--mask_entity', '--mask_sentiment', '--max_train_steps',
 '3', '--lexicon_dir', '/data/yujian/data/online_news/pretrain/lexicon']' returned non-zero exit status 1.
*****************************************************

My config file is:

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
fp16: false
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
num_machines: 1
num_processes: 4

It seems to trace back to these few lines in my code. I did something like this to iterate through a data loader (because I need to iterate over two different dataloaders):

try:
    batch = next(contra_loader_iter)
except StopIteration:
    contra_loader_iter = iter(trn_contra_loader)
    batch = next(contra_loader_iter)

My trn_contra_loader is prepared by the accelerator. Interestingly, when I run this code outside of tmux (I got the error when running in tmux), the process hangs at 30/234 instead of giving me the error.

I don't know how to solve this, does anyone have any thoughts?

Many thanks!

yujianll commented 3 years ago

I can confirm the error happens when I create the new iterator contra_loader_iter = iter(trn_contra_loader). As I decrease the batch size (more iterations for one pass of the dataset), the error occurs later.

Also my environment info is:

sgugger commented 3 years ago

This seems to happen during the seed synchronization of your dataloader (between all processes). Do you have a minimal reproducer I could look at?
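
For context on what that synchronization does: each time an iterator is created over a prepared dataloader, an RNG state is broadcast from the main process and applied on every other process with generator.set_state (that is the set_state call in the traceback). A minimal sketch of the idea, not the library's actual code, assuming torch.distributed has already been initialized with a CPU-capable backend such as gloo:

import torch
import torch.distributed as dist

def sync_generator_state(generator: torch.Generator) -> None:
    # Rank 0's RNG state is the reference.
    state = generator.get_state()  # uint8 CPU tensor holding the mt19937 state
    # Every rank must enter this collective at the same point in the program.
    # If one rank calls it while the others are inside a different collective
    # (e.g. a DDP gradient all-reduce), the data received here can be corrupted
    # and set_state then fails with "RuntimeError: Invalid mt19937 state".
    dist.broadcast(state, src=0)
    generator.set_state(state)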

yujianll commented 3 years ago

@sgugger I tried to use some dummy data to reproduce the error, but I failed; it seems it needs to be the same data as what I have.

I added a few print statements in the code; is this helpful for you? My code is:

net, optimizer, trn_loader1, trn_loader2 = accelerator.prepare(net, optimizer, trn_loader1, trn_loader2)
loader2_iter = iter(trn_loader2)
for epoch in range(num_epoch):
    for batch in trn_loader1:
        # train on data loader 1
        if (step + 1) % gradient_accumulation_steps == 0:
            # update for loader 1
            for ind in range(gradient_accumulation_steps):
                print(ind)
                try:
                    batch = next(loader2_iter)
                except StopIteration:
                    print('Prepare for new iterator!!!!!!')
                    loader2_iter = iter(trn_loader2)
                    print('Created new iterator!!!!!!')
                    batch = next(loader2_iter)
                # train on data loader 2

The output I got is:

0                                                                                                                                 
0                                                                                                                                 
Training:  13%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ                                                                       | 30/234 [01:45<06:44,  1.98s/it]
0                                                                                                                                 
Training:  13%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š                                                                       | 31/234 [01:47<07:01,  2.08s/it]
1                                                                                                                                 
1                                                                                                                                 
1                                                                                                                                 
2                                                                                                                                 
0                                                                                                                                 
21                                                                                                                                

2                                                                                                                                 
3                                                                                                                                 
3                                                                                                                                 
2                                                                                                                                 
4                                                                                                                                 
Prepare for new iterator!!!!!!                                                                                                    
Created new iterator!!!!!!                                                                                                        
Traceback (most recent call last):                                                                                                
  File "/home/yujianl/Media_bias_code/src/files/stance_detection/pretrain_ddp.py", line 292, in main                              
    batch = next(contra_loader_iter)                                                                                              
StopIteration                                                                                                                     

So it seems the error occurs when one of the processes reaches that point first while the others are still training.

I tried to add accelerator.wait_for_everyone() before creating the new iterator, but the program just hangs there without any update.

try:
    batch = next(loader2_iter)
except StopIteration:
    print('Prepare for new iterator!!!!!!')
    accelerator.wait_for_everyone()
    loader2_iter = iter(trn_loader2)
    print('Created new iterator!!!!!!')
    batch = next(loader2_iter)

This gives me:

0
0
1
0
2
1
3
1
0
2
4
2
1
Prepare for new iterator!!!!!!
# nothing printed out

Please let me know if you need more information.
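
For anyone hitting the same thing: the constraint seems to be that every process has to call iter() on the prepared loader at the same step, because each iter() triggers that collective RNG synchronization. A sketch of a workaround along those lines (untested against my script; SyncedCycler is just an illustrative helper, not part of accelerate) is to recreate the iterator on a fixed, rank-independent schedule instead of reacting to StopIteration:

class SyncedCycler:
    # Cycle over a prepared dataloader, recreating the iterator after a fixed
    # number of batches so that every rank re-enters iter() (and the RNG
    # broadcast inside it) at the same training step.
    def __init__(self, dataloader):
        self.dataloader = dataloader
        self.iterator = iter(dataloader)
        self.consumed = 0

    def next_batch(self):
        if self.consumed == len(self.dataloader):  # same count on every rank
            self.iterator = iter(self.dataloader)  # all ranks re-sync RNG together
            self.consumed = 0
        self.consumed += 1
        return next(self.iterator)

# usage: loader2 = SyncedCycler(trn_loader2); batch = loader2.next_batch()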

sgugger commented 3 years ago

Like I said, I need a reproducible example in order to be able to debug this. I can't run the code sample you provided, as it's not complete.

zhhongzhi commented 2 years ago

Same error. Have you solved it?

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Buzz-Beater commented 1 year ago

Hi there, has there been any update on this issue? I ran into the same error.

chatAGT commented 1 year ago

I ran into the same error, too.

Alikerin commented 1 year ago

Any update on this? I also ran into the same error.

muellerzr commented 1 year ago

Please give us a full reproducible example with the code, library versions, platform, and machine information. Only then will we be able to help.

Alikerin commented 1 year ago

The error I had was caused by the wrong usage of accelerator.is_main_process() and accelerator.wait_for_everyone(). I did something like:

if accelerator.is_main_process():
  # save model
  accelerator.wait_for_everyone()
  model = accelerator.unwrap_model(model)
  ...

The issue here is that the other processes would never get to execute accelerator.wait_for_everyone() and the main process would throw a timeout error after waiting for a while.
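
For reference, the usual ordering is the reverse: every process reaches the barrier first, and only then does the main process save (also note that is_main_process is a property, so it is accessed without parentheses). A minimal sketch, with a toy model and a placeholder path standing in for the real ones:

import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = accelerator.prepare(torch.nn.Linear(4, 2))  # stand-in for the real model

# ... training ...

accelerator.wait_for_everyone()                      # every rank must call this
unwrapped_model = accelerator.unwrap_model(model)    # safe to call on every rank
if accelerator.is_main_process:                      # property, no parentheses
    accelerator.save(unwrapped_model.state_dict(), "model.pt")  # placeholder path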

vegetable-lion commented 1 month ago

Problem: I was trying to use is_main_process to handle inference in a distributed setting, which caused the processes to hang or raised the following error:

[rank1]: generator.set_state(rng_state)
[rank1]: RuntimeError: Invalid mt19937 state

NOTE THAT the data loader (dataloader_test) was not prepared with accelerator!

So if I write:

if accelerator.is_main_process:
    with torch.no_grad():
        preds, confidences_image = infer(model, dataloader_test)
        print("preds: ", preds)
accelerator.wait_for_everyone() 

In this case, only the main process (rank 0) was running the inference, but the other processes were waiting indefinitely or raising the Invalid mt19937 state error because the data loader (dataloader_test) was not prepared with accelerator, unlike the model. This likely caused desynchronization between the processes, leading to the hang or runtime error.

Solution: The solution was to let all processes run the inference step, rather than limiting it to just the main process:

with torch.no_grad():
    preds, confidences_image = infer(model, dataloader_test)
    print("preds: ", preds)

By allowing all processes to participate in the inference, the program executed correctly, and no processes got stuck. The inference step no longer relied solely on the main process, avoiding desynchronization issues.

Hope my experience helps :)
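
A related alternative, in case running the full test set on every process is too slow: prepare dataloader_test as well, so each rank only sees its own shard, then gather the per-rank results at the end. A sketch assuming the predictions are tensors and a reasonably recent Accelerate (gather_for_metrics trims the samples duplicated when the last batch is padded); the batch unpacking is hypothetical and depends on your dataset:

model, dataloader_test = accelerator.prepare(model, dataloader_test)

model.eval()
all_preds = []
with torch.no_grad():
    for batch in dataloader_test:          # each rank iterates only its shard
        inputs, _ = batch                  # hypothetical (inputs, labels) batches
        logits = model(inputs)
        all_preds.append(accelerator.gather_for_metrics(logits.argmax(dim=-1)))
preds = torch.cat(all_preds)               # full test-set predictions on every rank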