huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0
134.66k stars 26.93k forks

Deepspeed Wav2vec xlsr bug #15330

Closed flozi00 closed 2 years ago

flozi00 commented 2 years ago

Environment info

Who can help

@stas00 @patrickvonplaten @anton-l

Information

Model I am using (Bert, XLNet ...):

The problem arises when using:

The tasks I am working on is:

To reproduce

the deepspeed config is the same as used in the tests in this repo
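For reference, the DeepSpeed configs used in the transformers tests look roughly like the following ZeRO stage-2 fp16 config (illustrative, not the exact file used in this run; the `"auto"` values are filled in by the Trainer integration):

```json
{
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```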

0%|                                                                                             | 1/23536 [00:02<19:09:26,  2.93s/it][2022-01-25 18:45:36,939] [INFO] [stage_1_and_2.py:1644:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 65536
  0%|                                                                                             | 2/23536 [00:06<19:47:41,  3.03s/it][2022-01-25 18:45:40,036] [INFO] [stage_1_and_2.py:1644:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768.0
  0%|                                                                                             | 5/23536 [00:18<25:50:42,  3.95s/it][2022-01-25 18:45:55,473] [INFO] [stage_1_and_2.py:1644:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
  0%|                                                                                             | 7/23536 [00:24<22:18:14,  3.41s/it][2022-01-25 18:45:58,619] [INFO] [stage_1_and_2.py:1644:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
  0%|                                                                                             | 8/23536 [00:27<20:39:24,  3.16s/it][2022-01-25 18:46:01,240] [INFO] [stage_1_and_2.py:1644:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8192.0, reducing to 4096.0
  0%|                                                                                            | 10/23536 [00:35<24:11:12,  3.70s/it]{'loss': 0.0, 'learning_rate': 3e-05, 'epoch': 0.0}
  0%|                                                                                            | 11/23536 [00:39<25:07:36,  3.85s/it]Traceback (most recent call last):
  File "/home/aware/projects/asr/run_speech_recognition_ctc.py", line 742, in <module>
    main()
  File "/home/aware/projects/asr/run_speech_recognition_ctc.py", line 696, in main
    train_result = trainer.train()
  File "/home/aware/anaconda3/envs/asr/lib/python3.9/site-packages/transformers/trainer.py", line 1365, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/home/aware/anaconda3/envs/asr/lib/python3.9/site-packages/transformers/trainer.py", line 1940, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/aware/anaconda3/envs/asr/lib/python3.9/site-packages/transformers/trainer.py", line 1972, in compute_loss
    outputs = model(**inputs)
  File "/home/aware/anaconda3/envs/asr/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/aware/anaconda3/envs/asr/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1588, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/aware/anaconda3/envs/asr/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/aware/anaconda3/envs/asr/lib/python3.9/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 1755, in forward
    loss = nn.functional.ctc_loss(
  File "/home/aware/anaconda3/envs/asr/lib/python3.9/site-packages/torch/nn/functional.py", line 2460, in ctc_loss
    return torch.ctc_loss(
RuntimeError: CUDA error: an illegal memory access was encountered
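For posterity: an illegal memory access inside `torch.ctc_loss` is often caused by invalid inputs (target lengths exceeding input lengths, label ids out of vocabulary range, or the blank id appearing in the targets) rather than by the training framework. A minimal sanity-check sketch, with illustrative helper and tensor names that are not from the actual script:

```python
import torch
import torch.nn.functional as F

def check_ctc_inputs(log_probs, targets, input_lengths, target_lengths, blank=0):
    """Illustrative pre-flight checks for F.ctc_loss inputs."""
    T, N, C = log_probs.shape  # (time, batch, classes)
    assert input_lengths.max() <= T, "input_lengths exceed the time dimension"
    assert (target_lengths <= input_lengths).all(), "CTC needs input_len >= target_len"
    assert 0 <= targets.min() and targets.max() < C, "label id out of range"
    assert (targets != blank).all(), "targets must not contain the blank index"

# Dummy batch: 50 frames, batch of 2, 32-class vocabulary, 10-label targets.
T, N, C, S = 50, 2, 32, 10
log_probs = torch.randn(T, N, C).log_softmax(-1)
targets = torch.randint(1, C, (N, S))          # blank=0 excluded
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

check_ctc_inputs(log_probs, targets, input_lengths, target_lengths)
loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=0)
print(loss)
```

Running with `CUDA_LAUNCH_BLOCKING=1` also helps, since CUDA errors are reported asynchronously and the stack trace may point at the wrong op.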

Expected behavior

stas00 commented 2 years ago

It's very possible the problem is not related to Deepspeed, since it fails inside modeling_wav2vec2.py, but it could just as well be related.

I don't think any of these newly added scripts were ever tested with Deepspeed, so I have no idea whether it's supposed to work or not. I don't know why the tests weren't ported out of research_projects, so they never run.

The test that I wrote: examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py tests examples/research_projects/wav2vec2/run_asr.py, so none of the new examples/pytorch/speech-*/run* are being tested with Deepspeed.

It should be very easy to create new tests that exercise the new functionality in the wav2vec2 domain, based on the test I wrote, by swapping in the new example scripts to replace run_asr.py and adjusting the cmd line args. At this moment I have zero free time to do that, but if someone tries and runs into problems, please ping me and I will try to help. It should be a trivial task, since the test just verifies that it can train/validate and doesn't do anything fancy: it's literally changing the example script name and adjusting the cmd line args for the new scripts.
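The kind of adaptation described above boils down to rebuilding the launcher command line for the new script. A minimal sketch (the script path, config name, and args below are placeholders, not the actual test values):

```python
# Sketch of assembling the deepspeed launcher invocation for an example
# script, as the existing wav2vec2 deepspeed test does via subprocess.
def build_launch_cmd(script, ds_config, num_gpus=1, extra_args=()):
    """Build the deepspeed launcher command line for an example script."""
    return [
        "deepspeed",
        f"--num_gpus={num_gpus}",
        script,
        "--deepspeed", ds_config,
        *extra_args,
    ]

cmd = build_launch_cmd(
    "examples/pytorch/speech-recognition/run_speech_recognition_ctc.py",
    "ds_config_zero2.json",
    extra_args=["--do_train", "--do_eval"],
)
print(" ".join(cmd))
```

The assembled list would then be passed to something like `subprocess.run(cmd)` inside the test.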

Remember that anything under examples/research_projects is ignored under CI. So you want deepspeed tests outside of examples/research_projects.

The only thing I can vouch for is examples/research_projects/wav2vec2/run_asr.py, since the tests all pass (at least on my machine, as of this writing) with transformers@master.

$ RUN_SLOW=1 CUDA_VISIBLE_DEVICES=0 pyt examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py
PASSED examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py::TestDeepSpeedWav2Vec2::test_fp16_distributed_zero2_base
PASSED examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py::TestDeepSpeedWav2Vec2::test_fp16_distributed_zero2_robust
PASSED examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py::TestDeepSpeedWav2Vec2::test_fp16_distributed_zero3_base
PASSED examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py::TestDeepSpeedWav2Vec2::test_fp16_distributed_zero3_robust
PASSED examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py::TestDeepSpeedWav2Vec2::test_fp16_non_distributed_zero2_base
PASSED examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py::TestDeepSpeedWav2Vec2::test_fp16_non_distributed_zero2_robust
PASSED examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py::TestDeepSpeedWav2Vec2::test_fp16_non_distributed_zero3_base
PASSED examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py::TestDeepSpeedWav2Vec2::test_fp16_non_distributed_zero3_robust
PASSED examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py::TestDeepSpeedWav2Vec2::test_fp32_distributed_zero2_base
PASSED examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py::TestDeepSpeedWav2Vec2::test_fp32_distributed_zero2_robust
PASSED examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py::TestDeepSpeedWav2Vec2::test_fp32_distributed_zero3_base
PASSED examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py::TestDeepSpeedWav2Vec2::test_fp32_distributed_zero3_robust
PASSED examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py::TestDeepSpeedWav2Vec2::test_fp32_non_distributed_zero2_base
PASSED examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py::TestDeepSpeedWav2Vec2::test_fp32_non_distributed_zero2_robust
PASSED examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py::TestDeepSpeedWav2Vec2::test_fp32_non_distributed_zero3_base
PASSED examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py::TestDeepSpeedWav2Vec2::test_fp32_non_distributed_zero3_robust
SKIPPED [2] ../../../../../home/stas/anaconda3/envs/py38-pt110/lib/python3.8/unittest/case.py:118: test requires multiple GPUs
flozi00 commented 2 years ago

I am not using the research folder; the script is from examples/pytorch/speech-recognition

stas00 commented 2 years ago

That's exactly what I was trying to say. When I ported wav2vec2 to work with Deepspeed, I wrote a set of tests to validate that it keeps working.

When further work on wav2vec2 was done, those tests weren't adapted to the new scripts. So I have no idea whether the new functionality requires changes in the model, or whether the error you encountered has nothing to do with Deepspeed at all.

Bottom line: let's wait for @anton-l or @patrickvonplaten to follow up since they are the maintainers of this "domain" and perhaps they have encountered this issue outside of Deepspeed.

If not then the new examples need to be tested first under Deepspeed to ensure that the model works.

patrickvonplaten commented 2 years ago

It would indeed be very nice to add tests for DeepSpeed and the official speech recognition examples. I think I've kinda dropped the ball here. Thanks a lot for opening the PR - I'll help you through it @flozi00 :-)

flozi00 commented 2 years ago

I have found the error: when I removed apex as the fp16 backend, everything worked again.
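For anyone hitting the same thing: the backend is selected via the Trainer's `--fp16_backend` flag (in transformers versions of this era it accepts `auto`, `amp`, or `apex`). An illustrative invocation using the native amp backend instead of apex; the config filename and remaining args are placeholders:

```shell
# Illustrative: select the native torch amp backend instead of apex.
python run_speech_recognition_ctc.py \
    --fp16 \
    --fp16_backend amp \
    --deepspeed ds_config.json \
    --output_dir ./out \
    --do_train
```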

stas00 commented 2 years ago

great to hear that you found a solution, @flozi00

perhaps you could share the failing ds_config and cmd line for posterity? you said it's the standard ds config, so I wonder how apex was getting activated. Thank you.

Also, I'm not even testing apex with deepspeed, as it's kind of pointless since amp is better, but perhaps someone with an old pytorch will want it... Perhaps I could test that.

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.