Closed: flozi00 closed this issue 2 years ago
It's very possible the problem is not related to DeepSpeed, since it fails inside modeling_wav2vec2.py, but it could be related just as well.
I don't think any of these newly added scripts were ever tested with DeepSpeed, so I have no idea whether it's supposed to work or not. I don't know why the tests weren't ported out of research_projects, so they never run.
The test that I wrote, examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py, tests examples/research_projects/wav2vec2/run_asr.py, so none of the new examples/pytorch/speech-*/run* scripts are being tested with DeepSpeed.
It should be very easy to create new tests that exercise the new functionality in the wav2vec2 domain based on the test I wrote: just swap in the new example scripts to replace run_asr.py and adjust the command-line args. At this moment I have zero free time to do that, but if someone tries and runs into problems, please ping me and I will try to help. It should be a trivial task, since the test just verifies that training/validation runs and doesn't do anything fancy. So it's literally changing the example script name and adjusting the command-line args to adapt to the new scripts.
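The swap described above can be sketched as a small helper that assembles the launcher command, so that only the script path and its args change between the old and new tests. This is an illustrative sketch, not the actual test code; `build_deepspeed_cmd` is a hypothetical name, and the script and config paths shown are assumptions based on the layout discussed in this thread.

```python
# Hypothetical helper sketching how the existing wav2vec2 DeepSpeed test
# could be pointed at a new example script: only the script name and its
# CLI args change; the launcher invocation stays the same.

def build_deepspeed_cmd(script, ds_config, num_gpus=1, extra_args=()):
    """Assemble a `deepspeed` launcher command for an example script."""
    cmd = [
        "deepspeed",
        f"--num_gpus={num_gpus}",
        script,
        f"--deepspeed={ds_config}",
        "--do_train",
        "--do_eval",
    ]
    cmd.extend(extra_args)
    return cmd

# Swapping in one of the newer example scripts is just a different first
# positional argument plus its own args:
cmd = build_deepspeed_cmd(
    "examples/pytorch/speech-recognition/run_speech_recognition_ctc.py",
    "tests/deepspeed/ds_config_zero2.json",
    extra_args=["--max_steps=10", "--fp16"],
)
```

In a real test, this command list would then be handed to a subprocess runner and the test would assert that training completes without error.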
Remember that anything under examples/research_projects is ignored by CI. So you want the DeepSpeed tests outside of examples/research_projects.
The only thing I can vouch for is examples/research_projects/wav2vec2/run_asr.py, since the tests all pass, at least on my machine as of this writing, with transformers@master:
$ RUN_SLOW=1 CUDA_VISIBLE_DEVICES=0 pyt examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py
PASSED examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py::TestDeepSpeedWav2Vec2::test_fp16_distributed_zero2_base
PASSED examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py::TestDeepSpeedWav2Vec2::test_fp16_distributed_zero2_robust
PASSED examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py::TestDeepSpeedWav2Vec2::test_fp16_distributed_zero3_base
PASSED examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py::TestDeepSpeedWav2Vec2::test_fp16_distributed_zero3_robust
PASSED examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py::TestDeepSpeedWav2Vec2::test_fp16_non_distributed_zero2_base
PASSED examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py::TestDeepSpeedWav2Vec2::test_fp16_non_distributed_zero2_robust
PASSED examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py::TestDeepSpeedWav2Vec2::test_fp16_non_distributed_zero3_base
PASSED examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py::TestDeepSpeedWav2Vec2::test_fp16_non_distributed_zero3_robust
PASSED examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py::TestDeepSpeedWav2Vec2::test_fp32_distributed_zero2_base
PASSED examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py::TestDeepSpeedWav2Vec2::test_fp32_distributed_zero2_robust
PASSED examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py::TestDeepSpeedWav2Vec2::test_fp32_distributed_zero3_base
PASSED examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py::TestDeepSpeedWav2Vec2::test_fp32_distributed_zero3_robust
PASSED examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py::TestDeepSpeedWav2Vec2::test_fp32_non_distributed_zero2_base
PASSED examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py::TestDeepSpeedWav2Vec2::test_fp32_non_distributed_zero2_robust
PASSED examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py::TestDeepSpeedWav2Vec2::test_fp32_non_distributed_zero3_base
PASSED examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py::TestDeepSpeedWav2Vec2::test_fp32_non_distributed_zero3_robust
SKIPPED [2] ../../../../../home/stas/anaconda3/envs/py38-pt110/lib/python3.8/unittest/case.py:118: test requires multiple GPUs
I am not using the research folder; the script is from pytorch/speech-recognition.
That's exactly what I was trying to say. When I ported wav2vec2 to work with DeepSpeed, I wrote a set of tests to validate that it continues working.
When further work on wav2vec2 was done, those tests weren't adapted to the new scripts. So I have no idea whether the new functionality requires some changes in the model, or whether the error you have encountered has nothing to do with using DeepSpeed itself.
Bottom line: let's wait for @anton-l or @patrickvonplaten to follow up since they are the maintainers of this "domain" and perhaps they have encountered this issue outside of Deepspeed.
If not then the new examples need to be tested first under Deepspeed to ensure that the model works.
It would indeed be very nice to add tests for DeepSpeed and the official speech recognition examples. I think I've kinda dropped the ball here. Thanks a lot for opening the PR - I'll help you through it @flozi00 :-)
I have found the error. When I removed apex as the fp16 backend, everything worked again.
Great to hear that you found a solution, @flozi00!
Perhaps you could share the failing ds_config and command line for posterity? You said it was the staple ds config, so I wonder how apex was getting activated. Thank you.
Also, I'm not even testing apex with DeepSpeed, as it's kind of pointless since amp is better, but perhaps someone with an old pytorch will want it... Perhaps I could test that.
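For readers hitting the same issue: the fix reported above amounts to not selecting apex as the fp16 backend. In the Trainer of that era, the backend was chosen via the `--fp16_backend` argument (choices: auto/amp/apex). A minimal sketch of the command-line change, with the argument list and `switch_to_amp` helper being illustrative rather than taken from this thread:

```python
# Sketch of the reported fix: replace the apex fp16 backend selection with
# native amp. The argument list here is illustrative; "ds_config.json" is a
# placeholder for whatever DeepSpeed config was actually used.

failing_args = ["--fp16", "--fp16_backend", "apex", "--deepspeed", "ds_config.json"]

def switch_to_amp(args):
    """Replace an apex fp16 backend selection with native amp."""
    return ["amp" if a == "apex" else a for a in args]

working_args = switch_to_amp(failing_args)
```

Dropping the `--fp16_backend` argument entirely would also work, since "auto" preferred amp on recent pytorch versions.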
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Environment info
transformers version: master
Who can help
@stas00 @patrickvonplaten @anton-l
Information
Model I am using (Bert, XLNet ...):
The problem arises when using:
The tasks I am working on is:
To reproduce
The DeepSpeed config is the same as the one used in the tests in this repo.
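For reference, the ZeRO-2 configs shipped with the repo's tests follow this general shape; the exact file used here wasn't shared, so this is a representative sketch using the standard DeepSpeed config fields and the HF integration's "auto" placeholder values, not the reporter's actual config:

```json
{
  "fp16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
```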
Expected behavior