Closed: KeepAndWin closed this issue 1 year ago
@KeepAndWin, unfortunately I cannot repro this error because I do not have access to those specific GPU types. Most likely something on that second machine is incompatible with CUDA or with the built-in ops. I suggest trying plain PyTorch + CUDA first, without DeepSpeed, to make sure that works.
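One way to run the suggested PyTorch + CUDA check is sketched below. It is only an illustration: the function name `cuda_sanity_check` and the report layout are made up for this example, and it assumes nothing about the original training script. It runs a small matrix multiply on every visible GPU and records whether it completed. A hard crash such as a segfault cannot be caught by Python, so if the script itself dies with a signal, that already points at the PyTorch/CUDA/driver layer rather than DeepSpeed.

```python
def cuda_sanity_check(matrix_size=256):
    """Run a tiny matmul on every visible GPU and return a report dict.

    Hypothetical helper for illustration; returns early (without raising)
    when PyTorch is not installed or CUDA is not available.
    """
    report = {"torch_installed": False, "cuda_available": False, "devices": []}
    try:
        import torch
    except ImportError:
        return report

    report["torch_installed"] = True
    report["cuda_available"] = torch.cuda.is_available()
    if not report["cuda_available"]:
        return report

    for index in range(torch.cuda.device_count()):
        entry = {
            "index": index,
            "name": torch.cuda.get_device_name(index),
            "capability": torch.cuda.get_device_capability(index),
            "ok": False,
            "error": None,
        }
        try:
            x = torch.randn(matrix_size, matrix_size, device=f"cuda:{index}")
            y = x @ x
            # Wait for the kernel to finish so errors surface here.
            torch.cuda.synchronize(index)
            entry["ok"] = bool(torch.isfinite(y).all().item())
        except RuntimeError as exc:
            # CUDA errors that PyTorch can report arrive as RuntimeError.
            entry["error"] = str(exc)
        report["devices"].append(entry)
    return report


if __name__ == "__main__":
    print(cuda_sanity_check())
```

If every device reports `ok: True` here but the DeepSpeed launch still fails, the problem is more likely in the launcher environment or the compiled ops than in the base CUDA setup.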
@KeepAndWin please see this thread for the latest discussion on this: https://github.com/microsoft/DeepSpeed/issues/4002
It seems both return codes, -7 and -11, are related to shared memory issues with Docker. Please see this reply, which has recently fixed the same problem for other people: https://github.com/microsoft/DeepSpeed/issues/4002#issuecomment-1644268195
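For anyone landing here, a rough sketch of how to read those return codes and check shared memory from inside the container. The helper names are invented for this example. It relies on the Python `subprocess` convention that a negative return code -N means the child was killed by signal N; on typical Linux systems signal 7 is SIGBUS and 11 is SIGSEGV. The `/dev/shm` path is the usual Linux shared-memory mount and is an assumption about your setup.

```python
import shutil
import signal


def describe_return_code(code):
    """Translate a subprocess return code into a readable description."""
    if code >= 0:
        return f"exited normally with status {code}"
    try:
        name = signal.Signals(-code).name
    except ValueError:
        name = "unknown signal"
    return f"killed by signal {-code} ({name})"


def shared_memory_report(path="/dev/shm"):
    """Return total/free size of the shared-memory mount in MiB, or None."""
    try:
        usage = shutil.disk_usage(path)
    except OSError:
        # Path does not exist or is not accessible on this system.
        return None
    mib = 1024 * 1024
    return {"path": path, "total_mib": usage.total // mib, "free_mib": usage.free // mib}


if __name__ == "__main__":
    for code in (-7, -11):
        print(code, "->", describe_return_code(code))
    print(shared_memory_report())
```

If the reported total is very small, the container is probably running with Docker's default shared-memory size. `docker run` accepts `--shm-size` (for example `--shm-size=8g`) and `--ipc=host` to change that; whether either one is the fix described in the linked reply should be checked there.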
Closing for now.
So how do you solve the issue?
Did you solve this problem? I ran into it too.
Describe the bug
I ran the code successfully on a machine with 4x 1080 Ti GPUs. However, when I ran the same code on a machine with 2x 3090s, DeepSpeed reports "Kill subprocess" right after "[real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)". In the end, "exits with return code = -11" is printed.
ds_report output
Screenshots
System info (please complete the following information):