Open brando90 opened 2 weeks ago
Hm that is very weird - is this like a machine with multiple cards - could you try nvidia-smi
(beyond_scale_2) @.***~/beyond-scale-2-alignment-coeff $ nvidia-smi
Tue Nov 5 08:57:13 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-SXM4-80GB On | 00000000:07:00.0 Off | 0 |
| N/A 54C P0 223W / 400W | 75448MiB / 81920MiB | 97% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A100-SXM4-80GB On | 00000000:0A:00.0 Off | 0 |
| N/A 43C P0 89W / 400W | 31490MiB / 81920MiB | 88% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA A100-SXM4-80GB On | 00000000:44:00.0 Off | 0 |
| N/A 31C P0 68W / 400W | 1031MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA A100-SXM4-80GB On | 00000000:4A:00.0 Off | 0 |
| N/A 60C P0 297W / 400W | 31514MiB / 81920MiB | 84% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA A100-SXM4-80GB On | 00000000:84:00.0 Off | 0 |
| N/A 38C P0 97W / 400W | 23790MiB / 81920MiB | 31% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA A100-SXM4-80GB On | 00000000:8A:00.0 Off | 0 |
| N/A 37C P0 105W / 400W | 71724MiB / 81920MiB | 96% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA A100-SXM4-80GB On | 00000000:C0:00.0 Off | 0 |
| N/A 52C P0 269W / 400W | 31518MiB / 81920MiB | 85% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA A100-SXM4-80GB On | 00000000:C3:00.0 Off | 0 |
| N/A 55C P0 237W / 400W | 60673MiB / 81920MiB | 88% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 412907 C python 1018MiB |
| 0 N/A N/A 2531149 C python 74416MiB |
| 1 N/A N/A 3611 C ...nqduc/miniconda3/envs/lf/bin/python 30540MiB |
| 1 N/A N/A 1534976 C python 908MiB |
| 2 N/A N/A 4165148 C python 2482MiB |
| 3 N/A N/A 2201035 C python 848MiB |
| 3 N/A N/A 4140397 C ...nqduc/miniconda3/envs/lf/bin/python 30624MiB |
| 4 N/A N/A 2174832 C ...iconda3/envs/ampere1-env/bin/python 9328MiB |
| 4 N/A N/A 2737509 C python 14412MiB |
| 5 N/A N/A 119688 C python 43242MiB |
| 5 N/A N/A 124733 C python 28468MiB |
| 6 N/A N/A 111759 C ...nqduc/miniconda3/envs/lf/bin/python 30548MiB |
| 6 N/A N/A 1488814 C python 928MiB |
| 7 N/A N/A 3185003 C python 60650MiB |
+-----------------------------------------------------------------------------------------+
The error was also non-deterministic. I changed nothing in my code and it went away (at least for one run). I didn't try again afterwards, since lm_head wasn't LoRA-able, but it was definitely non-deterministic. Let me know how I can help. I think I attached the code.
On Nov 5, 2024, at 2:03 AM, Daniel Han @.***> wrote:
Hm that is very weird - is this like a machine with multiple cards - could you try nvidia-smi
I encountered the same issue on a single machine with multiple GPUs. I set os.environ["CUDA_VISIBLE_DEVICES"] = "1" at the beginning of the code to select a single GPU, but it sometimes throws the following error:
RuntimeError: Unsloth currently does not support multi GPU setups - but we are working on it!
Without changing any code, rerunning it sometimes succeeds and sometimes fails. I believe this issue is the same as #983, and I hope it can be fixed as soon as possible.
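For anyone trying the same workaround, here is a minimal sketch of pinning the process to one GPU. CUDA reads CUDA_VISIBLE_DEVICES when it initializes, so the variable has to be set before anything in the process touches CUDA; setting it before importing torch or unsloth is the conservative choice. The helper name `pin_single_gpu` is hypothetical (not part of Unsloth or PyTorch), and this is not confirmed to resolve the intermittent error reported in this thread.

```python
import os
import sys


def pin_single_gpu(index: str = "1") -> bool:
    """Restrict this process to a single GPU via CUDA_VISIBLE_DEVICES.

    Hypothetical helper for illustration only. Returns True if torch had
    not been imported yet when the variable was set, and False if torch
    was already imported (in which case CUDA may already be initialized
    and the setting may be ignored).
    """
    torch_already_imported = "torch" in sys.modules
    os.environ["CUDA_VISIBLE_DEVICES"] = index
    return not torch_already_imported


# Call this at the very top of the script, before `import torch`
# or `import unsloth`.
pinned_early = pin_single_gpu("1")
if not pinned_early:
    print("Warning: torch was imported before CUDA_VISIBLE_DEVICES was set.")
```

An alternative that avoids import-order problems entirely is to set the variable in the shell when launching, e.g. `CUDA_VISIBLE_DEVICES=1 python train.py`, where `train.py` stands in for your own script.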
code
but I'm only doing 1 gpu a100...