cc @SunMarc
What I also tried:
- `accelerate config`
- the `device` option on the `transformers.pipeline()` call (works, even for larger models): `transformers.pipeline(..., device=1)` -> it works! [when it fits...] (the two call styles are sketched below)
- a fresh env (`python=3.11 pytorch=2.3.0 ... pytorch-cuda=11.8 cuda=11.8 cudatoolkit=11.8 -c pytorch -c nvidia`) -> same behavior: 1 GPU OK, `device_map="auto"` -> meaningless chars out.

Thank you!
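For reference, a minimal sketch of the two call styles being compared (model ID taken from the reproduction section at the bottom of this issue; the prompt and generation arguments are my own placeholders):

```python
from transformers import pipeline

prompt = "Hello, I'm a language model,"

# Explicit single-GPU placement -> sane output (when the model fits on one card).
pipe_single = pipeline("text-generation", model="openai-community/gpt2", device=1)
print(pipe_single(prompt, max_new_tokens=30))

# Sharded across both GPUs via accelerate -> meaningless characters in this setup.
pipe_sharded = pipeline("text-generation", model="openai-community/gpt2", device_map="auto")
print(pipe_sharded(prompt, max_new_tokens=30))
```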
This might be a different issue, but I tried to do `accelerate test` with and without distributed computing enabled.
Running with this config will block the test:
- `Accelerate` version: 0.30.1
- Platform: Linux-6.5.0-35-generic-x86_64-with-glibc2.35
- `accelerate` bash location: /home/visoft/miniforge3/envs/llm/bin/accelerate
- Python version: 3.11.9
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.3.0 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 188.37 GB
- GPU type: NVIDIA Graphics Device
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_GPU
- mixed_precision: fp16
- use_cpu: False
- debug: False
- num_processes: 2
- machine_rank: 0
- num_machines: 1
- gpu_ids: all
- rdzv_backend: static
- same_network: True
- main_training_function: main
- enable_cpu_affinity: False
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
The test blocks right after the stdout output `Mixed precision type: fp16`. Both GPUs sit at 100% utilization with some constant vRAM usage until I hit CTRL-C.
Running `NCCL_P2P_DISABLE=1 accelerate test` will make the test successful!
HOWEVER, running the script shown in the 1st comment with `NCCL_P2P_DISABLE=1 python inference.py` and `device_map="auto"` in `transformers.pipeline()` does not shake off the messy output. So either `NCCL_P2P_DISABLE` is ignored when running the Python code, or there is another issue related to distributed computing.
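To rule out the variable simply not reaching the Python process (whether NCCL is even on the `device_map="auto"` code path here is part of what's unclear), one sanity check is to set it from inside the script before torch/transformers are imported. A sketch, with the model ID and prompt as placeholders:

```python
import os
# Must be set before torch / NCCL are initialized to have any effect.
os.environ["NCCL_P2P_DISABLE"] = "1"

from transformers import pipeline

pipe = pipeline("text-generation", model="openai-community/gpt2", device_map="auto")
print(pipe("Hello, I'm a language model,", max_new_tokens=30))
```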
What GPU are you using?
2x 4080 Super 16GB on PCIe 8x. Driver Version: 545.29.06 (as reported by nvidia-smi):
- GIGABYTE GeForce RTX 4080 SUPER WINDFORCE V2 16GB GDDR6X 256-bit DLSS 3.0
- GIGABYTE GeForce RTX 4080 SUPER GAMING OC 16GB GDDR6X 256-bit DLSS 3.0
That'd make sense. I'm quite surprised that we didn't automatically disable this for you, however; the CLI should do so.
Can you do a quick sanity check for me? What does:
from accelerate.utils.environment import get_driver_version, check_cuda_p2p_ib_support
print(get_driver_version())
print(check_cuda_p2p_ib_support())
report back for you?
Appreciate this a ton @cristi-zz 🙏 (I know it's a slightly different issue)
I ran the above script in the cuda=11.8 env described here https://github.com/huggingface/accelerate/issues/2812#issuecomment-2139409879:
$ python
Python 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from accelerate.utils.environment import get_driver_version, check_cuda_p2p_ib_support
>>> print(get_driver_version())
545.29.06
>>> print(check_cuda_p2p_ib_support())
True
>>>
Interesting:
$ NCCL_P2P_DISABLE=1 python
Python 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from accelerate.utils.environment import get_driver_version, check_cuda_p2p_ib_support
>>> print(get_driver_version())
545.29.06
>>> print(check_cuda_p2p_ib_support())
True
>>>
Hi! I'm hitting the same issue when using multiple GPUs for inference. Have you fixed the bug?
Nope, but I have a strong clue: I have nvidia 545.29.06 drivers ("closed" version). I've done some reading and I think:
- the 545 driver (and maybe other versions) has a bug where it reports that the system has P2P capability when in fact it does not. So everything (incl. accelerate) that depends on P2P will not work, because the frameworks assume the capability is present and functioning (a quick driver-level check is sketched right after this list).
- Maybe accelerate has some eager/non-blocking/unsynced calls somewhere in the code that assume the communication happened.
- The above supposition is because some other pieces of code that rely on P2P (torch+lightning using NCCL, accelerate training in multi-GPU mode) freeze when using the "default" NCCL and "unfreeze" when using another comm backend (gloo for lightning, setting P2P off for NCCL).
- Disabled IOMMU and checked for ACS (it was off). No change.
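A quick way to see what the driver itself claims, from inside PyTorch (just a sketch; `torch.cuda.can_device_access_peer` only echoes what the driver reports, which is exactly the part suspected to be wrong on 545.xx):

```python
import torch

# Ask the CUDA driver whether it advertises peer access between the two cards.
print(torch.cuda.get_device_name(0), "<->", torch.cuda.get_device_name(1))
print("0 -> 1:", torch.cuda.can_device_access_peer(0, 1))
print("1 -> 0:", torch.cuda.can_device_access_peer(1, 0))
```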
TODO [later]:
- Run nvidia-examples or similar, to check for P2P traffic and benchmark it (a rough torch-level copy timing is sketched after this list).
- Tweak `device_map`; maybe there is something other than `auto`.
- See how to "inject" a config into HF accelerate before instantiating the pipeline. My gut feeling is that it ignores the global config.
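For the benchmarking item, the proper tool is the `p2pBandwidthLatencyTest` sample from cuda-samples; a rough torch-level smoke test of a device-to-device copy could look like this (a sketch; tensor size and iteration count are arbitrary):

```python
import time
import torch

# ~256 MB tensor on GPU 0, copied repeatedly to GPU 1.
x = torch.randn(64 * 1024 * 1024, device="cuda:0")
x.to("cuda:1")  # warm-up copy
torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")

t0 = time.perf_counter()
for _ in range(10):
    y = x.to("cuda:1")
torch.cuda.synchronize("cuda:1")
dt = (time.perf_counter() - t0) / 10

print(f"avg copy: {dt * 1e3:.1f} ms, ~{x.numel() * 4 / dt / 1e9:.1f} GB/s")
```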
TODO [but quite scared]:
- Upgrade to some fresher driver, like 550.xx (from a PPA or similar), and see if, at least, P2P is reported correctly as n/a.
- Apply the tinygrad patch [I don't believe it is THAT simple, as explained in its install.sh].
In the meantime I decided to put this on the side and accept that I can't instantiate bigger models. Training works, at least [slowly, with `NCCL_P2P_DISABLE=1` on the command line], so it is not a blocker issue for me, just frustrating.
Some quick instructions. Tune them to your system:
Check for ACS:
sudo lspci -vvv | grep -i "VGA compatible controller" -A80 | grep -i acs
There should be no rows starting with ACSxxxyy OR, if such rows exist, the entries in them should be followed by a minus (-). If not, google some scripts to disable it.
Nvidia topology:
nvidia-smi topo -m
It should show (I think) PIX and not PHB.
Maybe you can try this one (the last post in https://github.com/huggingface/transformers/issues/20896). It doesn't work for me but it seems to work for somebody.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
This is not stale yet, but it will be some weeks before I manage to play around with benchmarks and especially the nvidia drivers.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Update: I was "forced" by Ubuntu to upgrade the NVIDIA driver. On 550.107.02, without any other software intentionally installed (e.g. nccl), things behave as expected. So it was something to do with the 545.29.06 driver.
Thanks for the update! It will definitely help others that hit the same issue as you!
System Info
Information
Tasks
- A `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
Reproduction
1) Create an env using mamba:
2) Create a new python file with [some code is truncated]:
The code is taken from the official HF model page: https://huggingface.co/openai-community/gpt2#how-to-use
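The actual file contents are truncated above; a sketch along the lines of the linked GPT-2 model card, plus the flag discussed in step 4, would be roughly:

```python
from transformers import pipeline, set_seed

generator = pipeline(
    "text-generation",
    model="openai-community/gpt2",
    device_map="auto",  # Culprit!!!
)
set_seed(42)
print(generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5))
```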
3) Note the meaningless output:
4) Remove the `device_map="auto"` at the line commented with `# Culprit!!!`
5) Re-run and observe the "sane" result:
6) Tried with meta-llama/Meta-Llama-3-8B, meta-llama/Llama-2-7b, and meta-llama/Meta-Llama-3-8B-Instruct; all return meaningless output. Below, a small sample (meta-llama/Meta-Llama-3-8B-Instruct, code from the official HF page):
Expected behavior
Words that make sense out of the model's output, even if the model is sharded between 2 GPUs.