cc @SunMarc
What I also tried:
- `accelerate config`
- the `device` option on the `transformers.pipeline()` call (works, even for larger models): `transformers.pipeline(..., device=1)` -> it works! [when it fits...] (the two call styles are sketched below)
- a fresh env (`python=3.11 pytorch=2.3.0 ... pytorch-cuda=11.8 cuda=11.8 cudatoolkit=11.8 -c pytorch -c nvidia`) -> same behavior: 1 GPU OK, `device_map="auto"` -> meaningless chars out.

Thank you!
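For reference, a minimal sketch of the two call styles being compared (model ID taken from the reproduction section at the bottom of this issue; the prompt and generation arguments are my own placeholders):

```python
from transformers import pipeline

prompt = "Hello, I'm a language model,"

# Explicit single-GPU placement -> sane output (when the model fits on one card).
pipe_single = pipeline("text-generation", model="openai-community/gpt2", device=1)
print(pipe_single(prompt, max_new_tokens=30))

# Sharded across both GPUs via accelerate -> meaningless characters in this setup.
pipe_sharded = pipeline("text-generation", model="openai-community/gpt2", device_map="auto")
print(pipe_sharded(prompt, max_new_tokens=30))
```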
This might be a different issue, but I tried to do `accelerate test` with and without distributed computing enabled.
Running with this config will block the test:
- `Accelerate` version: 0.30.1
- Platform: Linux-6.5.0-35-generic-x86_64-with-glibc2.35
- `accelerate` bash location: /home/visoft/miniforge3/envs/llm/bin/accelerate
- Python version: 3.11.9
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.3.0 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 188.37 GB
- GPU type: NVIDIA Graphics Device
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_GPU
- mixed_precision: fp16
- use_cpu: False
- debug: False
- num_processes: 2
- machine_rank: 0
- num_machines: 1
- gpu_ids: all
- rdzv_backend: static
- same_network: True
- main_training_function: main
- enable_cpu_affinity: False
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
The test blocks right after the stdout output `Mixed precision type: fp16`. Both GPUs sit at 100% utilization with some constant vRAM usage until I hit CTRL-C.
Running `NCCL_P2P_DISABLE=1 accelerate test` will make the test successful!
HOWEVER, running the script shown in the 1st comment with `NCCL_P2P_DISABLE=1 python inference.py` and `device_map="auto"` in `transformers.pipeline()` does not shake off the messy output. So either `NCCL_P2P_DISABLE` is ignored when running the Python code, or there is another issue related to distributed computing.
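To rule out the variable simply not reaching the Python process (whether NCCL is even on the `device_map="auto"` code path here is part of what's unclear), one sanity check is to set it from inside the script before torch/transformers are imported. A sketch, with the model ID and prompt as placeholders:

```python
import os
# Must be set before torch / NCCL are initialized to have any effect.
os.environ["NCCL_P2P_DISABLE"] = "1"

from transformers import pipeline

pipe = pipeline("text-generation", model="openai-community/gpt2", device_map="auto")
print(pipe("Hello, I'm a language model,", max_new_tokens=30))
```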
What GPU are you using?
2x 4080 Super 16GB on PCIe 8x. Driver Version: 545.29.06 (as reported by nvidia-smi):
- GIGABYTE GeForce RTX 4080 SUPER WINDFORCE V2 16GB GDDR6X 256-bit DLSS 3.0
- GIGABYTE GeForce RTX 4080 SUPER GAMING OC 16GB GDDR6X 256-bit DLSS 3.0
That'd make sense. I'm quite surprised that we didn't automatically disable this for you, however; the CLI should do so.
Can you do a quick sanity check for me? What does:
from accelerate.utils.environment import get_driver_version, check_cuda_p2p_ib_support
print(get_driver_version())
print(check_cuda_p2p_ib_support())
report back for you?
Appreciate this a ton @cristi-zz 🙏 (I know it's a slightly different issue)
I ran the above script in the cuda=11.8 env described here https://github.com/huggingface/accelerate/issues/2812#issuecomment-2139409879:
$ python
Python 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from accelerate.utils.environment import get_driver_version, check_cuda_p2p_ib_support
>>> print(get_driver_version())
545.29.06
>>> print(check_cuda_p2p_ib_support())
True
>>>
Interesting:
$ NCCL_P2P_DISABLE=1 python
Python 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from accelerate.utils.environment import get_driver_version, check_cuda_p2p_ib_support
>>> print(get_driver_version())
545.29.06
>>> print(check_cuda_p2p_ib_support())
True
>>>
Hi! I'm hitting the same issue when using multiple GPUs for inference. Have you fixed the bug?
Nope, but I have a strong clue: I have nvidia 545.29.06 drivers ("closed" version). I've done some reading and I think:
- the 545 driver (and maybe other versions) has a bug where it reports that the system has P2P capability when in fact it does not. So everything (incl. accelerate) that depends on P2P will not work, because the frameworks assume the capability is present and functioning (a quick driver-level check is sketched right after this list).
- Maybe accelerate has some eager/non-blocking/unsynced calls somewhere in the code that assume the communication happened.
- The above supposition is because some other pieces of code that rely on P2P (torch+lightning using NCCL, accelerate training in multi-GPU mode) freeze when using the "default" NCCL and "unfreeze" when using another comm backend (gloo for lightning, setting P2P off for NCCL).
- Disabled IOMMU and checked for ACS (it was off). No change.
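A quick way to see what the driver itself claims, from inside PyTorch (just a sketch; `torch.cuda.can_device_access_peer` only echoes what the driver reports, which is exactly the part suspected to be wrong on 545.xx):

```python
import torch

# Ask the CUDA driver whether it advertises peer access between the two cards.
print(torch.cuda.get_device_name(0), "<->", torch.cuda.get_device_name(1))
print("0 -> 1:", torch.cuda.can_device_access_peer(0, 1))
print("1 -> 0:", torch.cuda.can_device_access_peer(1, 0))
```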
TODO [later]:
- Run nvidia-examples or similar, to check for P2P traffic and benchmark it (a rough torch-level copy timing is sketched after this list).
- Tweak `device_map`; maybe there is something other than `auto`.
- See how to "inject" a config into HF accelerate before instantiating the pipeline. My gut feeling is that it ignores the global config.
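For the benchmarking item, the proper tool is the `p2pBandwidthLatencyTest` sample from cuda-samples; a rough torch-level smoke test of a device-to-device copy could look like this (a sketch; tensor size and iteration count are arbitrary):

```python
import time
import torch

# ~256 MB tensor on GPU 0, copied repeatedly to GPU 1.
x = torch.randn(64 * 1024 * 1024, device="cuda:0")
x.to("cuda:1")  # warm-up copy
torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")

t0 = time.perf_counter()
for _ in range(10):
    y = x.to("cuda:1")
torch.cuda.synchronize("cuda:1")
dt = (time.perf_counter() - t0) / 10

print(f"avg copy: {dt * 1e3:.1f} ms, ~{x.numel() * 4 / dt / 1e9:.1f} GB/s")
```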
TODO [but quite scared]:
- Upgrade to some fresher driver, like 550.xx (from a PPA or similar), and see if, at least, P2P is reported correctly as n/a.
- Apply the tinygrad patch [I don't believe it is THAT simple, as explained in its install.sh].
In the meantime I decided to put this on the side and accept that I can't instantiate bigger models. Training works, at least [slowly, with `NCCL_P2P_DISABLE=1` on the command line], so it is not a blocker issue for me, just frustrating.
Some quick instructions. Tune them to your system:
Check for ACS:
sudo lspci -vvv | grep -i "VGA compatible controller" -A80 | grep -i acs
There should be no rows starting with ACSxxxyy OR, if such rows exist, the entries in them should be followed by a minus (-). If not, google some scripts to disable it.
Nvidia topology:
nvidia-smi topo -m
It should show (I think) PIX and not PHB.
Maybe you can try this one (the last post in https://github.com/huggingface/transformers/issues/20896). It doesn't work for me but it seems to work for somebody.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
This is not stale yet, but it will be some weeks before I manage to play around with benchmarks and especially the nvidia drivers.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Update: I was "forced" by Ubuntu to upgrade the NVIDIA driver. On 550.107.02, without any other software intentionally installed (e.g. nccl), things behave as expected. So it was something to do with the 545.29.06 driver.
Thanks for the update! It will definitely help others that hit the same issue as you!
System Info
Information
Tasks
- A `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
Reproduction
1) Create an env using mamba:
2) Create a new python file with [some code is truncated]:
The code is taken from the official HF model page: https://huggingface.co/openai-community/gpt2#how-to-use
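The actual file contents are truncated above; a sketch along the lines of the linked GPT-2 model card, plus the flag discussed in step 4, would be roughly:

```python
from transformers import pipeline, set_seed

generator = pipeline(
    "text-generation",
    model="openai-community/gpt2",
    device_map="auto",  # Culprit!!!
)
set_seed(42)
print(generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5))
```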
3) Note the meaningless output:
4) Remove the `device_map="auto"` at the line commented with `# Culprit!!!`
5) Re-run and observe the "sane" result:
6) Tried with meta-llama/Meta-Llama-3-8B, meta-llama/Llama-2-7b, and meta-llama/Meta-Llama-3-8B-Instruct; all return meaningless output. Below, a small sample (meta-llama/Meta-Llama-3-8B-Instruct, code from the official HF page):
Expected behavior
Words that make sense out of the model's output, even if the model is sharded between 2 GPUs.