Closed by FanaticPythoner 4 months ago
Hi @FanaticPythoner, thanks for the detailed report! It is indeed strange that sequential works while auto (which uses "balanced") fails. Could you check the output of model.hf_device_map? You could also try allocating each layer to a specific device, or setting the max_memory arg when using sequential.
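For example, something along these lines. This is only a rough sketch; the max_memory limits are placeholders to adapt to your GPUs:

import torch
from transformers import pipeline

# Cap how much of each GPU accelerate may fill when dispatching the model.
# The limits below are placeholders; adjust them to your actual cards.
max_memory = {0: "20GiB", 1: "20GiB", 2: "20GiB"}

pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/starchat-beta",
    torch_dtype=torch.bfloat16,
    device_map="sequential",
    model_kwargs={"max_memory": max_memory},
)

A hand-written device_map dict (in the same module-to-device format that model.hf_device_map prints) can also be passed instead of "sequential" if you want to pin each layer to a specific GPU yourself.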
@SunMarc
For the code:
import torch
from transformers import pipeline
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"
pipe = pipeline("text-generation", model="HuggingFaceH4/starchat-beta", torch_dtype=torch.bfloat16, device_map='auto')
# We use a variant of ChatML to format each message
prompt_template = "<|system|>\n<|end|>\n<|user|>\n{query}<|end|>\n<|assistant|>"
prompt = prompt_template.format(query="How do I sort a list in Python?")
# We use a special <|end|> token with ID 49155 to denote ends of a turn
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.2, top_k=50, top_p=0.95, eos_token_id=49155)
print(outputs)
Doing:
import json
print(json.dumps(pipe.model.hf_device_map, indent=4))
Prints:
{
"transformer.wte": 0,
"lm_head": 0,
"transformer.wpe": 0,
"transformer.drop": 0,
"transformer.h.0": 0,
"transformer.h.1": 0,
"transformer.h.2": 0,
"transformer.h.3": 0,
"transformer.h.4": 0,
"transformer.h.5": 0,
"transformer.h.6": 0,
"transformer.h.7": 0,
"transformer.h.8": 0,
"transformer.h.9": 0,
"transformer.h.10": 0,
"transformer.h.11": 0,
"transformer.h.12": 0,
"transformer.h.13": 1,
"transformer.h.14": 1,
"transformer.h.15": 1,
"transformer.h.16": 1,
"transformer.h.17": 1,
"transformer.h.18": 1,
"transformer.h.19": 1,
"transformer.h.20": 1,
"transformer.h.21": 1,
"transformer.h.22": 1,
"transformer.h.23": 1,
"transformer.h.24": 1,
"transformer.h.25": 1,
"transformer.h.26": 1,
"transformer.h.27": 1,
"transformer.h.28": 2,
"transformer.h.29": 2,
"transformer.h.30": 2,
"transformer.h.31": 2,
"transformer.h.32": 2,
"transformer.h.33": 2,
"transformer.h.34": 2,
"transformer.h.35": 2,
"transformer.h.36": 2,
"transformer.h.37": 2,
"transformer.h.38": 2,
"transformer.h.39": 2,
"transformer.ln_f": 2
}
Furthermore, our current codebase has several different mechanisms that handle model balancing. Changing device_map="auto" to device_map="sequential" would be much more time consuming for us than it would be in a small-scale project. My team and I would highly appreciate it if this issue could be treated as high priority, given that it breaks the entire system, and I'm sure we won't be the only ones experiencing it.
And what is the model.hf_device_map with device_map="sequential"?
@SunMarc
For the code:
import torch
from transformers import pipeline
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"
pipe = pipeline("text-generation", model="HuggingFaceH4/starchat-beta", torch_dtype=torch.bfloat16, device_map='sequential')
# We use a variant of ChatML to format each message
prompt_template = "<|system|>\n<|end|>\n<|user|>\n{query}<|end|>\n<|assistant|>"
prompt = prompt_template.format(query="How do I sort a list in Python?")
# We use a special <|end|> token with ID 49155 to denote ends of a turn
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.2, top_k=50, top_p=0.95, eos_token_id=49155)
print(outputs)
Doing:
import json
print(json.dumps(pipe.model.hf_device_map, indent=4))
Prints:
{
"": 0
}
Let me send you the same code for comparison, but using Mixtral instead of starchat, since Mixtral is larger.
Oh, that makes sense, then: with starchat the model fits on a single GPU. Yeah, let's check with Mixtral.
@SunMarc
For the code:
import torch
from transformers import pipeline
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"
pipe = pipeline("text-generation", model="mistralai/Mixtral-8x7B-Instruct-v0.1", torch_dtype=torch.bfloat16, device_map='auto')
# We use a variant of ChatML to format each message
prompt_template = "<|system|>\n<|end|>\n<|user|>\n{query}<|end|>\n<|assistant|>"
prompt = prompt_template.format(query="How do I sort a list in Python?")
# We use a special <|end|> token with ID 49155 to denote ends of a turn
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.2, top_k=50, top_p=0.95, eos_token_id=49155)
print(outputs)
Doing:
import json
print(json.dumps(pipe.model.hf_device_map, indent=4))
Prints:
{
"model.embed_tokens": 0,
"model.layers.0": 0,
"model.layers.1": 0,
"model.layers.2": 0,
"model.layers.3": 0,
"model.layers.4": 0,
"model.layers.5": 0,
"model.layers.6": 0,
"model.layers.7": 0,
"model.layers.8": 0,
"model.layers.9": 0,
"model.layers.10": 1,
"model.layers.11": 1,
"model.layers.12": 1,
"model.layers.13": 1,
"model.layers.14": 1,
"model.layers.15": 1,
"model.layers.16": 1,
"model.layers.17": 1,
"model.layers.18": 1,
"model.layers.19": 1,
"model.layers.20": 1,
"model.layers.21": 2,
"model.layers.22": 2,
"model.layers.23": 2,
"model.layers.24": 2,
"model.layers.25": 2,
"model.layers.26": 2,
"model.layers.27": 2,
"model.layers.28": 2,
"model.layers.29": 2,
"model.layers.30": 2,
"model.layers.31": 2,
"model.norm": 2,
"lm_head": 2
}
Now, for the code:
import torch
from transformers import pipeline
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"
pipe = pipeline("text-generation", model="mistralai/Mixtral-8x7B-Instruct-v0.1", torch_dtype=torch.bfloat16, device_map='sequential')
# We use a variant of ChatML to format each message
prompt_template = "<|system|>\n<|end|>\n<|user|>\n{query}<|end|>\n<|assistant|>"
prompt = prompt_template.format(query="How do I sort a list in Python?")
# We use a special <|end|> token with ID 49155 to denote ends of a turn
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.2, top_k=50, top_p=0.95, eos_token_id=49155)
print(outputs)
Doing:
import json
print(json.dumps(pipe.model.hf_device_map, indent=4))
Prints:
{
"model.embed_tokens": 0,
"model.layers.0": 0,
"model.layers.1": 0,
"model.layers.2": 0,
"model.layers.3": 0,
"model.layers.4": 0,
"model.layers.5": 0,
"model.layers.6": 0,
"model.layers.7": 0,
"model.layers.8": 0,
"model.layers.9": 0,
"model.layers.10": 0,
"model.layers.11": 0,
"model.layers.12": 0,
"model.layers.13": 0,
"model.layers.14": 0,
"model.layers.15": 0,
"model.layers.16": 1,
"model.layers.17": 1,
"model.layers.18": 1,
"model.layers.19": 1,
"model.layers.20": 1,
"model.layers.21": 1,
"model.layers.22": 1,
"model.layers.23": 1,
"model.layers.24": 1,
"model.layers.25": 1,
"model.layers.26": 1,
"model.layers.27": 1,
"model.layers.28": 1,
"model.layers.29": 1,
"model.layers.30": 1,
"model.layers.31": 1,
"model.norm": 1,
"lm_head": 1
}
I also looked at the result of outputs. With "auto", it doesn't even finish and throws a nan/inf error. With "sequential", it behaves as expected, i.e., it answers correctly.
Yes, the above code uses the starchat template... It still works.
It is probably a communication issue with your GPUs. I see that with "sequential", only two GPUs are used. One quick way to work around this would be to run the model on only the first two GPUs by specifying CUDA_VISIBLE_DEVICES=0,1. You can also try to check in which layer the generation starts to output gibberish.
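If it helps, here is a rough sketch of how you could hook the decoder layers to see where NaN/inf values first show up. The "model.layers.N" naming assumes a Llama/Mixtral-style layout, so adapt it to the model you are testing:

import torch

# Rough sketch: hook every decoder layer and print the ones whose output
# contains NaN or inf; the first layer printed is where things start to break.
def add_nan_hooks(model):
    handles = []
    for name, module in model.named_modules():
        if name.startswith("model.layers.") and name.count(".") == 2:
            def hook(mod, inputs, output, name=name):
                out = output[0] if isinstance(output, tuple) else output
                if torch.isnan(out).any() or torch.isinf(out).any():
                    print(f"NaN/inf in output of {name} (device {out.device})")
            handles.append(module.register_forward_hook(hook))
    return handles

handles = add_nan_hooks(pipe.model)
_ = pipe(prompt, max_new_tokens=16, do_sample=False)
for h in handles:
    h.remove()

Using do_sample=False avoids the multinomial sampling error, so the forward passes still run and the hooks can fire.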
The hardware and drivers have been triple-checked by the bare-metal provider. On my 3x 3090 setup, I don't use NVLink; maybe that's the key, or maybe it's something else.
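For reference, this is the kind of quick device-to-device copy check I can run on the box; it is only a rough sanity test, not a full NVLink/NCCL diagnostic:

import torch

# Check peer access and a plain tensor copy between every pair of visible GPUs.
n = torch.cuda.device_count()
for src in range(n):
    for dst in range(n):
        if src == dst:
            continue
        peer = torch.cuda.can_device_access_peer(src, dst)
        x = torch.randn(1024, 1024, device=f"cuda:{src}")
        y = x.to(f"cuda:{dst}")
        ok = torch.equal(x.cpu(), y.cpu())
        print(f"cuda:{src} -> cuda:{dst} | peer_access={peer} | copy_ok={ok}")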
As an update, I tested both "sequential" and "auto" on Llama 3 70B in bfloat16. Both fail to run inference and throw:
Exception has occurred: RuntimeError
probability tensor contains either `inf`, `nan` or element < 0
File "/root/hwsrc/project_name/main.py", line 16, in <module>
output = pipe("Hey how are you doing today?")
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
Here are the device maps and the code that was used.
import transformers
import torch
import json
model_id = "meta-llama/Meta-Llama-3-70B"
pipe = transformers.pipeline("text-generation",
                             model=model_id,
                             model_kwargs={
                                 "torch_dtype": torch.bfloat16,
                                 "max_memory": {0: "42GiB", 1: "42GiB", 2: "42GiB", 3: "42GiB"}
                             },
                             device_map="sequential")
print(json.dumps(pipe.model.hf_device_map, indent=4))
output = pipe("Hey how are you doing today?")
print(output)
outputs:
{
"model.embed_tokens": 0,
"model.layers.0": 0,
"model.layers.1": 0,
"model.layers.2": 0,
"model.layers.3": 0,
"model.layers.4": 0,
"model.layers.5": 0,
"model.layers.6": 0,
"model.layers.7": 0,
"model.layers.8": 0,
"model.layers.9": 0,
"model.layers.10": 0,
"model.layers.11": 0,
"model.layers.12": 0,
"model.layers.13": 0,
"model.layers.14": 0,
"model.layers.15": 0,
"model.layers.16": 0,
"model.layers.17": 0,
"model.layers.18": 0,
"model.layers.19": 0,
"model.layers.20": 0,
"model.layers.21": 0,
"model.layers.22": 0,
"model.layers.23": 1,
"model.layers.24": 1,
"model.layers.25": 1,
"model.layers.26": 1,
"model.layers.27": 1,
"model.layers.28": 1,
"model.layers.29": 1,
"model.layers.30": 1,
"model.layers.31": 1,
"model.layers.32": 1,
"model.layers.33": 1,
"model.layers.34": 1,
"model.layers.35": 1,
"model.layers.36": 1,
"model.layers.37": 1,
"model.layers.38": 1,
"model.layers.39": 1,
"model.layers.40": 1,
"model.layers.41": 1,
"model.layers.42": 1,
"model.layers.43": 1,
"model.layers.44": 1,
"model.layers.45": 1,
"model.layers.46": 1,
"model.layers.47": 1,
"model.layers.48": 1,
"model.layers.49": 2,
"model.layers.50": 2,
"model.layers.51": 2,
"model.layers.52": 2,
"model.layers.53": 2,
"model.layers.54": 2,
"model.layers.55": 2,
"model.layers.56": 2,
"model.layers.57": 2,
"model.layers.58": 2,
"model.layers.59": 2,
"model.layers.60": 2,
"model.layers.61": 2,
"model.layers.62": 2,
"model.layers.63": 2,
"model.layers.64": 2,
"model.layers.65": 2,
"model.layers.66": 2,
"model.layers.67": 2,
"model.layers.68": 2,
"model.layers.69": 2,
"model.layers.70": 2,
"model.layers.71": 2,
"model.layers.72": 2,
"model.layers.73": 2,
"model.layers.74": 2,
"model.layers.75": 3,
"model.layers.76": 3,
"model.layers.77": 3,
"model.layers.78": 3,
"model.layers.79": 3,
"model.norm": 3,
"lm_head": 3
}
import transformers
import torch
import json
model_id = "meta-llama/Meta-Llama-3-70B"
pipe = transformers.pipeline("text-generation",
                             model=model_id,
                             model_kwargs={
                                 "torch_dtype": torch.bfloat16,
                                 "max_memory": {0: "42GiB", 1: "42GiB", 2: "42GiB", 3: "42GiB"}
                             },
                             device_map="auto")
print(json.dumps(pipe.model.hf_device_map, indent=4))
output = pipe("Hey how are you doing today?")
print(output)
outputs:
{
"model.embed_tokens": 0,
"model.layers.0": 0,
"model.layers.1": 0,
"model.layers.2": 0,
"model.layers.3": 0,
"model.layers.4": 0,
"model.layers.5": 0,
"model.layers.6": 0,
"model.layers.7": 0,
"model.layers.8": 0,
"model.layers.9": 0,
"model.layers.10": 0,
"model.layers.11": 0,
"model.layers.12": 0,
"model.layers.13": 0,
"model.layers.14": 0,
"model.layers.15": 0,
"model.layers.16": 0,
"model.layers.17": 0,
"model.layers.18": 0,
"model.layers.19": 1,
"model.layers.20": 1,
"model.layers.21": 1,
"model.layers.22": 1,
"model.layers.23": 1,
"model.layers.24": 1,
"model.layers.25": 1,
"model.layers.26": 1,
"model.layers.27": 1,
"model.layers.28": 1,
"model.layers.29": 1,
"model.layers.30": 1,
"model.layers.31": 1,
"model.layers.32": 1,
"model.layers.33": 1,
"model.layers.34": 1,
"model.layers.35": 1,
"model.layers.36": 1,
"model.layers.37": 1,
"model.layers.38": 1,
"model.layers.39": 1,
"model.layers.40": 2,
"model.layers.41": 2,
"model.layers.42": 2,
"model.layers.43": 2,
"model.layers.44": 2,
"model.layers.45": 2,
"model.layers.46": 2,
"model.layers.47": 2,
"model.layers.48": 2,
"model.layers.49": 2,
"model.layers.50": 2,
"model.layers.51": 2,
"model.layers.52": 2,
"model.layers.53": 2,
"model.layers.54": 2,
"model.layers.55": 2,
"model.layers.56": 2,
"model.layers.57": 2,
"model.layers.58": 2,
"model.layers.59": 2,
"model.layers.60": 2,
"model.layers.61": 3,
"model.layers.62": 3,
"model.layers.63": 3,
"model.layers.64": 3,
"model.layers.65": 3,
"model.layers.66": 3,
"model.layers.67": 3,
"model.layers.68": 3,
"model.layers.69": 3,
"model.layers.70": 3,
"model.layers.71": 3,
"model.layers.72": 3,
"model.layers.73": 3,
"model.layers.74": 3,
"model.layers.75": 3,
"model.layers.76": 3,
"model.layers.77": 3,
"model.layers.78": 3,
"model.layers.79": 3,
"model.norm": 3,
"lm_head": 3
}
@SunMarc
Would anyone care to look at this, please? Whether it's @SunMarc or someone else. I strongly suspect it's an HF compatibility issue with NVLink, but I can't say that with 100% certainty.
We would really appreciate any help on this roadblock... many thanks!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
Information
Tasks
no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
Reproduction
System specs:
This:
Prints:
[{'generated_text': '<|system|>\n<|end|>\n<|user|>\nHow do I sort a list in Python?<|end|>\n<|assistant|> to. Air1 (\nь\nInfo plit An che a\n the weьь Share Share\n aremobatar\n…We brain be S jj jj'..., … …: no J\n,…AL more of… y they code lifefl\n -- B moreand.. L\nplitahph a after\n Ishare, E I I is L\n unel not Mid' I'’ …\n\n …" you a a South strength I I S said "no\n\n\n E E11\n EASC not Sh English. of of E |isse\n as that said said of said reg of The The– n a… Open. The The for | A after After\n was M open open over in been\n\n into,onAR down :-)mad cos I you to E,( not "a001 that vis m44\n\n\n of3\n re1 T by so itack in inententancy of is int Library to U U.. a a = ==Compression Itdata66 as111110 S'}]
While this:
Prints:
[{'generated_text': '<|system|>\n<|end|>\n<|user|>\nHow do I sort a list in Python?<|end|>\n<|assistant|>\nThere are multiple ways to sort a list in Python. One of the most common ways is to use the sort() method. Here is an example:\n\n```\nmy_list = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]\nmy_list.sort()\nprint(my_list)\n```\n\nThis will sort the list in place and print the sorted list.\n\nAnother way to sort a list is to use the sorted() function. This function returns a new sorted list and does not modify the original list. Here is an example:\n\n```\nmy_list = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]\nsorted_list = sorted(my_list)\nprint(sorted_list)\n```\n\nIn this example, the sorted_list variable will contain the sorted list and the my_list variable will remain unchanged.\n\nThere are also other sorting algorithms available in the built-in sort module, such as quicksort, heapsort, and merge sort. You'}]
Expected behavior
Both should print coherent text. This happens no matter the model chosen. In the above reproduction steps, the model used is HuggingFaceH4/starchat-beta. The exact same thing happens with mistralai/Mixtral-8x7B-Instruct-v0.1, whether run in bfloat16, float16, or float32, quantized or not. The issue also occurs regardless of the prompt. The issue, however, does NOT occur when device_map="sequential" is set (tested with HuggingFaceH4/starchat-beta only).
Furthermore, the issue does NOT occur with device_map="auto" on my home 3x RTX 3090 / Threadripper 3960x setup.
However, I cannot use sequential in our current production environment without making significant changes.