Closed by FanaticPythoner 4 months ago
Hi @FanaticPythoner, thanks for the detailed report! It is indeed strange that sequential works while auto (which uses "balanced") fails. Could you check the output of model.hf_device_map? You could also try allocating each layer to a specific device, or setting the max_memory arg when using sequential.
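For example, something along these lines. This is only a rough sketch; the max_memory limits are placeholders to adapt to your GPUs:

import torch
from transformers import pipeline

# Cap how much of each GPU accelerate may fill when dispatching the model.
# The limits below are placeholders; adjust them to your actual cards.
max_memory = {0: "20GiB", 1: "20GiB", 2: "20GiB"}

pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/starchat-beta",
    torch_dtype=torch.bfloat16,
    device_map="sequential",
    model_kwargs={"max_memory": max_memory},
)

A hand-written device_map dict (in the same module-to-device format that model.hf_device_map prints) can also be passed instead of "sequential" if you want to pin each layer to a specific GPU yourself.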
@SunMarc
For the code:
import torch
from transformers import pipeline
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"
pipe = pipeline("text-generation", model="HuggingFaceH4/starchat-beta", torch_dtype=torch.bfloat16, device_map='auto')
# We use a variant of ChatML to format each message
prompt_template = "<|system|>\n<|end|>\n<|user|>\n{query}<|end|>\n<|assistant|>"
prompt = prompt_template.format(query="How do I sort a list in Python?")
# We use a special <|end|> token with ID 49155 to denote ends of a turn
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.2, top_k=50, top_p=0.95, eos_token_id=49155)
print(outputs)
Doing:
import json
print(json.dumps(pipe.model.hf_device_map, indent=4))
Prints:
{
"transformer.wte": 0,
"lm_head": 0,
"transformer.wpe": 0,
"transformer.drop": 0,
"transformer.h.0": 0,
"transformer.h.1": 0,
"transformer.h.2": 0,
"transformer.h.3": 0,
"transformer.h.4": 0,
"transformer.h.5": 0,
"transformer.h.6": 0,
"transformer.h.7": 0,
"transformer.h.8": 0,
"transformer.h.9": 0,
"transformer.h.10": 0,
"transformer.h.11": 0,
"transformer.h.12": 0,
"transformer.h.13": 1,
"transformer.h.14": 1,
"transformer.h.15": 1,
"transformer.h.16": 1,
"transformer.h.17": 1,
"transformer.h.18": 1,
"transformer.h.19": 1,
"transformer.h.20": 1,
"transformer.h.21": 1,
"transformer.h.22": 1,
"transformer.h.23": 1,
"transformer.h.24": 1,
"transformer.h.25": 1,
"transformer.h.26": 1,
"transformer.h.27": 1,
"transformer.h.28": 2,
"transformer.h.29": 2,
"transformer.h.30": 2,
"transformer.h.31": 2,
"transformer.h.32": 2,
"transformer.h.33": 2,
"transformer.h.34": 2,
"transformer.h.35": 2,
"transformer.h.36": 2,
"transformer.h.37": 2,
"transformer.h.38": 2,
"transformer.h.39": 2,
"transformer.ln_f": 2
}
Furthermore, our current codebase has several different mechanisms that handle model balancing. Changing device_map="auto" to device_map="sequential" would be much more time consuming for us than it would be in a small-scale project. My team and I would highly appreciate it if this issue could be treated as high priority, given that it breaks the entire system, and I'm sure we won't be the only ones experiencing it.
And what is the model.hf_device_map with device_map="sequential"?
@SunMarc
For the code:
import torch
from transformers import pipeline
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"
pipe = pipeline("text-generation", model="HuggingFaceH4/starchat-beta", torch_dtype=torch.bfloat16, device_map='sequential')
# We use a variant of ChatML to format each message
prompt_template = "<|system|>\n<|end|>\n<|user|>\n{query}<|end|>\n<|assistant|>"
prompt = prompt_template.format(query="How do I sort a list in Python?")
# We use a special <|end|> token with ID 49155 to denote ends of a turn
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.2, top_k=50, top_p=0.95, eos_token_id=49155)
print(outputs)
Doing:
import json
print(json.dumps(pipe.model.hf_device_map, indent=4))
Prints:
{
"": 0
}
Let me send you the same code for comparison, but using Mixtral instead of starchat, since Mixtral is larger.
Oh, that makes sense, then: with starchat the model fits on a single GPU. Yeah, let's check with Mixtral.
@SunMarc
For the code:
import torch
from transformers import pipeline
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"
pipe = pipeline("text-generation", model="mistralai/Mixtral-8x7B-Instruct-v0.1", torch_dtype=torch.bfloat16, device_map='auto')
# We use a variant of ChatML to format each message
prompt_template = "<|system|>\n<|end|>\n<|user|>\n{query}<|end|>\n<|assistant|>"
prompt = prompt_template.format(query="How do I sort a list in Python?")
# We use a special <|end|> token with ID 49155 to denote ends of a turn
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.2, top_k=50, top_p=0.95, eos_token_id=49155)
print(outputs)
Doing:
import json
print(json.dumps(pipe.model.hf_device_map, indent=4))
Prints:
{
"model.embed_tokens": 0,
"model.layers.0": 0,
"model.layers.1": 0,
"model.layers.2": 0,
"model.layers.3": 0,
"model.layers.4": 0,
"model.layers.5": 0,
"model.layers.6": 0,
"model.layers.7": 0,
"model.layers.8": 0,
"model.layers.9": 0,
"model.layers.10": 1,
"model.layers.11": 1,
"model.layers.12": 1,
"model.layers.13": 1,
"model.layers.14": 1,
"model.layers.15": 1,
"model.layers.16": 1,
"model.layers.17": 1,
"model.layers.18": 1,
"model.layers.19": 1,
"model.layers.20": 1,
"model.layers.21": 2,
"model.layers.22": 2,
"model.layers.23": 2,
"model.layers.24": 2,
"model.layers.25": 2,
"model.layers.26": 2,
"model.layers.27": 2,
"model.layers.28": 2,
"model.layers.29": 2,
"model.layers.30": 2,
"model.layers.31": 2,
"model.norm": 2,
"lm_head": 2
}
Now, for the code:
import torch
from transformers import pipeline
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"
pipe = pipeline("text-generation", model="mistralai/Mixtral-8x7B-Instruct-v0.1", torch_dtype=torch.bfloat16, device_map='sequential')
# We use a variant of ChatML to format each message
prompt_template = "<|system|>\n<|end|>\n<|user|>\n{query}<|end|>\n<|assistant|>"
prompt = prompt_template.format(query="How do I sort a list in Python?")
# We use a special <|end|> token with ID 49155 to denote ends of a turn
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.2, top_k=50, top_p=0.95, eos_token_id=49155)
print(outputs)
Doing:
import json
print(json.dumps(pipe.model.hf_device_map, indent=4))
Prints:
{
"model.embed_tokens": 0,
"model.layers.0": 0,
"model.layers.1": 0,
"model.layers.2": 0,
"model.layers.3": 0,
"model.layers.4": 0,
"model.layers.5": 0,
"model.layers.6": 0,
"model.layers.7": 0,
"model.layers.8": 0,
"model.layers.9": 0,
"model.layers.10": 0,
"model.layers.11": 0,
"model.layers.12": 0,
"model.layers.13": 0,
"model.layers.14": 0,
"model.layers.15": 0,
"model.layers.16": 1,
"model.layers.17": 1,
"model.layers.18": 1,
"model.layers.19": 1,
"model.layers.20": 1,
"model.layers.21": 1,
"model.layers.22": 1,
"model.layers.23": 1,
"model.layers.24": 1,
"model.layers.25": 1,
"model.layers.26": 1,
"model.layers.27": 1,
"model.layers.28": 1,
"model.layers.29": 1,
"model.layers.30": 1,
"model.layers.31": 1,
"model.norm": 1,
"lm_head": 1
}
I also looked at the result of outputs. With "auto", it doesn't even finish and throws a nan/inf error. With "sequential", it behaves as expected, i.e., it answers correctly.
Yes, the above code uses the starchat template... It still works.
It is probably a communication issue with your GPUs. I see that with "sequential", only two GPUs are used. One quick way to work around this would be to run the model on only the first two GPUs by specifying CUDA_VISIBLE_DEVICES=0,1. You can also try to check in which layer the generation starts to output gibberish.
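If it helps, here is a rough sketch of how you could hook the decoder layers to see where NaN/inf values first show up. The "model.layers.N" naming assumes a Llama/Mixtral-style layout, so adapt it to the model you are testing:

import torch

# Rough sketch: hook every decoder layer and print the ones whose output
# contains NaN or inf; the first layer printed is where things start to break.
def add_nan_hooks(model):
    handles = []
    for name, module in model.named_modules():
        if name.startswith("model.layers.") and name.count(".") == 2:
            def hook(mod, inputs, output, name=name):
                out = output[0] if isinstance(output, tuple) else output
                if torch.isnan(out).any() or torch.isinf(out).any():
                    print(f"NaN/inf in output of {name} (device {out.device})")
            handles.append(module.register_forward_hook(hook))
    return handles

handles = add_nan_hooks(pipe.model)
_ = pipe(prompt, max_new_tokens=16, do_sample=False)
for h in handles:
    h.remove()

Using do_sample=False avoids the multinomial sampling error, so the forward passes still run and the hooks can fire.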
The hardware and drivers have been triple-checked by the bare-metal provider. On my 3x 3090 setup, I don't use NVLink; maybe that's the key, or maybe it's something else.
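For reference, this is the kind of quick device-to-device copy check I can run on the box; it is only a rough sanity test, not a full NVLink/NCCL diagnostic:

import torch

# Check peer access and a plain tensor copy between every pair of visible GPUs.
n = torch.cuda.device_count()
for src in range(n):
    for dst in range(n):
        if src == dst:
            continue
        peer = torch.cuda.can_device_access_peer(src, dst)
        x = torch.randn(1024, 1024, device=f"cuda:{src}")
        y = x.to(f"cuda:{dst}")
        ok = torch.equal(x.cpu(), y.cpu())
        print(f"cuda:{src} -> cuda:{dst} | peer_access={peer} | copy_ok={ok}")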
As an update, I tested both "sequential" and "auto" on Llama 3 70B in bfloat16. Both fail to run inference and throw:
Exception has occurred: RuntimeError
probability tensor contains either `inf`, `nan` or element < 0
File "/root/hwsrc/project_name/main.py", line 16, in <module>
output = pipe("Hey how are you doing today?")
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
Here are the device maps and the code that was used.
import transformers
import torch
import json
model_id = "meta-llama/Meta-Llama-3-70B"
pipe = transformers.pipeline("text-generation",
                             model=model_id,
                             model_kwargs={
                                 "torch_dtype": torch.bfloat16,
                                 "max_memory": {0: "42GiB", 1: "42GiB", 2: "42GiB", 3: "42GiB"}
                             },
                             device_map="sequential")
print(json.dumps(pipe.model.hf_device_map, indent=4))
output = pipe("Hey how are you doing today?")
print(output)
outputs:
{
"model.embed_tokens": 0,
"model.layers.0": 0,
"model.layers.1": 0,
"model.layers.2": 0,
"model.layers.3": 0,
"model.layers.4": 0,
"model.layers.5": 0,
"model.layers.6": 0,
"model.layers.7": 0,
"model.layers.8": 0,
"model.layers.9": 0,
"model.layers.10": 0,
"model.layers.11": 0,
"model.layers.12": 0,
"model.layers.13": 0,
"model.layers.14": 0,
"model.layers.15": 0,
"model.layers.16": 0,
"model.layers.17": 0,
"model.layers.18": 0,
"model.layers.19": 0,
"model.layers.20": 0,
"model.layers.21": 0,
"model.layers.22": 0,
"model.layers.23": 1,
"model.layers.24": 1,
"model.layers.25": 1,
"model.layers.26": 1,
"model.layers.27": 1,
"model.layers.28": 1,
"model.layers.29": 1,
"model.layers.30": 1,
"model.layers.31": 1,
"model.layers.32": 1,
"model.layers.33": 1,
"model.layers.34": 1,
"model.layers.35": 1,
"model.layers.36": 1,
"model.layers.37": 1,
"model.layers.38": 1,
"model.layers.39": 1,
"model.layers.40": 1,
"model.layers.41": 1,
"model.layers.42": 1,
"model.layers.43": 1,
"model.layers.44": 1,
"model.layers.45": 1,
"model.layers.46": 1,
"model.layers.47": 1,
"model.layers.48": 1,
"model.layers.49": 2,
"model.layers.50": 2,
"model.layers.51": 2,
"model.layers.52": 2,
"model.layers.53": 2,
"model.layers.54": 2,
"model.layers.55": 2,
"model.layers.56": 2,
"model.layers.57": 2,
"model.layers.58": 2,
"model.layers.59": 2,
"model.layers.60": 2,
"model.layers.61": 2,
"model.layers.62": 2,
"model.layers.63": 2,
"model.layers.64": 2,
"model.layers.65": 2,
"model.layers.66": 2,
"model.layers.67": 2,
"model.layers.68": 2,
"model.layers.69": 2,
"model.layers.70": 2,
"model.layers.71": 2,
"model.layers.72": 2,
"model.layers.73": 2,
"model.layers.74": 2,
"model.layers.75": 3,
"model.layers.76": 3,
"model.layers.77": 3,
"model.layers.78": 3,
"model.layers.79": 3,
"model.norm": 3,
"lm_head": 3
}
import transformers
import torch
import json
model_id = "meta-llama/Meta-Llama-3-70B"
pipe = transformers.pipeline("text-generation",
                             model=model_id,
                             model_kwargs={
                                 "torch_dtype": torch.bfloat16,
                                 "max_memory": {0: "42GiB", 1: "42GiB", 2: "42GiB", 3: "42GiB"}
                             },
                             device_map="auto")
print(json.dumps(pipe.model.hf_device_map, indent=4))
output = pipe("Hey how are you doing today?")
print(output)
outputs:
{
"model.embed_tokens": 0,
"model.layers.0": 0,
"model.layers.1": 0,
"model.layers.2": 0,
"model.layers.3": 0,
"model.layers.4": 0,
"model.layers.5": 0,
"model.layers.6": 0,
"model.layers.7": 0,
"model.layers.8": 0,
"model.layers.9": 0,
"model.layers.10": 0,
"model.layers.11": 0,
"model.layers.12": 0,
"model.layers.13": 0,
"model.layers.14": 0,
"model.layers.15": 0,
"model.layers.16": 0,
"model.layers.17": 0,
"model.layers.18": 0,
"model.layers.19": 1,
"model.layers.20": 1,
"model.layers.21": 1,
"model.layers.22": 1,
"model.layers.23": 1,
"model.layers.24": 1,
"model.layers.25": 1,
"model.layers.26": 1,
"model.layers.27": 1,
"model.layers.28": 1,
"model.layers.29": 1,
"model.layers.30": 1,
"model.layers.31": 1,
"model.layers.32": 1,
"model.layers.33": 1,
"model.layers.34": 1,
"model.layers.35": 1,
"model.layers.36": 1,
"model.layers.37": 1,
"model.layers.38": 1,
"model.layers.39": 1,
"model.layers.40": 2,
"model.layers.41": 2,
"model.layers.42": 2,
"model.layers.43": 2,
"model.layers.44": 2,
"model.layers.45": 2,
"model.layers.46": 2,
"model.layers.47": 2,
"model.layers.48": 2,
"model.layers.49": 2,
"model.layers.50": 2,
"model.layers.51": 2,
"model.layers.52": 2,
"model.layers.53": 2,
"model.layers.54": 2,
"model.layers.55": 2,
"model.layers.56": 2,
"model.layers.57": 2,
"model.layers.58": 2,
"model.layers.59": 2,
"model.layers.60": 2,
"model.layers.61": 3,
"model.layers.62": 3,
"model.layers.63": 3,
"model.layers.64": 3,
"model.layers.65": 3,
"model.layers.66": 3,
"model.layers.67": 3,
"model.layers.68": 3,
"model.layers.69": 3,
"model.layers.70": 3,
"model.layers.71": 3,
"model.layers.72": 3,
"model.layers.73": 3,
"model.layers.74": 3,
"model.layers.75": 3,
"model.layers.76": 3,
"model.layers.77": 3,
"model.layers.78": 3,
"model.layers.79": 3,
"model.norm": 3,
"lm_head": 3
}
@SunMarc
Would anyone care to look at this, please? Whether it's @SunMarc or someone else. I strongly suspect it's an HF compatibility issue with NVLink, but I can't say that with 100% certainty.
We would really appreciate any help on this roadblock... many thanks!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
Information
Tasks
no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
Reproduction
System specs:
This:
Prints:
[{'generated_text': '<|system|>\n<|end|>\n<|user|>\nHow do I sort a list in Python?<|end|>\n<|assistant|> to. Air1 (\nь\nInfo plit An che a\n the weьь Share Share\n aremobatar\n…We brain be S jj jj'..., … …: no J\n,…AL more of… y they code lifefl\n -- B moreand.. L\nplitahph a after\n Ishare, E I I is L\n unel not Mid' I'’ …\n\n …" you a a South strength I I S said "no\n\n\n E E11\n EASC not Sh English. of of E |isse\n as that said said of said reg of The The– n a… Open. The The for | A after After\n was M open open over in been\n\n into,onAR down :-)mad cos I you to E,( not "a001 that vis m44\n\n\n of3\n re1 T by so itack in inententancy of is int Library to U U.. a a = ==Compression Itdata66 as111110 S'}]
While this:
Prints:
[{'generated_text': '<|system|>\n<|end|>\n<|user|>\nHow do I sort a list in Python?<|end|>\n<|assistant|>\nThere are multiple ways to sort a list in Python. One of the most common ways is to use the sort() method. Here is an example:\n\n```\nmy_list = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]\nmy_list.sort()\nprint(my_list)\n```\n\nThis will sort the list in place and print the sorted list.\n\nAnother way to sort a list is to use the sorted() function. This function returns a new sorted list and does not modify the original list. Here is an example:\n\n```\nmy_list = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]\nsorted_list = sorted(my_list)\nprint(sorted_list)\n```\n\nIn this example, the sorted_list variable will contain the sorted list and the my_list variable will remain unchanged.\n\nThere are also other sorting algorithms available in the built-in sort module, such as quicksort, heapsort, and merge sort. You'}]
Expected behavior
Both should print coherent text. This happens no matter the model chosen. In the above reproduction steps, the model used is HuggingFaceH4/starchat-beta. The exact same thing happens with mistralai/Mixtral-8x7B-Instruct-v0.1, whether run in bfloat16, float16, or float32, quantized or not. The issue also occurs regardless of the prompt. The issue, however, does NOT occur when device_map="sequential" is set (tested with HuggingFaceH4/starchat-beta only).
Furthermore, the issue does NOT occur with device_map="auto" on my home 3x RTX 3090 / Threadripper 3960x setup.
However, I cannot use sequential in our current production environment without making significant changes.