Open ladi-pomsar opened 3 months ago
This doesn't seem to be the case with a flash-attention-enabled Ada-generation GPU, so it appears to be specific to the lack of flash attention.
For anyone wondering about this: it happens because pad_token is not present in Llama's tokenizer_config.json. Something as simple as adding "pad_token": "<|eot_id|>" to the end of that JSON works.
For some reason (code branching?) this doesn't affect FA-enabled GPUs / is fixed within that branch, but it does affect setups that have to disable FA.
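For illustration, here is a minimal sketch of the same workaround applied at runtime instead of editing the file (assuming the Llama 3.1 Instruct tokenizer):
from transformers import AutoTokenizer

# Sketch only: the equivalent of adding "pad_token": "<|eot_id|>" to tokenizer_config.json.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
if tokenizer.pad_token is None:
    tokenizer.pad_token = "<|eot_id|>"  # reuse the end-of-turn special token for padding
print(tokenizer.pad_token_id)  # resolves to a valid integer id once pad_token is set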
I see the same issue running meta-llama/Llama-3.1-8B-Instruct and meta-llama/Llama-3.2-3B-Instruct on an Intel GPU with the 2.4.0-intel-xpu container:
$ docker run --rm --privileged --cap-add=sys_nice -e HF_TOKEN=xxx \
--device=/dev/dri --ipc=host --shm-size 1g --net host -v /home/dvrogozh/data:/data \
ghcr.io/huggingface/text-generation-inference:2.4.0-intel-xpu \
--model-id meta-llama/Llama-3.2-3B-Instruct --cuda-graphs 0 --port 8080
...
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 132, in Warmup
batch = self.model.batch_type.from_pb(
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/causal_lm.py", line 120, in from_pb
tokenized_inputs = tokenizer(
File "/opt/conda/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3016, in __call__
encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
File "/opt/conda/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3104, in _call_one
return self.batch_encode_plus(
File "/opt/conda/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3297, in batch_encode_plus
padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
File "/opt/conda/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2917, in _get_padding_truncation_strategies
if padding_strategy != PaddingStrategy.DO_NOT_PAD and (self.pad_token is None or self.pad_token_id < 0):
TypeError: '<' not supported between instances of 'NoneType' and 'int'
2024-11-20T21:07:58.767307Z ERROR warmup{max_input_length=4095 max_prefill_tokens=4145 max_total_tokens=4096 max_batch_size=None}:warmup: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: '<' not supported between instances of 'NoneType' and 'int'
Error: Backend(Warmup(Generation("'<' not supported between instances of 'NoneType' and 'int'")))
I believe these XPU containers should have attention, but it's a different attention implementation than CUDA's, so it might still be a branch difference that makes XPU step into this.
On XPU this is also a regression, because I see the same command working fine with the 2.3.0-intel-xpu container:
$ docker run --rm --privileged --cap-add=sys_nice -e HF_TOKEN=xxx \
--device=/dev/dri --ipc=host --shm-size 1g --net host -v /home/dvrogozh/data:/data \
ghcr.io/huggingface/text-generation-inference:2.3.0-intel-xpu \
--model-id meta-llama/Llama-3.2-3B-Instruct --cuda-graphs 0 --port 8080
@sywangyi
Further, running again on the Intel GPU, but with stock PyTorch this time. This variant definitely does not have flash attention (the setup differs from the Docker XPU runs). There is a behavior change introduced by this PR in Transformers:
Without the above commit, the original issue can be reproduced:
$ cd /path/to/transformers && git reset --hard 187439c3fa139b2102a874483e9f8f0cfa8e5557~1 && pip install -e .
$ text-generation-launcher --model-id meta-llama/Llama-3.2-3B-Instruct --cuda-graphs 0 --port 8080
...
2024-11-20T22:40:08.340828Z INFO text_generation_router_v3: backends/v3/src/lib.rs:125: Warming up model
2024-11-20T22:40:08.343788Z ERROR text_generation_launcher: Method Warmup encountered an error.
...
File "/home/dvrogozh/git/huggingface/transformers/src/transformers/tokenization_utils_base.py", line 2922, in _get_padding_truncation_strategies
if padding_strategy != PaddingStrategy.DO_NOT_PAD and (self.pad_token is None or self.pad_token_id < 0):
TypeError: '<' not supported between instances of 'NoneType' and 'int'
After the commit I pointed to, the behavior changes and TGI fails earlier, during model initialization:
$ cd /path/to/transformers && git reset --hard 187439c3fa139b2102a874483e9f8f0cfa8e5557 && pip install -e .
$ text-generation-launcher --model-id meta-llama/Llama-3.2-3B-Instruct --cuda-graphs 0 --port 8080
...
2024-11-20T22:41:02.635005Z ERROR text_generation_launcher: Error when initializing model
...
> File "/home/dvrogozh/git/huggingface/text-generation-inference/server/text_generation_server/server.py", line 268, in serve_inner
model = get_model_with_lora_adapters(
File "/home/dvrogozh/git/huggingface/text-generation-inference/server/text_generation_server/models/__init__.py", line 1336, in get_model_with_lora_adapters
model = get_model(
File "/home/dvrogozh/git/huggingface/text-generation-inference/server/text_generation_server/models/__init__.py", line 878, in get_model
return CausalLM.fallback(
File "/home/dvrogozh/git/huggingface/text-generation-inference/server/text_generation_server/models/causal_lm.py", line 634, in fallback
tokenizer.pad_token_id = model.config.eos_token_id
File "/home/dvrogozh/git/huggingface/transformers/src/transformers/tokenization_utils_base.py", line 1076, in __setattr__
raise ValueError(f"Cannot set a non-string value as the {key}")
ValueError: Cannot set a non-string value as the pad_token
Failure in the second case is here: https://github.com/dvrogozh/transformers/blob/187439c3fa139b2102a874483e9f8f0cfa8e5557/src/transformers/tokenization_utils_base.py#L1076
Printing the values involved, I see that __setattr__ was called as:
__setattr__('pad_token', ['<|end_of_text|>', '<|eom_id|>', '<|eot_id|>'])
I.e. we have a list of tokens instead of a single string value, and __setattr__ does not seem to account for this case. I wonder whether this also gives us a clue as to why we stepped into the original issue: maybe the attention path somehow handles a list of tokens correctly? @zucchini-nlp as the author of https://github.com/huggingface/transformers/pull/34461, and @ArthurZucker: maybe you can comment here on the behavior and suggest further debug steps?
https://github.com/huggingface/text-generation-inference/pull/2702 (which has been merged) is meant to fix the following issue:
2024-11-21T00:40:07.383973Z WARN text_generation_launcher: Could not import Flash Attention enabled models: No module named 'triton'
which causes XPU to fall back to causal_lm.py instead of flash_causal_lm.py.
You should use the image ghcr.io/huggingface/text-generation-inference:latest-intel-xpu to avoid this issue. meta-llama/Llama-3.2-3B-Instruct should work in the latest TGI XPU image.
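For example, the earlier command with only the image tag changed (token and data path are placeholders):
$ docker run --rm --privileged --cap-add=sys_nice -e HF_TOKEN=xxx \
  --device=/dev/dri --ipc=host --shm-size 1g --net host -v /home/dvrogozh/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest-intel-xpu \
  --model-id meta-llama/Llama-3.2-3B-Instruct --cuda-graphs 0 --port 8080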
@sywangyi : thank you for pointing this out. I missed this warning. Indeed, ghcr.io/huggingface/text-generation-inference:latest-intel-xpu works for me. This also correlates with @ladi-pomsar's assumption that this issue is specific to cases where attention is not enabled. Do you have ideas about what in the attention path makes things work?
Basically, here is a simplified script to reproduce the issue. That's what TGI is doing around https://github.com/huggingface/text-generation-inference/blob/07bed530f7eaf2419ed0e755e0f24d7afd814a46/server/text_generation_server/models/causal_lm.py#L634
Script:
from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-3.2-3B-Instruct')
model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-3.2-3B-Instruct')
print(f">>> tokenizer.pad_token_id={tokenizer.pad_token_id}")
print(f">>> model.config.pad_token_id={model.config.pad_token_id}")
print(f">>> model.config.eos_token_id={model.config.eos_token_id}")
tokenizer.pad_token_id = model.config.eos_token_id
Output:
Loading checkpoint shards: 100%|█████████████████████████████████████████████████| 2/2 [00:00<00:00, 2.03it/s]
>>> tokenizer.pad_token_id=None
>>> model.config.pad_token_id=None
>>> model.config.eos_token_id=[128001, 128008, 128009]
Traceback (most recent call last):
File "/home/dvrogozh/tmp/e.py", line 11, in <module>
tokenizer.pad_token_id = model.config.eos_token_id
File "/home/dvrogozh/git/huggingface/transformers/src/transformers/tokenization_utils_base.py", line 1076, in __setattr__
raise ValueError(f"Cannot set a non-string value as the {key}")
ValueError: Cannot set a non-string value as the pad_token
Unfortunately, my knowledge of Transformers is not enough to say what's wrong and where.
Hm. I like this place in the attention path (which works): https://github.com/huggingface/text-generation-inference/blob/ab7ccf5bc3c84e07d0faf0d950421fcdc29743b5/server/text_generation_server/models/flash_causal_lm.py#L1261-L1263
This was introduced by the following PR:
@Narsil : do you recall the details of the tokenizer._eos_token_ids hack? Can you suggest how to resolve the issue we observe on the non-attention path in TGI, which can be reproduced with the simple Transformers script in my last comment above?
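For reference, my rough reading of the flash_causal_lm.py lines linked above is the following pattern (a hedged paraphrase, not the verbatim TGI source), given a model and tokenizer as in the reproducer above: the eos ids are kept on the tokenizer as a set and pad_token is never assigned a list, so __setattr__ never sees a non-string value.
# Hedged paraphrase of the pattern in the linked flash_causal_lm.py lines, not verbatim TGI code.
eos = model.config.eos_token_id
if isinstance(eos, (list, set)):
    tokenizer._eos_token_ids = set(eos)  # keep every eos id for the stopping criteria
else:
    tokenizer._eos_token_ids = None      # single eos id, nothing special to track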
I have filed an issue/question on the Transformers side:
Basically, here is a simplified script to reproduce the issue. That's what TGI is doing around
Cool, thanks for the reproducer. I will check it out and comment under the Transformers issue.
@sywangyi : thank you for pointing this out. I missed this warning. Indeed, ghcr.io/huggingface/text-generation-inference:latest-intel-xpu works for me. This also correlates with @ladi-pomsar's assumption that this issue is specific to cases where attention is not enabled. Do you have ideas about what in the attention path makes things work?
I didn't post a follow-up, but if you disable flash attention on newer NVIDIA generations through the TGI env variable USE_FLASH_ATTENTION=False, you can reproduce it there as well.
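For instance, something along the lines of the launcher invocation used earlier in this thread, just with the variable set:
$ USE_FLASH_ATTENTION=False text-generation-launcher --model-id meta-llama/Llama-3.1-8B-Instruct --cuda-graphs 0 --port 8080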
@zucchini-nlp : thank you for feedback. I've posted #2774 with the proposed fix.
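For reference, here is a minimal sketch of the kind of guard that makes the standalone reproducer above pass; the actual change in #2774 may well differ in details:
from transformers import AutoConfig, AutoTokenizer

# Hedged sketch, not necessarily the #2774 patch: fall back to a single eos id
# when the model config stores a list of ids (as Llama 3.x does).
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
config = AutoConfig.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

eos = config.eos_token_id
if isinstance(eos, (list, tuple)):
    eos = eos[0]  # any single eos id is enough for padding purposes
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = eos

print(tokenizer.pad_token_id)  # an int now, so the warmup padding path no longer crashes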
if you disable flash attention on newer NVIDIA generations through the TGI env variable USE_FLASH_ATTENTION=False, you are able to reproduce it there as well.
@zucchini-nlp : indeed. After #2774 this case starts to work on NVIDIA as well.
System Info
Hi everyone, when trying to update from Llama 3 8B Instruct to Llama 3.1 8B Instruct, I noticed a crash:
Deployment mode: Docker compose
Container settings:
OS: Ubuntu 22.04.4 LTS
Rust version: N/A
Container version: sha256:b49037cef8d0c61ec022d4d7c5baad22357e34bce7970148a457a11f8f8d7e36
Model being used: meta-llama/Meta-Llama-3.1-8B-Instruct
GPUs: 2x Volta V100 (hence flash attention is disabled)
Information
Tasks
Reproduction
Expected behavior
Llama 3.1 8B Instruct should work