Kpeacef opened this issue 1 month ago
I hit exactly the same problem when running the all-in-one benchmark of Llama-3.1 8B.
Hi, we are trying to reproduce your issue.
Hi, we've reproduced your error. We will get back to you once we find a solution.
Hi @Kpeacef @lei-sun-intel,
Based on the device RuntimeError, we modified one line of code:

```python
eos_token_mask = torch.isin(vocab_tensor, self.eos_token_id.to('xpu'))
```

at around line 288 of
`/home/arda/miniforge3/envs/llm/lib/python3.11/site-packages/transformers/generation/logits_process.py`,
so that the method becomes:
```python
@add_start_docstrings(LOGITS_PROCESSOR_INPUTS_DOCSTRING)
def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
    new_tokens_length = input_ids.shape[-1] - self.prompt_length_to_skip
    scores_processed = scores.clone()
    vocab_tensor = torch.arange(scores.shape[-1], device=scores.device)
    # Modified line: move eos_token_id onto the same device as vocab_tensor
    eos_token_mask = torch.isin(vocab_tensor, self.eos_token_id.to('xpu'))
    if new_tokens_length < self.min_new_tokens:
        scores_processed = torch.where(eos_token_mask, -math.inf, scores)
    return scores_processed
```
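Note that hard-coding `'xpu'` ties the patch to one device type. A more portable variant (our own sketch, not the exact change tested above) would follow the device of `scores`:

```python
# Untested sketch: derive the target device from the scores tensor instead of
# hard-coding 'xpu', so the same patch also works on CUDA or CPU.
eos_token_mask = torch.isin(vocab_tensor, self.eos_token_id.to(scores.device))
```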
However, we still got a new error:
```
Traceback (most recent call last):
  File "/home/arda/zijie/llama3.1/all-in-one/run.py", line 2003, in <module>
    run_model(model, api, in_out_pairs, conf['local_model_hub'], conf['warm_up'], conf['num_trials'], conf['num_beams'],
  File "/home/arda/zijie/llama3.1/all-in-one/run.py", line 152, in run_model
    result = run_transformer_int4_fp16_gpu_win(repo_id, local_model_hub, in_out_pairs, warm_up, num_trials, num_beams, low_bit, cpu_embedding, batch_size, streaming)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/arda/zijie/llama3.1/all-in-one/run.py", line 1126, in run_transformer_int4_fp16_gpu_win
    output_ids = model.generate(input_ids, do_sample=False,
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/arda/miniforge3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/arda/miniforge3/envs/llm/lib/python3.11/site-packages/ipex_llm/utils/benchmark_util.py", line 1563, in generate
    return self.greedy_search(
           ^^^^^^^^^^^^^^^^^^^
  File "/home/arda/miniforge3/envs/llm/lib/python3.11/site-packages/ipex_llm/utils/benchmark_util.py", line 2430, in greedy_search
    model_kwargs = self._update_model_kwargs_for_generation(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/arda/miniforge3/envs/llm/lib/python3.11/site-packages/ipex_llm/utils/benchmark_util.py", line 795, in _update_model_kwargs_for_generation
    return self.model._update_model_kwargs_for_generation(outputs, model_kwargs, is_encoder_decoder, standardize_cache_format)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/arda/miniforge3/envs/llm/lib/python3.11/site-packages/transformers/generation/utils.py", line 699, in _update_model_kwargs_for_generation
    model_kwargs["cache_position"] = model_kwargs["cache_position"][-1:] + num_new_tokens
                                     ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
KeyError: 'cache_position'
```
This is probably caused by an incompatibility between our benchmark_util.py and the newer version of transformers.
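For illustration, the failure mode can be guarded against along these lines (a minimal sketch with a hypothetical `_safe_update_cache_position` helper; this is not the actual ipex-llm fix):

```python
import torch

def _safe_update_cache_position(model_kwargs, input_ids, num_new_tokens=1):
    """Hypothetical guard: newer transformers assumes generate() has already
    seeded model_kwargs["cache_position"], while the older benchmark_util.py
    code path never sets it, hence the KeyError above."""
    if "cache_position" not in model_kwargs:
        # Seed positions 0..seq_len-1 for the prompt on first use.
        model_kwargs["cache_position"] = torch.arange(input_ids.shape[-1])
    model_kwargs["cache_position"] = model_kwargs["cache_position"][-1:] + num_new_tokens
    return model_kwargs
```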
@Kpeacef @lei-sun-intel We added support for Llama-3.1 in all-in-one yesterday; please update your ipex-llm and run.py to the latest versions.
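For reference, upgrading ipex-llm to a nightly build is typically done with the command below (taken from the ipex-llm docs; the extra index URL may differ by region, so treat it as an assumption):

```
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```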
@qiuxin2012, I have created another environment with 2.5.0b20240807.
Another issue appeared:

```
ValueError: `rope_scaling` must be a dictionary with two fields, `type` and `factor`, got {'factor': 8.0, 'low_freq_factor': 1.0, 'high_freq_factor': 4.0, 'original_max_position_embeddings': 8192, 'rope_type': 'llama3'}
```
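For context, transformers releases that predate Llama-3.1 validate `rope_scaling` as an exactly-two-field dictionary, so the new `llama3` rope config is rejected at load time. A simplified sketch of that older check (our paraphrase of `LlamaConfig`'s validation, not the verbatim source):

```python
# Llama-3.1 ships a five-field rope_scaling config with no "type" key...
rope_scaling = {
    "factor": 8.0,
    "low_freq_factor": 1.0,
    "high_freq_factor": 4.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3",
}

# ...which the pre-Llama-3.1 validation rejects outright:
if not isinstance(rope_scaling, dict) or len(rope_scaling) != 2:
    raise ValueError(
        "`rope_scaling` must be a dictionary with two fields, `type` and `factor`, "
        f"got {rope_scaling}"
    )
```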
The issue is resolved by upgrading transformers; please update your transformers version with `pip install --upgrade transformers`.
Tested with transformers version 4.44.0.
Hi @Kpeacef, we had already reproduced this error before. We have tested it and run it successfully on transformers version 4.43.1.
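A quick way to confirm the installed version is new enough (a small sketch using the `packaging` helper, which already appears in the pip list below):

```python
# Sanity check that transformers is new enough for Llama-3.1's rope_scaling.
from packaging import version
import transformers

assert version.parse(transformers.__version__) >= version.parse("4.43.1"), \
    f"transformers {transformers.__version__} is too old for Llama-3.1"
```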
Hi, I would like to try out Meta-Llama-3.1-8B with the all-in-one benchmark, but I am facing this issue: `RuntimeError: Expected all tensors to be on the same device, but found at least two devices, xpu:0 and cpu!`
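For context, `torch.isin` raises exactly this message when its two arguments live on different devices. A minimal standalone reproduction (the vocab size and Llama-3.1 eos token ids here are our assumptions for illustration):

```python
import torch
import intel_extension_for_pytorch  # noqa: F401  (registers the 'xpu' device)

vocab_tensor = torch.arange(128256, device='xpu')      # lives on the XPU
eos_token_id = torch.tensor([128001, 128008, 128009])  # left on the CPU
torch.isin(vocab_tensor, eos_token_id)
# -> RuntimeError: Expected all tensors to be on the same device,
#    but found at least two devices, xpu:0 and cpu!
```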
This is my pip list for your reference:

```
Package                     Version
accelerate                  0.23.0
aiohttp                     3.9.5
aiosignal                   1.3.1
annotated-types             0.7.0
antlr4-python3-runtime      4.9.3
attrs                       23.2.0
bigdl-core-xe-21            2.5.0b20240726
bigdl-core-xe-addons-21     2.5.0b20240726
bigdl-core-xe-batch-21      2.5.0b20240726
certifi                     2024.7.4
charset-normalizer          3.3.2
datasets                    2.20.0
dill                        0.3.8
docstring_parser            0.16
filelock                    3.15.4
frozenlist                  1.4.1
fsspec                      2024.5.0
huggingface-hub             0.24.2
idna                        3.7
intel-cmplr-lib-ur          2024.2.0
intel-extension-for-pytorch 2.1.10+xpu
intel-openmp                2024.2.0
ipex-llm                    2.1.0b20240726
Jinja2                      3.1.4
markdown-it-py              3.0.0
MarkupSafe                  2.1.5
mdurl                       0.1.2
mpmath                      1.3.0
multidict                   6.0.5
multiprocess                0.70.16
networkx                    3.3
numpy                       1.26.4
omegaconf                   2.3.0
packaging                   24.1
pandas                      2.2.2
pillow                      10.4.0
pip                         24.0
protobuf                    5.28.0rc1
psutil                      6.0.0
py-cpuinfo                  9.0.0
pyarrow                     17.0.0
pyarrow-hotfix              0.6
pydantic                    2.8.2
pydantic_core               2.20.1
Pygments                    2.18.0
python-dateutil             2.9.0.post0
pytz                        2024.1
PyYAML                      6.0.2rc1
regex                       2024.7.24
requests                    2.32.3
rich                        13.7.1
safetensors                 0.4.3
sentencepiece               0.2.0
setuptools                  69.5.1
shtab                       1.7.1
six                         1.16.0
sympy                       1.13.1
tabulate                    0.9.0
tokenizers                  0.19.1
torch                       2.1.0a0+cxx11.abi
torchvision                 0.16.0a0+cxx11.abi
tqdm                        4.66.4
transformers                4.43.2
trl                         0.9.6
typing_extensions           4.12.2
tyro                        0.8.5
tzdata                      2024.1
urllib3                     2.2.2
wheel                       0.43.0
xxhash                      3.4.1
yarl                        1.9.4
```