intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.

All-in-one Meta-Llama-3.1-8B RuntimeError: Expected all tensors to be on the same device, but found at least two devices, xpu:0 and cpu! #11681

Open Kpeacef opened 1 month ago

Kpeacef commented 1 month ago

Hi, I would like to try out Meta-Llama-3.1-8B with the all-in-one benchmark, but I am facing this issue: "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, xpu:0 and cpu!"

This is my pip list for your reference:

    Package                      Version
    ---------------------------- ------------------
    accelerate                   0.23.0
    aiohttp                      3.9.5
    aiosignal                    1.3.1
    annotated-types              0.7.0
    antlr4-python3-runtime       4.9.3
    attrs                        23.2.0
    bigdl-core-xe-21             2.5.0b20240726
    bigdl-core-xe-addons-21      2.5.0b20240726
    bigdl-core-xe-batch-21       2.5.0b20240726
    certifi                      2024.7.4
    charset-normalizer           3.3.2
    datasets                     2.20.0
    dill                         0.3.8
    docstring_parser             0.16
    filelock                     3.15.4
    frozenlist                   1.4.1
    fsspec                       2024.5.0
    huggingface-hub              0.24.2
    idna                         3.7
    intel-cmplr-lib-ur           2024.2.0
    intel-extension-for-pytorch  2.1.10+xpu
    intel-openmp                 2024.2.0
    ipex-llm                     2.1.0b20240726
    Jinja2                       3.1.4
    markdown-it-py               3.0.0
    MarkupSafe                   2.1.5
    mdurl                        0.1.2
    mpmath                       1.3.0
    multidict                    6.0.5
    multiprocess                 0.70.16
    networkx                     3.3
    numpy                        1.26.4
    omegaconf                    2.3.0
    packaging                    24.1
    pandas                       2.2.2
    pillow                       10.4.0
    pip                          24.0
    protobuf                     5.28.0rc1
    psutil                       6.0.0
    py-cpuinfo                   9.0.0
    pyarrow                      17.0.0
    pyarrow-hotfix               0.6
    pydantic                     2.8.2
    pydantic_core                2.20.1
    Pygments                     2.18.0
    python-dateutil              2.9.0.post0
    pytz                         2024.1
    PyYAML                       6.0.2rc1
    regex                        2024.7.24
    requests                     2.32.3
    rich                         13.7.1
    safetensors                  0.4.3
    sentencepiece                0.2.0
    setuptools                   69.5.1
    shtab                        1.7.1
    six                          1.16.0
    sympy                        1.13.1
    tabulate                     0.9.0
    tokenizers                   0.19.1
    torch                        2.1.0a0+cxx11.abi
    torchvision                  0.16.0a0+cxx11.abi
    tqdm                         4.66.4
    transformers                 4.43.2
    trl                          0.9.6
    typing_extensions            4.12.2
    tyro                         0.8.5
    tzdata                       2024.1
    urllib3                      2.2.2
    wheel                        0.43.0
    xxhash                       3.4.1
    yarl                         1.9.4

lei-sun-intel commented 1 month ago

I met exactly the same problem when running the all-in-one benchmark with Llama-3.1-8B.

lzivan commented 1 month ago

Hi, we are trying to reproduce your issue.

lzivan commented 1 month ago

Hi, we've already reproduced your error. Will get back to you once we find a solution.

lzivan commented 1 month ago

Hi @Kpeacef @lei-sun-intel ,

To address the device RuntimeError, we modified one line of code:

eos_token_mask = torch.isin(vocab_tensor, self.eos_token_id.to('xpu'))

at around line 288 in

/home/arda/miniforge3/envs/llm/lib/python3.11/site-packages/transformers/generation/logits_process.py

So it would be:

    @add_start_docstrings(LOGITS_PROCESSOR_INPUTS_DOCSTRING)
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        new_tokens_length = input_ids.shape[-1] - self.prompt_length_to_skip
        scores_processed = scores.clone()
        vocab_tensor = torch.arange(scores.shape[-1], device=scores.device)
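        # Changed line: self.eos_token_id is created on the CPU here, so move it to the XPU to match vocab_tensor's device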
        eos_token_mask = torch.isin(vocab_tensor, self.eos_token_id.to('xpu'))
        if new_tokens_length < self.min_new_tokens:
            scores_processed = torch.where(eos_token_mask, -math.inf, scores)

        return scores_processed
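
Note: hard-coding 'xpu' is enough for this benchmark run, but a more device-agnostic variant of the same change would be self.eos_token_id.to(scores.device), which keeps the line working on CPU or CUDA runs as well.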

However, we still got a new error:

Traceback (most recent call last):
  File "/home/arda/zijie/llama3.1/all-in-one/run.py", line 2003, in <module>
    run_model(model, api, in_out_pairs, conf['local_model_hub'], conf['warm_up'], conf['num_trials'], conf['num_beams'],
  File "/home/arda/zijie/llama3.1/all-in-one/run.py", line 152, in run_model
    result = run_transformer_int4_fp16_gpu_win(repo_id, local_model_hub, in_out_pairs, warm_up, num_trials, num_beams, low_bit, cpu_embedding, batch_size, streaming)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/arda/zijie/llama3.1/all-in-one/run.py", line 1126, in run_transformer_int4_fp16_gpu_win
    output_ids = model.generate(input_ids, do_sample=False,
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/arda/miniforge3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/arda/miniforge3/envs/llm/lib/python3.11/site-packages/ipex_llm/utils/benchmark_util.py", line 1563, in generate
    return self.greedy_search(
           ^^^^^^^^^^^^^^^^^^^
  File "/home/arda/miniforge3/envs/llm/lib/python3.11/site-packages/ipex_llm/utils/benchmark_util.py", line 2430, in greedy_search
    model_kwargs = self._update_model_kwargs_for_generation(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/arda/miniforge3/envs/llm/lib/python3.11/site-packages/ipex_llm/utils/benchmark_util.py", line 795, in _update_model_kwargs_for_generation
    return self.model._update_model_kwargs_for_generation(outputs, model_kwargs, is_encoder_decoder, standardize_cache_format)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/arda/miniforge3/envs/llm/lib/python3.11/site-packages/transformers/generation/utils.py", line 699, in _update_model_kwargs_for_generation
    model_kwargs["cache_position"] = model_kwargs["cache_position"][-1:] + num_new_tokens
                                     ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
KeyError: 'cache_position'

This is probably caused by an incompatibility between our benchmark_util.py and the newer version of transformers.
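
To illustrate the mismatch, here is a minimal sketch (not the actual ipex-llm code; the names follow the transformers 4.43 generation utilities):

    import torch

    # Newer transformers seeds "cache_position" in model_kwargs before the decoding
    # loop and advances it after every generated token, roughly like this:
    input_ids = torch.tensor([[1, 2, 3, 4]])
    model_kwargs = {"use_cache": True,
                    "cache_position": torch.arange(input_ids.shape[1])}

    num_new_tokens = 1
    # The failing line in _update_model_kwargs_for_generation assumes the key exists:
    model_kwargs["cache_position"] = model_kwargs["cache_position"][-1:] + num_new_tokens

    # The older generation loop bundled in ipex_llm's benchmark_util.py never seeds
    # "cache_position", so the same update raises KeyError: 'cache_position'.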

qiuxin2012 commented 1 month ago

@Kpeacef @lei-sun-intel We added support for Llama-3.1 to all-in-one yesterday; please update your ipex-llm and run.py to the latest version.
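
(For reference, the nightly ipex-llm XPU build can usually be updated with something along the lines of pip install --pre --upgrade ipex-llm[xpu]; check the ipex-llm installation guide for the exact command and extra index URL for your setup. The latest run.py lives in the all-in-one benchmark folder of this repository.)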

Kpeacef commented 1 month ago

@qiuxin2012 , I have created another environment with 2.5.0b20240807.

I then hit another issue: ValueError: rope_scaling must be a dictionary with two fields, type and factor, got {'factor': 8.0, 'low_freq_factor': 1.0, 'high_freq_factor': 4.0, 'original_max_position_embeddings': 8192, 'rope_type': 'llama3'}

This issue is resolved by upgrading transformers; please update your transformers version with "pip install --upgrade transformers".

Tested with transformers version 4.44.0.

lzivan commented 1 month ago

Hi @Kpeacef, we had already reproduced this error before. We tested it and successfully ran it with transformers version 4.43.1.