intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc
Apache License 2.0

Issue tracking for running bigdl-llm on cpu/xpu with python3.10/3.11/3.12 #9270

Open liu-shaojun opened 1 year ago

liu-shaojun commented 1 year ago

Issue 1 on xpu with python 3.10 [Fixed after releasing bigdl-core-xe and bigdl-core-xe-esimd for python 3.10]

On Arc14, I followed https://github.com/intel-analytics/BigDL/blob/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama2/README.md to run the llama2 example with BigDL-LLM on Intel GPUs:

conda create -n llm-py310 python=3.10
conda activate llm-py310
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu

bigdl-llm==2.4.0b20230810 is installed, but it is odd that bigdl-core-xe and bigdl-core-xe-esimd are missing when python=3.10.

(llm-py310) arda@arda-arc14:~/shaojun/BigDL/python/llm$ pip list
Package                     Version
--------------------------- ------------------
accelerate                  0.24.0
bigdl-llm                   2.4.0b20230810
certifi                     2023.7.22
transformers                4.34.1
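
Once the bigdl-core-xe and bigdl-core-xe-esimd wheels for Python 3.10 are published, a quick sanity check that the [xpu] extra pulled them in (assuming the package names stay the same) would be:

pip list | grep bigdl-core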

Then set the environment variables and run generate.py:

source /opt/intel/oneapi/setvars.sh
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
git clone https://github.com/intel-analytics/BigDL.git
cd BigDL/python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama2
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH

I got the following error:

(llm-py310) arda@arda-arc14:~/shaojun/BigDL/python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama2$  python ./generate.py --repo-id-or-model-path /home/arda/wangzhengjin/models/Llama-2-7b-chat-hf-bigdl
/home/arda/anaconda3/envs/llm-py310/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 34.53it/s]
/home/arda/anaconda3/envs/llm-py310/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:362: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
/home/arda/anaconda3/envs/llm-py310/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:367: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
Traceback (most recent call last):
  File "/home/arda/shaojun/BigDL/python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama2/./generate.py", line 66, in <module>
    tokenizer = LlamaTokenizer.from_pretrained(model_path, trust_remote_code=True)
  File "/home/arda/anaconda3/envs/llm-py310/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2017, in from_pretrained
    return cls._from_pretrained(
  File "/home/arda/anaconda3/envs/llm-py310/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2249, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/arda/anaconda3/envs/llm-py310/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 141, in __init__
    self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
  File "/home/arda/anaconda3/envs/llm-py310/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 171, in get_spm_processor
    model_pb2 = import_protobuf(f"The new behaviour of {self.__class__.__name__} (with `self.legacy = False`)")
  File "/home/arda/anaconda3/envs/llm-py310/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 43, in import_protobuf
    raise ImportError(PROTOBUF_IMPORT_ERROR.format(error_message))
ImportError:
The new behaviour of LlamaTokenizer (with `self.legacy = False`) requires the protobuf library but it was not found in your environment. Checkout the instructions on the
installation page of its repo: https://github.com/protocolbuffers/protobuf/tree/master/python#installation and follow the ones
that match your environment. Please note that you may need to restart your runtime after installation.
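
The first failure is just a missing dependency; installing it is enough to get past the tokenizer load:

pip install protobuf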

After installing protobuf and re-running python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH, I got the following error:

(llm-py310) arda@arda-arc14:~/shaojun/BigDL/python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama2$ python ./generate.py --repo-id-or-model-path /home/arda/wangzhengjin/models/Llama-2-7b-chat-hf-bigdl
/home/arda/anaconda3/envs/llm-py310/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 34.94it/s]
/home/arda/anaconda3/envs/llm-py310/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:362: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
/home/arda/anaconda3/envs/llm-py310/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:367: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
/home/arda/anaconda3/envs/llm-py310/lib/python3.10/site-packages/transformers/generation/utils.py:1421: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use and modify the model generation configuration (see https://huggingface.co/docs/transformers/generation_strategies#default-text-generation-configuration )
  warnings.warn(
Traceback (most recent call last):
  File "/home/arda/shaojun/BigDL/python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama2/./generate.py", line 73, in <module>
    output = model.generate(input_ids,
  File "/home/arda/anaconda3/envs/llm-py310/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/arda/anaconda3/envs/llm-py310/lib/python3.10/site-packages/transformers/generation/utils.py", line 1606, in generate
    return self.greedy_search(
  File "/home/arda/anaconda3/envs/llm-py310/lib/python3.10/site-packages/transformers/generation/utils.py", line 2454, in greedy_search
    outputs = self(
  File "/home/arda/anaconda3/envs/llm-py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/arda/anaconda3/envs/llm-py310/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1038, in forward
    outputs = self.model(
  File "/home/arda/anaconda3/envs/llm-py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/arda/anaconda3/envs/llm-py310/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 925, in forward
    layer_outputs = decoder_layer(
  File "/home/arda/anaconda3/envs/llm-py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/arda/anaconda3/envs/llm-py310/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 635, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/arda/anaconda3/envs/llm-py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
TypeError: llama_attention_forward_4_31() got an unexpected keyword argument 'padding_mask'

There is a transformers issue (https://github.com/huggingface/transformers/issues/26755) related to this error; should we pin a specific transformers version?

Solution: https://github.com/intel-analytics/llm.cpp/pull/128
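
Until that fix is released, a possible interim workaround (assumption: the BigDL-LLM attention override was written against transformers 4.31, as the function name llama_attention_forward_4_31 suggests, and newer transformers releases pass the extra padding_mask keyword) would be to pin the older transformers version:

pip install transformers==4.31.0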

Jasonzzt commented 1 year ago

Issue 1 on cpu with python 3.10

On spr-01, I followed the all-in-one benchmark to run bigdl-llm on Intel CPU with Python 3.10:

conda create --name ziteng-310 python=3.10
conda activate ziteng-310
pip install omegaconf
pip install pandas
pip install --pre --upgrade bigdl-llm[all]
pip install bigdl-nano[pytorch]
source bigdl-nano-init

Running the above commands produced the following conda environment configuration:

python                      3.10.13
bigdl-llm                   2.4.0b20231024
bigdl-nano                  2.3.0

I tested chatglm2-6b, Llama-2-7b-chat-hf, Baichuan2-7B-Chat, Llama-2-13b-chat-hf and mpt-7b-chat. All of them run fine with bigdl-llm on CPU with Python 3.10 except chatglm2-6b, which fails with the following problem:

Traceback (most recent call last):
  File "/root/ziteng/BigDL/python/llm/dev/benchmark/all-in-one/./run.py", line 552, in <module>
    run_model(model, api, conf['in_out_pairs'], conf['local_model_hub'], conf['warm_up'], conf['num_trials'], conf['num_beams'], conf['low_bit'])
  File "/root/ziteng/BigDL/python/llm/dev/benchmark/all-in-one/./run.py", line 45, in run_model
    result = run_transformer_int4(repo_id, local_model_hub, in_out_pairs, warm_up, num_trials, num_beams, low_bit)
  File "/root/ziteng/BigDL/python/llm/dev/benchmark/all-in-one/./run.py", line 144, in run_transformer_int4
    model = AutoModelForCausalLM.from_pretrained(model_path, load_in_low_bit=low_bit, trust_remote_code=True,
  File "/root/anaconda3/envs/ziteng-310/lib/python3.10/site-packages/bigdl/llm/transformers/model.py", line 97, in from_pretrained
    model = cls.load_convert(q_k, optimize_model, *args, **kwargs)
  File "/root/anaconda3/envs/ziteng-310/lib/python3.10/site-packages/bigdl/llm/transformers/model.py", line 120, in load_convert
    model = cls.HF_Model.from_pretrained(*args, **kwargs)
  File "/root/anaconda3/envs/ziteng-310/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 496, in from_pretrained
    raise ValueError(
ValueError: Unrecognized configuration class <class 'transformers_modules.chatglm2-6b.configuration_chatglm.ChatGLMConfig'> for this kind of AutoModel: AutoModelForCausalLM.
Model type should be one of BartConfig, BertConfig, BertGenerationConfig, BigBirdConfig, BigBirdPegasusConfig, BioGptConfig, BlenderbotConfig, BlenderbotSmallConfig, BloomConfig, CamembertConfig, CodeGenConfig, CpmAntConfig, CTRLConfig, Data2VecTextConfig, ElectraConfig, ErnieConfig, FalconConfig, GitConfig, GPT2Config, GPT2Config, GPTBigCodeConfig, GPTNeoConfig, GPTNeoXConfig, GPTNeoXJapaneseConfig, GPTJConfig, LlamaConfig, MarianConfig, MBartConfig, MegaConfig, MegatronBertConfig, MusicgenConfig, MvpConfig, OpenLlamaConfig, OpenAIGPTConfig, OPTConfig, PegasusConfig, PLBartConfig, ProphetNetConfig, QDQBertConfig, ReformerConfig, RemBertConfig, RobertaConfig, RobertaPreLayerNormConfig, RoCBertConfig, RoFormerConfig, RwkvConfig, Speech2Text2Config, TransfoXLConfig, TrOCRConfig, XGLMConfig, XLMConfig, XLMProphetNetConfig, XLMRobertaConfig, XLMRobertaXLConfig, XLNetConfig, XmodConfig.

Python 3.9 hits the same issue, so it seems to be a problem with the model itself rather than with the Python version. A related report was found in the ChatGLM repo: https://github.com/THUDM/ChatGLM-6B/issues/37#issuecomment-1704036282. The HuggingFace Trainer tends to save only the model rather than both the model and the tokenizer.

Aside from this, there is no other problem running bigdl-llm on CPU with Python 3.10.

Jasonzzt commented 1 year ago

It runs OK when I load THUDM/chatglm2-6b from HuggingFace. For a local model, please refer to https://github.com/THUDM/ChatGLM-6B/issues/37#issuecomment-1704036282
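
For a locally saved checkpoint, the workaround presumably amounts to copying the auxiliary files that the Trainer does not save from a pristine THUDM/chatglm2-6b download back into the local model directory; a rough sketch (the file list and paths below are illustrative, not verified):

cp /path/to/chatglm2-6b-hub/config.json \
   /path/to/chatglm2-6b-hub/configuration_chatglm.py \
   /path/to/chatglm2-6b-hub/modeling_chatglm.py \
   /path/to/chatglm2-6b-hub/tokenization_chatglm.py \
   /path/to/chatglm2-6b-hub/tokenizer_config.json \
   /path/to/chatglm2-6b-hub/tokenizer.model \
   /path/to/local-chatglm2-6b/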

liu-shaojun commented 1 year ago

Issue 2: mpt-7b-chat on xpu with python 3.10 [Fixed by unsetting SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS]

On Arc05, I followed https://github.com/intel-analytics/BigDL/blob/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mpt/README.md to run mpt-7b-chat with BigDL-LLM on Intel GPUs and got the following error:

(llm-py310) arda@arda-arc05:~/shaojun/BigDL/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mpt$ python ./generate.py --repo-id-or-model-path /mnt/disk1/models/mpt-7b-chat/
/opt/anaconda3/envs/llm-py310/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
Instantiating an MPTForCausalLM model from /home/arda/.cache/huggingface/modules/transformers_modules/modeling_mpt.py
You are using config.init_device='cpu', but you can also use config.init_device="meta" with Composer + FSDP for fast initialization.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.67s/it]
2023-10-30 07:27:39,352 - bigdl.llm.transformers.utils - INFO - Converting the current model to sym_int4 format......
/opt/anaconda3/envs/llm-py310/lib/python3.10/site-packages/transformers/generation/utils.py:1421: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use and modify the model generation configuration (see https://huggingface.co/docs/transformers/generation_strategies#default-text-generation-configuration )
  warnings.warn(
Traceback (most recent call last):
  File "/home/arda/shaojun/BigDL/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mpt/./generate.py", line 80, in <module>
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)
  File "/opt/anaconda3/envs/llm-py310/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3754, in decode
    return self._decode(
  File "/opt/anaconda3/envs/llm-py310/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 593, in _decode
    text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
OverflowError: out of range integral type conversion attempted

Python 3.9 also hits this issue. Solution: https://github.com/analytics-zoo/nano/issues/661
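
As noted in the heading above, the immediate workaround is to drop the immediate-command-lists setting before rerunning the example:

unset SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS
python ./generate.py --repo-id-or-model-path /mnt/disk1/models/mpt-7b-chat/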

liu-shaojun commented 1 year ago

The following models have been tested with Python 3.10 on Arc05/xpu; the output is as expected, as described in https://github.com/intel-analytics/BigDL/blob/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/:

Llama-2-7b-chat-hf
meta-llama/Llama-2-13b-chat-hf
BAAI/AquilaChat-7B
baichuan-inc/Baichuan-13B-Chat
baichuan-inc/Baichuan2-7B-Chat
THUDM/chatglm2-6b
LinkSoul/Chinese-Llama-2-7b
databricks/dolly-v1-6b
google/flan-t5-xxl
mistralai/Mistral-7B-Instruct-v0.1
internlm/internlm-chat-7b-8k

liu-shaojun commented 1 year ago

The following models have been tested with Python 3.11 on Arc13/xpu; the output is as expected, as described in https://github.com/intel-analytics/BigDL/blob/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/:

baichuan-inc/Baichuan2-13B-Chat
meta-llama/Llama-2-7b-chat-hf
meta-llama/Llama-2-13b-chat-hf
Qwen/Qwen-7B-Chat
Qwen/Qwen-14B-Chat
Qwen/Qwen-VL-Chat
databricks/dolly-v1-6b
databricks/dolly-v2-12b
databricks/dolly-v2-7b
internlm/internlm-chat-20b
internlm/internlm-chat-7b-8k
THUDM/chatglm2-6b

liu-shaojun commented 1 year ago

Currently, Python 3.12 is not supported by bigdl-llm, since the dependency intel_extension_for_pytorch==2.0.110+xpu only supports up to Python 3.11.
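
Until an intel_extension_for_pytorch xpu wheel for Python 3.12 is available, the environment needs to be created with Python 3.11 or lower, for example:

conda create -n llm-py311 python=3.11
conda activate llm-py311
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu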

Zephyr596 commented 1 year ago

The following models have been tested on CPU with Python 3.11.

Set the following parameters according to the actual specs of the test machine:

numactl -C 0-47 -m 0 python $(dirname "$0")/run.py
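
The core list (-C) and memory node (-m) should match one socket of the test machine; they can be read from the system, for example:

lscpu | grep -E 'Socket|Core|NUMA'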