intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

RWKV model: loading INT4 model fails #10161

Closed · juan-OY closed this issue 3 months ago

juan-OY commented 6 months ago

OS: Ubuntu 22.04 (Linux)

1. Convert the RWKV model to an INT4 model and save it:

   ```python
   model = AutoModelForCausalLM.from_pretrained(model_path,
                                                load_in_4bit=True,
                                                optimize_model=True,
                                                trust_remote_code=True)
   model = model.to('xpu')
   model = BenchmarkWrapper(model, do_print=True)

   # Load tokenizer
   tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

   save_path = "./rwkv-4-world-7b-int4/"
   model.save_low_bit(save_path)
   tokenizer.save_pretrained(save_path)
   print(f"Model and tokenizer are saved to {save_path}")
   ```

2. Load the converted INT4 model; it fails with the error below:

   ```
   (RWKV-py310) a770@RPLP-A770:~/ouyang/rwkv/models$ python generate_rwkv4_7b.py
   /home/a770/.local/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''
   If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
     warn(
   2024-02-19 11:21:33,665 - INFO - intel_extension_for_pytorch auto imported
   **** loading rwkv-4-world-7b-int4
   2024-02-19 11:21:33,731 - INFO - Converting the current model to sym_int4 format......
   <class 'transformers.models.rwkv.modeling_rwkv.RwkvForCausalLM'>
   Can not read the prompt file, please check the file path.
   2024-02-19 11:21:36,422 - WARNING - The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
   2024-02-19 11:21:36,422 - WARNING - Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
   Traceback (most recent call last):
     File "/home/a770/ouyang/rwkv/models/generate_rwkv4_7b.py", line 91, in <module>
       output = model.generate(input_ids,
     File "/home/a770/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
       return func(*args, **kwargs)
     File "/home/a770/ouyang/rwkv/models/benchmark_util.py", line 1563, in generate
       return self.greedy_search(
     File "/home/a770/ouyang/rwkv/models/benchmark_util.py", line 2385, in greedy_search
       outputs = self(
     File "/home/a770/ouyang/rwkv/models/benchmark_util.py", line 533, in __call__
       return self.model(*args, **kwargs)
     File "/home/a770/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
       return self._call_impl(*args, **kwargs)
     File "/home/a770/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
       return forward_call(*args, **kwargs)
     File "/home/a770/miniconda3/envs/RWKV-py310/lib/python3.10/site-packages/transformers/models/rwkv/modeling_rwkv.py", line 791, in forward
       rwkv_outputs = self.rwkv(
     File "/home/a770/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
       return self._call_impl(*args, **kwargs)
     File "/home/a770/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
       return forward_call(*args, **kwargs)
     File "/home/a770/miniconda3/envs/RWKV-py310/lib/python3.10/site-packages/transformers/models/rwkv/modeling_rwkv.py", line 642, in forward
       self._rescale_layers()
     File "/home/a770/miniconda3/envs/RWKV-py310/lib/python3.10/site-packages/transformers/models/rwkv/modeling_rwkv.py", line 721, in _rescale_layers
       block.attention.output.weight.div_(2 ** int(block_id // self.config.rescale_every))
   RuntimeError: result type Float can't be cast to the desired output type Byte
   ```
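(For context, the final error is generic PyTorch behavior rather than anything RWKV-specific: the error message shows the INT4-quantized weight is stored as a byte tensor, and transformers' layer rescaling applies an in-place floating-point division to it. A minimal standalone repro of the same error type, not taken from the issue:)

```python
import torch

# Quantized weights are kept as integer (byte) storage; true division promotes
# the result to float, which cannot be written back into the byte tensor in place.
w = torch.zeros(4, 4, dtype=torch.uint8)
w.div_(2)  # RuntimeError: result type Float can't be cast to the desired output type Byte
```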

juan-OY commented 6 months ago

Also found that the RWKV model loads much more slowly at runtime with 2.5.0b20240213 than with 2.5.0b20240204: about 4 min with 2.5.0b20240213 versus about 1 min with 2.5.0b20240204.

leonardozcm commented 6 months ago

The loading failure has been fixed in the attached PR.

> Also found that 2.5.0b20240213 rwkv model loading at runtime is much slower than 2.5.0b20240204, about 4 min with 2.5.0b20240213 and 1 min with 2.5.0b20240204

Can't reproduce this. My bigdl version is 2.5.0b20240218. On my desktop, load_low_bit takes only 1.5 s and from_pretrained takes 10.26 s, and the times stay the same when I downgrade bigdl-llm to 2.5.0b20240204.
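(For anyone who wants to compare the two loading paths themselves, a rough timing sketch; model_path and save_path are placeholders, and the bigdl.llm.transformers import path is assumed to match the bigdl-llm releases discussed in this thread:)

```python
import time
from bigdl.llm.transformers import AutoModelForCausalLM  # bigdl-llm import path (assumed)

model_path = "path/to/rwkv-4-world-7b"   # original HF checkpoint (placeholder)
save_path = "./rwkv-4-world-7b-int4/"    # output of save_low_bit() from step 1

def timed(label, fn):
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.2f} s")
    return result

# Conversion path: load the FP16 weights and quantize to sym_int4 on the fly.
model = timed("from_pretrained", lambda: AutoModelForCausalLM.from_pretrained(
    model_path, load_in_4bit=True, optimize_model=True, trust_remote_code=True))

# Fast path: reload weights that were already saved with save_low_bit().
model = timed("load_low_bit", lambda: AutoModelForCausalLM.load_low_bit(
    save_path, optimize_model=True, trust_remote_code=True))
```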

leonardozcm commented 6 months ago

Will fix in https://github.com/intel-analytics/BigDL/pull/10179

juan-OY commented 6 months ago

The RWKV5 issue still exists with bigdl version 2.5.0b20240221:

```
2024-02-21 22:17:22,445 - INFO - Converting the current model to sym_int4 format......
<class 'transformers_modules.modeling_rwkv5.Rwkv5ForCausalLM'>
Can not read the prompt file, please check the file path.
Traceback (most recent call last):
  File "/home/a770/ouyang/rwkv/models/generate_rwkv5.py", line 96, in <module>
    output = model.generate(input_ids,
  File "/home/a770/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/a770/ouyang/rwkv/models/benchmark_util.py", line 1613, in generate
    return self.sample(
  File "/home/a770/ouyang/rwkv/models/benchmark_util.py", line 2697, in sample
    outputs = self(
  File "/home/a770/ouyang/rwkv/models/benchmark_util.py", line 533, in __call__
    return self.model(*args, **kwargs)
  File "/home/a770/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/a770/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/a770/.cache/huggingface/modules/transformers_modules/modeling_rwkv5.py", line 820, in forward
    rwkv_outputs = self.rwkv(
  File "/home/a770/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/a770/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/a770/.cache/huggingface/modules/transformers_modules/modeling_rwkv5.py", line 708, in forward
    hidden_states, state, attentions = block(
  File "/home/a770/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/a770/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/a770/.cache/huggingface/modules/transformers_modules/modeling_rwkv5.py", line 417, in forward
    attention, state = self.attention(self.ln1(hidden), state=state, use_cache=use_cache, seq_mode=seq_mode)
  File "/home/a770/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/a770/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/a770/.cache/huggingface/modules/transformers_modules/modeling_rwkv5.py", line 331, in forward
    rwkv, layer_state = rwkv_linear_attention(
  File "/home/a770/.cache/huggingface/modules/transformers_modules/modeling_rwkv5.py", line 232, in rwkv_linear_attention
    return rwkv_linear_attention_v5_cpu(
  File "/home/a770/.cache/huggingface/modules/transformers_modules/modeling_rwkv5.py", line 204, in rwkv_linear_attention_v5_cpu
    out = out @ ow
RuntimeError: mat1 and mat2 shapes cannot be multiplied (50x4096 and 8912896x1)
```
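(A side observation, not stated in the thread: the 8912896x1 operand looks like the raw one-dimensional sym_int4 storage of the 4096x4096 output projection reaching the CPU linear-attention fallback without being dequantized. The size lines up under the assumed layout of 4-bit packing plus one fp16 scale per 64-weight block:)

```python
hidden = 4096
packed_int4 = hidden * hidden // 2          # two 4-bit weights per byte
fp16_scales = (hidden * hidden // 64) * 2   # one 2-byte scale per 64-weight block (assumed layout)
print(packed_int4 + fp16_scales)            # 8912896 -- matches mat2 in the error above
```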

juan-OY commented 6 months ago

The error below on RWKV5 is fixed in the latest release 2.5.0b20240221:

```
out = out @ ow
RuntimeError: mat1 and mat2 shapes cannot be multiplied (50x4096 and 8912896x1)
```

The correct way to load is as below; it fails if optimize_model=False:

```python
model = AutoModelForCausalLM.load_low_bit(model_path, trust_remote_code=True, optimize_model=True)
```
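(For completeness, a minimal load-and-generate sketch built around that call. This is a sketch only: the prompt, paths, and generation settings are placeholders, and the bigdl.llm.transformers import path is assumed from the bigdl-llm releases used in this thread.)

```python
import torch
from transformers import AutoTokenizer
from bigdl.llm.transformers import AutoModelForCausalLM  # bigdl-llm import path (assumed)

save_path = "./rwkv-4-world-7b-int4/"   # directory produced by save_low_bit() above

# Reload the already-quantized weights; optimize_model=True is required per the comment above.
model = AutoModelForCausalLM.load_low_bit(save_path,
                                          trust_remote_code=True,
                                          optimize_model=True)
model = model.to('xpu')
tokenizer = AutoTokenizer.from_pretrained(save_path, trust_remote_code=True)

# Placeholder prompt and generation settings, purely for illustration.
input_ids = tokenizer("What is AI?", return_tensors="pt").input_ids.to('xpu')
with torch.inference_mode():
    output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```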

juan-OY commented 3 months ago

Resolved already.