TitleOS opened this issue 2 months ago
System Specs: Ryzen 5600G, Nvidia Tesla M40 24GB, 128GB DDR4 RAM
Error:
```
running layers(cuda:0):   1%|▍  | 1/129 [00:06<14:44,  6.91s/it]
Traceback (most recent call last):
  File "C:\Users\Darkl\OneDrive\source\repos\NeuralSlice\inference_405B_4bit.py", line 14, in <module>
    generation_output = model.generate(
                        ^^^^^^^^^^^^^^^
  File "C:\Users\Darkl\OneDrive\source\repos\NeuralSlice\NeuralSliceEnv\Lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Darkl\OneDrive\source\repos\NeuralSlice\NeuralSliceEnv\Lib\site-packages\transformers\generation\utils.py", line 2024, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "C:\Users\Darkl\OneDrive\source\repos\NeuralSlice\NeuralSliceEnv\Lib\site-packages\transformers\generation\utils.py", line 2982, in _sample
    outputs = self(**model_inputs, return_dict=True)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Darkl\OneDrive\source\repos\NeuralSlice\NeuralSliceEnv\Lib\site-packages\airllm\airllm_base.py", line 369, in __call__
    return self.forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Darkl\OneDrive\source\repos\NeuralSlice\NeuralSliceEnv\Lib\site-packages\airllm\airllm_base.py", line 569, in forward
    new_seq = layer(seq, **kwargs)[0]
              ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Darkl\OneDrive\source\repos\NeuralSlice\NeuralSliceEnv\Lib\site-packages\torch\nn\modules\module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Darkl\OneDrive\source\repos\NeuralSlice\NeuralSliceEnv\Lib\site-packages\torch\nn\modules\module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Darkl\OneDrive\source\repos\NeuralSlice\NeuralSliceEnv\Lib\site-packages\transformers\models\llama\modeling_llama.py", line 734, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
                                                          ^^^^^^^^^^^^^^^
  File "C:\Users\Darkl\OneDrive\source\repos\NeuralSlice\NeuralSliceEnv\Lib\site-packages\torch\nn\modules\module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Darkl\OneDrive\source\repos\NeuralSlice\NeuralSliceEnv\Lib\site-packages\torch\nn\modules\module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Darkl\OneDrive\source\repos\NeuralSlice\NeuralSliceEnv\Lib\site-packages\transformers\models\llama\modeling_llama.py", line 622, in forward
    key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: shape '[1, 5, 8, 128]' is invalid for input of size 10240
```
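One way to read the numbers in the error (my interpretation, not confirmed elsewhere in this thread): the `view` expects `bsz * q_len * num_key_value_heads * head_dim` elements, but the key projection produced exactly twice that many per token, which suggests the weights being loaded don't match the `num_key_value_heads=8` the Llama config declares. A quick sanity check of the arithmetic:

```python
# Dimensions taken straight from the RuntimeError message:
# shape '[1, 5, 8, 128]' is invalid for input of size 10240
bsz, q_len, num_kv_heads, head_dim = 1, 5, 8, 128

expected = bsz * q_len * num_kv_heads * head_dim  # what .view() needs
actual = 10240                                    # what key_states actually holds

print(expected)            # 5120
print(actual // expected)  # 2 -> twice as many elements per token as
                           #      8 KV heads x 128 head_dim would give
```

So the tensor coming out of `k_proj` is sized as if there were 16 KV heads (or an un-dequantized/doubled weight), pointing at a mismatch between the 4-bit checkpoint's stored weight shapes and how airllm reconstructs them, rather than at the prompt or generation settings.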
My inference code:
```python
from airllm import AutoModel

model = AutoModel.from_pretrained(
    "unsloth/Meta-Llama-3.1-405B-Instruct-bnb-4bit",
    delete_original=True)

input_text = input("Prompt the almighty 405B Llama: ")
input_tokens = model.tokenizer(
    input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=128,
    padding=False)

generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=10,
    return_dict_in_generate=True)

output = model.tokenizer.decode(generation_output.sequences[0])
print(output)
```
PIP List:
```
(NeuralSliceEnv) C:\Users\Darkl\OneDrive\source\repos\NeuralSlice>pip list
Package            Version
------------------ ------------
accelerate         0.33.0
aiohappyeyeballs   2.4.0
aiohttp            3.10.5
aiosignal          1.3.1
airllm             2.10.2
attrs              24.2.0
bitsandbytes       0.43.3
certifi            2024.7.4
charset-normalizer 3.3.2
colorama           0.4.6
coloredlogs        15.0.1
datasets           2.21.0
dill               0.3.8
filelock           3.15.4
frozenlist         1.4.1
fsspec             2024.6.1
huggingface-hub    0.24.6
humanfriendly      10.0
idna               3.8
Jinja2             3.1.4
MarkupSafe         2.1.5
mpmath             1.3.0
multidict          6.0.5
multiprocess       0.70.16
networkx           3.3
numpy              1.26.4
optimum            1.21.4
packaging          24.1
pandas             2.2.2
pillow             10.2.0
pip                24.2
protobuf           5.27.3
psutil             6.0.0
pyarrow            17.0.0
pyreadline3        3.4.1
python-dateutil    2.9.0.post0
pytz               2024.1
PyYAML             6.0.2
regex              2024.7.24
requests           2.32.3
safetensors        0.4.4
scipy              1.14.1
sentencepiece      0.2.0
setuptools         65.5.0
six                1.16.0
sympy              1.13.2
tokenizers         0.19.1
torch              2.4.0+cu121
torchaudio         2.4.0+cu121
torchvision        0.19.0+cu121
tqdm               4.66.5
transformers       4.44.2
typing_extensions  4.12.2
tzdata             2024.1
urllib3            2.2.2
xxhash             3.5.0
yarl               1.9.4
```
Same error here. Has anyone had any luck?
I posted a potential quick fix on the other issue here.
I can confirm this fix worked for me, and I was able to run inference with Llama 405B 4-bit on my setup. Thank you!