intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

llama3-8B causes MTL iGPU runtime error when ipex-llm is running AI inference #10999

Open zcwang opened 2 months ago

zcwang commented 2 months ago

Hello ipex-llm experts, I'm hitting an issue with Llama-3-8B on the MTL-H iGPU and would appreciate any advice from you. :)

There seems to be an issue with the iGPU on the MTL 155H, but no issue with the Arc A770, on Ubuntu 22.04 with kernel v6.8.2.
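For context, the llama3 generate.py example being run here (from python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama3) follows the usual ipex-llm HF-Transformers pattern. The snippet below is only a rough sketch of that flow, with the model path and prompt taken from the command below; the exact arguments of the maintained script may differ, and the real script also applies the Llama-3 chat template, which is why the prompt in the log is wrapped in <|start_header_id|> markers.

import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "meta-llama/Meta-Llama-3-8B-Instruct"

# load_in_4bit=True converts the weights to sym_int4, matching the
# "Converting the current model to sym_int4 format" line in the log below.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True)
model = model.half().to("xpu")

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
with torch.inference_mode():
    input_ids = tokenizer.encode("History of Intel", return_tensors="pt").to("xpu")
    output = model.generate(input_ids, max_new_tokens=64)
    print(tokenizer.decode(output[0], skip_special_tokens=False))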

(llm-test) intel@mydevice:~/work/ipex-llm/python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama3$ ONEAPI_DEVICE_SELECTOR=level_zero:0 python ./generate.py --repo-id-or-model-path meta-llama/Meta-Llama-3-8B-Instruct --prompt 'History of Intel' --n-predict 64
2024-05-13 14:56:26,831 - INFO - intel_extension_for_pytorch auto imported
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 11.45it/s]
2024-05-13 14:56:27,298 - INFO - Converting the current model to sym_int4 format......
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Inference time: 1.4299554824829102 s
-------------------- Prompt --------------------
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

History of Intel<|eot_id|><|start_header_id|>assistant<|end_header_id|>

-------------------- Output (skip_special_tokens=False) --------------------
<|begin_of_text|><|begin_of_text|><|start_header_id|>user<|end_header_id|>

History of Intel<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The fascinating history of Intel!

Intel Corporation, one of the world's leading semiconductor companies, has a rich history that spans over six decades. Here's a brief overview:

**Early Years (1957-1969)**

Intel was founded on July 18, 1957, by Gordon Moore and Robert Noy

LIBXSMM_VERSION: main_stable-1.17-3651 (25693763)
LIBXSMM_TARGET: adl [Intel(R) Core(TM) Ultra 7 155H]
Registry and code: 13 MB
Command: python ./generate.py --repo-id-or-model-path meta-llama/Meta-Llama-3-8B-Instruct --prompt History of Intel --n-predict 64
Uptime: 12.174066 s

LIBXSMM_VERSION: main_stable-1.17-3651 (25693763)
LIBXSMM_TARGET: adl [Intel(R) Core(TM) Ultra 7 155H]
Registry and code: 13 MB
Command: python ./generate.py --repo-id-or-model-path meta-llama/Meta-Llama-3-8B-Instruct --prompt History of Intel --n-predict 64
Uptime: 11.134912 s

Environment info

[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2024.17.3.0.08_160000]
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Core(TM) Ultra 7 155H OpenCL 3.0 (Build 0) [2024.17.3.0.08_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) Graphics OpenCL 3.0 NEO [24.13.29138.7]
[opencl:gpu:3] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO [24.13.29138.7]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.29138]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Arc(TM) Graphics 1.3 [1.3.29138]

intel_extension_for_pytorch   2.1.20+git0e2bee2
torch                         2.1.0.post0+cxx11.abi
torchvision                   0.16.0+fbb4cc5
sentence-transformers         2.3.1
transformers                  4.37.0
transformers-stream-generator 0.0.5
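As a hedged aside, the attention_mask / pad_token_id warnings in the log above are generic transformers messages rather than part of the iGPU problem; passing both explicitly to generate() is one way to silence them (illustrative only, reusing the model and tokenizer names from the sketch earlier in this comment):

# Pass attention_mask and pad_token_id explicitly so transformers does not
# have to guess them during open-ended generation.
inputs = tokenizer("History of Intel", return_tensors="pt").to("xpu")
output = model.generate(input_ids=inputs.input_ids,
                        attention_mask=inputs.attention_mask,
                        pad_token_id=tokenizer.eos_token_id,
                        max_new_tokens=64)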

qiuxin2012 commented 2 months ago

The Arc A770 and the iGPU can't work in the same environment; we are still working on it. Related issue: https://github.com/intel-analytics/ipex-llm/issues/10940. But the error is different; it should be RuntimeError: could not create a primitive. This difference may be caused by your different torch version.
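As a hedged aside, when both the A770 and the MTL iGPU are present it can be worth confirming which XPU index the process actually sees (in the sycl-ls listing above, level_zero:gpu:0 is the A770 and gpu:1 is the iGPU). A rough sketch, assuming the torch.xpu namespace provided by intel_extension_for_pytorch:

import torch
import intel_extension_for_pytorch  # noqa: F401 -- registers the XPU backend

# Print every XPU device visible to this process so the index used with
# ONEAPI_DEVICE_SELECTOR can be matched to a physical GPU.
for i in range(torch.xpu.device_count()):
    print(i, torch.xpu.get_device_name(i))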

zcwang commented 2 months ago

Got it! I will remove the Arc A770 and test the iGPU on MTL again.

BTW, I also tested the same SW environment on my TGL platform (Core i7-1185G7), and the iGPU indeed works well.

intel_extension_for_pytorch   2.1.20+git0e2bee2
torch                         2.1.0.post0+cxx11.abi
torchvision                   0.16.0+fbb4cc5
intel-openmp                  2024.1.0
openvino                      2024.1.0
openvino-telemetry            2024.1.0

[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2024.17.3.0.08_160000]
[opencl:cpu:1] Intel(R) OpenCL, 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz OpenCL 3.0 (Build 0) [2024.17.3.0.08_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Iris(R) Xe Graphics OpenCL 3.0 NEO  [24.13.29138.7]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Iris(R) Xe Graphics 1.3 [1.3.29138]

History of Intel<|eot_id|><|start_header_id|>assistant<|end_header_id|>

-------------------- Output (skip_special_tokens=False) --------------------
<|begin_of_text|><|begin_of_text|><|start_header_id|>user<|end_header_id|>

History of Intel<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Intel Corporation is an American multinational corporation that specializes in the design and manufacture of microprocessors, memory chips, and other semiconductor technologies. Here is a brief history of the company:

Early Years (1968-1979)

Intel was founded on July 18, 1968, by Gordon Moore and Robert N



@qiuxin2012, I appreciate your support.
zcwang commented 2 months ago

@qiuxin2012, I confirmed the MTL-H iGPU works well without the Arc A770 in the platform.

[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2024.17.3.0.08_160000]
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Core(TM) Ultra 7 155H OpenCL 3.0 (Build 0) [2024.17.3.0.08_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) Graphics OpenCL 3.0 NEO  [24.13.29138.7]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) Graphics 1.3 [1.3.29138]
...
(llm) intel@mydevice:~/work/ipex-llm/python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama3$ ONEAPI_DEVICE_SELECTOR=level_zero:0 python ./generate.py --repo-id-or-model-path meta-llama/Meta-Llama-3-8B-Instruct --prompt 'History of Intel' --n-predict 64
2024-05-15 10:36:33,547 - INFO - intel_extension_for_pytorch auto imported
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  5.48it/s]
2024-05-15 10:36:34,559 - INFO - Converting the current model to sym_int4 format......
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
Inference time: 6.857227563858032 s
-------------------- Prompt --------------------
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

History of Intel<|eot_id|><|start_header_id|>assistant<|end_header_id|>

-------------------- Output (skip_special_tokens=False) --------------------
<|begin_of_text|><|begin_of_text|><|start_header_id|>user<|end_header_id|>

History of Intel<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The legendary Intel!

Intel Corporation is an American multinational corporation that specializes in the design and manufacture of microprocessors, the "brain" of modern computers. Here's a brief history of the company:

**Early Years (1968-1971)**

Intel was founded on July 18, 1968, by Gordon

LIBXSMM_VERSION: main_stable-1.17-3651 (25693763)
LIBXSMM_TARGET: adl [Intel(R) Core(TM) Ultra 7 155H]
Registry and code: 13 MB
Command: python ./generate.py --repo-id-or-model-path meta-llama/Meta-Llama-3-8B-Instruct --prompt History of Intel --n-predict 64
Uptime: 63.459550 s