openvinotoolkit / openvino


[BUG] [GPU] Phi3 Medium int4 Runtime Error: probability tensor contains either `inf`, `nan` or element < 0 #25393

Open · fakezeta opened this issue 3 weeks ago

fakezeta commented 3 weeks ago

🐛 Describe the bug

Hi,

Running Phi3 Medium on LocalAI with the OpenVINO backend, I found that while the int8 quantization works correctly, the int4 quant gives the following error after a few tokens are generated:

12:41PM DBG GRPC(fakezeta/Phi-3-medium-4k-instruct-ov-int4-127.0.0.1:43099): stderr Exception in thread Thread-5 (generate):
12:41PM DBG GRPC(fakezeta/Phi-3-medium-4k-instruct-ov-int4-127.0.0.1:43099): stderr Traceback (most recent call last):
12:41PM DBG GRPC(fakezeta/Phi-3-medium-4k-instruct-ov-int4-127.0.0.1:43099): stderr   File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
12:41PM DBG GRPC(fakezeta/Phi-3-medium-4k-instruct-ov-int4-127.0.0.1:43099): stderr     self.run()
12:41PM DBG GRPC(fakezeta/Phi-3-medium-4k-instruct-ov-int4-127.0.0.1:43099): stderr   File "/usr/lib/python3.10/threading.py", line 953, in run
12:41PM DBG GRPC(fakezeta/Phi-3-medium-4k-instruct-ov-int4-127.0.0.1:43099): stderr     self._target(*self._args, **self._kwargs)
12:41PM DBG GRPC(fakezeta/Phi-3-medium-4k-instruct-ov-int4-127.0.0.1:43099): stderr   File "/build/backend/python/transformers/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
12:41PM DBG GRPC(fakezeta/Phi-3-medium-4k-instruct-ov-int4-127.0.0.1:43099): stderr     return func(*args, **kwargs)
12:41PM DBG GRPC(fakezeta/Phi-3-medium-4k-instruct-ov-int4-127.0.0.1:43099): stderr   File "/build/backend/python/transformers/venv/lib/python3.10/site-packages/optimum/intel/openvino/modeling_decoder.py", line 651, in generate
12:41PM DBG GRPC(fakezeta/Phi-3-medium-4k-instruct-ov-int4-127.0.0.1:43099): stderr     result = super().generate(
12:41PM DBG GRPC(fakezeta/Phi-3-medium-4k-instruct-ov-int4-127.0.0.1:43099): stderr   File "/build/backend/python/transformers/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
12:41PM DBG GRPC(fakezeta/Phi-3-medium-4k-instruct-ov-int4-127.0.0.1:43099): stderr     return func(*args, **kwargs)
12:41PM DBG GRPC(fakezeta/Phi-3-medium-4k-instruct-ov-int4-127.0.0.1:43099): stderr   File "/build/backend/python/transformers/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1758, in generate
12:41PM DBG GRPC(fakezeta/Phi-3-medium-4k-instruct-ov-int4-127.0.0.1:43099): stderr     result = self._sample(
12:41PM DBG GRPC(fakezeta/Phi-3-medium-4k-instruct-ov-int4-127.0.0.1:43099): stderr   File "/build/backend/python/transformers/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2437, in _sample
12:41PM DBG GRPC(fakezeta/Phi-3-medium-4k-instruct-ov-int4-127.0.0.1:43099): stderr     next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
12:41PM DBG GRPC(fakezeta/Phi-3-medium-4k-instruct-ov-int4-127.0.0.1:43099): stderr   File "/build/backend/python/transformers/venv/lib/python3.10/site-packages/nncf/torch/dynamic_graph/wrappers.py", line 81, in wrapped
12:41PM DBG GRPC(fakezeta/Phi-3-medium-4k-instruct-ov-int4-127.0.0.1:43099): stderr     op1 = operator(*args, **kwargs)
12:41PM DBG GRPC(fakezeta/Phi-3-medium-4k-instruct-ov-int4-127.0.0.1:43099): stderr RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

the models are https://huggingface.co/fakezeta/Phi-3-medium-4k-instruct-ov-int4 and https://huggingface.co/fakezeta/Phi-3-medium-4k-instruct-ov-int8

Opening the issue here since int8 is working.
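For context, a minimal sketch of how an int4 OpenVINO export like these is typically produced with optimum-intel weight-only quantization (the exact options used for the published checkpoints, e.g. group size and ratio, are assumptions):

# Hypothetical export sketch; the actual options used for the published
# checkpoints are not known, so treat these settings as assumptions.
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

q_config = OVWeightQuantizationConfig(bits=4)  # int4 weight-only quantization
model = OVModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-medium-4k-instruct",
    export=True,
    quantization_config=q_config,
    trust_remote_code=True,
)
model.save_pretrained("Phi-3-medium-4k-instruct-ov-int4")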

Environment

about-time==4.2.1
accelerate==0.31.0
aiohttp==3.9.5
aiosignal==1.3.1
alive-progress==3.1.5
annotated-types==0.7.0
async-timeout==4.0.3
attrs==23.2.0
autograd==1.6.2
bitsandbytes==0.43.1
certifi==2024.6.2
charset-normalizer==3.3.2
cma==3.2.2
coloredlogs==15.0.1
contourpy==1.2.1
cycler==0.12.1
datasets==2.14.4
deprecated==1.2.14
dill==0.3.7
filelock==3.15.4
fonttools==4.53.0
frozenlist==1.4.1
fsspec==2024.6.0
future==1.0.0
grapheme==0.6.0
grpcio==1.64.0
huggingface-hub==0.23.4
humanfriendly==10.0
idna==3.7
inquirerpy==0.3.4
intel-extension-for-pytorch==2.1.30.post0
intel-extension-for-transformers==1.4.2
jinja2==3.1.4
joblib==1.4.2
jsonschema==4.22.0
jsonschema-specifications==2023.12.1
jstyleson==0.0.2
kiwisolver==1.4.5
markdown-it-py==3.0.0
markupsafe==2.1.5
matplotlib==3.9.0
mdurl==0.1.2
mpmath==1.3.0
multidict==6.0.5
multiprocess==0.70.15
natsort==8.4.0
networkx==3.3
neural-compressor==2.4.1
ninja==1.11.1.1
nncf==2.11.0
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.5.40
nvidia-nvtx-cu12==12.1.105
onnx==1.16.1
opencv-python-headless==4.10.0.84
openvino==2024.2.0
openvino-telemetry==2024.1.0
openvino-tokenizers==2024.2.0.0
optimum==1.20.0
optimum-intel==1.17.2
packaging==24.1
pandas==2.2.2
pfzy==0.3.4
pillow==10.3.0
prettytable==3.10.0
prompt-toolkit==3.0.47
protobuf==5.27.1
psutil==6.0.0
py-cpuinfo==9.0.0
pyarrow==16.1.0
pycocotools==2.0.8
pydantic==2.7.4
pydantic-core==2.18.4
pydot==2.0.0
pygments==2.18.0
pymoo==0.6.1.1
pyparsing==3.1.2
python-dateutil==2.9.0.post0
pytz==2024.1
pyyaml==6.0.1
referencing==0.35.1
regex==2024.5.15
requests==2.32.3
rich==13.7.1
rpds-py==0.18.1
safetensors==0.4.3
schema==0.7.7
scikit-learn==1.5.0
scipy==1.13.1
sentencepiece==0.2.0
setuptools==69.5.1
six==1.16.0
sympy==1.12.1
tabulate==0.9.0
threadpoolctl==3.5.0
tiktoken==0.7.0
tokenizers==0.19.1
torch==2.1.0.post2+cxx11.abi
tqdm==4.66.4
transformers==4.41.2
triton==2.3.1
typing-extensions==4.12.2
tzdata==2024.1
urllib3==2.2.2
wcwidth==0.2.13
wrapt==1.16.0
xxhash==3.4.1
yarl==1.9.4

Python 3.10.12, Docker image based on Ubuntu 22.04.4, on an i5 12600 with 48 GB RAM in a Proxmox VM.

Minimal Reproducible Example

Model definition for LocalAI

name: phi3-medium
backend: transformers
parameters:
  model: fakezeta/Phi-3-medium-4k-instruct-ov-int4
context_size: 4096
type: OVModelForCausalLM
template:
  use_tokenizer_template: true
stopwords:
- "<|end|>"
- "<|endoftext|>"

Relevant code: the model is loaded with

from transformers import AutoTokenizer
from optimum.intel import OVModelForCausalLM

# `request` and `device_map` come from the LocalAI gRPC request handler
ovconfig = {"PERFORMANCE_HINT": "CUMULATIVE_THROUGHPUT", "GPU_DISABLE_WINOGRAD_CONVOLUTION": "YES"}
# Class matches `type: OVModelForCausalLM` from the model definition above
self.model = OVModelForCausalLM.from_pretrained(model_name,
                                                compile=True,
                                                trust_remote_code=request.TrustRemoteCode,
                                                ov_config=ovconfig,
                                                export=True,
                                                device=device_map)

self.tokenizer = AutoTokenizer.from_pretrained(model_name, use_safetensors=True)

Inference is done with:

from threading import Thread
from transformers import TextIteratorStreamer

# Stream tokens back as they are generated
streamer = TextIteratorStreamer(self.tokenizer,
                                skip_prompt=True,
                                skip_special_tokens=True)
# `inputs` is the tokenized prompt; sampling parameters come from the request
config = dict(inputs,
              max_new_tokens=max_tokens,
              temperature=request.Temperature,
              top_p=request.TopP,
              top_k=request.TopK,
              do_sample=sample,
              attention_mask=inputs["attention_mask"],
              eos_token_id=self.tokenizer.eos_token_id,
              pad_token_id=self.tokenizer.eos_token_id,
              streamer=streamer,
              stopping_criteria=criteria,
              use_cache=True,
              )
# Run generate() in a background thread so the streamer can be consumed
thread = Thread(target=self.model.generate, kwargs=config)
thread.start()

Are you going to submit a PR?

MaximProshin commented 3 weeks ago

@alexsu52, please take a look and check whether it's an NNCF issue or something in the runtime.

alexsu52 commented 3 weeks ago

@fakezeta, could you provide an end-to-end reproducer and the specification of the hardware you used? Does the issue reproduce without {"PERFORMANCE_HINT": "CUMULATIVE_THROUGHPUT", "GPU_DISABLE_WINOGRAD_CONVOLUTION": "YES"}?

I've run the model https://huggingface.co/fakezeta/Phi-3-medium-4k-instruct-ov-int4 on gsm8k using lm_eval (https://github.com/EleutherAI/lm-evaluation-harness) and did not get any errors.

fakezeta commented 3 weeks ago

Hi @alexsu52,

I made some further tests after your feedback. My HW is an i5 12600 with Intel(R) UHD Graphics 770 (iGPU).

I found that this error occurs only when the inference device is GPU; with CPU everything works fine. Tests were done both with and without {"PERFORMANCE_HINT": "CUMULATIVE_THROUGHPUT", "GPU_DISABLE_WINOGRAD_CONVOLUTION": "YES"}.
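For anyone reproducing this, a minimal sketch to confirm which devices the OpenVINO runtime sees on a given machine (the commented output reflects my setup):

import openvino as ov

core = ov.Core()
print(core.available_devices)                        # e.g. ['CPU', 'GPU']
print(core.get_property("GPU", "FULL_DEVICE_NAME"))  # Intel(R) UHD Graphics 770 here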

To check whether it's a HW limitation of the iGPU, I ran the following code on Intel Dev Cloud with the following HW: Intel(R) Xeon(R) Platinum 8480L and Intel(R) Data Center GPU Max 1100.

from transformers import AutoTokenizer, TextIteratorStreamer
from optimum.intel import OVModelForCausalLM

model_name = "fakezeta/Phi-3-medium-4k-instruct-ov-int4"
chat = [
    {"role": "user", "content": "Why the sky is blue?"},
]
device = "GPU"
model = OVModelForCausalLM.from_pretrained(model_name, compile=True, device=device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

streamer = TextIteratorStreamer(tokenizer,
                                skip_prompt=True,
                                skip_special_tokens=True)

prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt")
generated = model.generate(**inputs,
                           max_new_tokens=250,
                           temperature=0.7,
                           top_p=0.9,
                           do_sample=True,
                           eos_token_id=tokenizer.eos_token_id,
                           pad_token_id=tokenizer.eos_token_id,
                           use_cache=True,
                           )
generated_text = tokenizer.batch_decode(generated[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0]
print(generated_text)

Same behaviour here: it works with CPU and errors with GPU.

Traceback (most recent call last):
  File "/home/ua611b2bb184b16fb93a386bc0643807/test-phi3.py", line 18, in <module>
    generated = model.generate(**inputs,
  File "/home/ua611b2bb184b16fb93a386bc0643807/.local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ua611b2bb184b16fb93a386bc0643807/.local/lib/python3.9/site-packages/optimum/intel/openvino/modeling_decoder.py", line 659, in generate
    result = super().generate(
  File "/home/ua611b2bb184b16fb93a386bc0643807/.local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ua611b2bb184b16fb93a386bc0643807/.local/lib/python3.9/site-packages/transformers/generation/utils.py", line 1758, in generate
    result = self._sample(
  File "/home/ua611b2bb184b16fb93a386bc0643807/.local/lib/python3.9/site-packages/transformers/generation/utils.py", line 2437, in _sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
  File "/home/ua611b2bb184b16fb93a386bc0643807/.local/lib/python3.9/site-packages/nncf/torch/dynamic_graph/wrappers.py", line 81, in wrapped
    op1 = operator(*args, **kwargs)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

So it could be a problem in the GPU plugin.

alexsu52 commented 3 weeks ago

@vladimir-paramuzov could you take a look?

vladimir-paramuzov commented 3 weeks ago

@fakezeta, @alexsu52 It seems like an fp16 overflow happens somewhere during inference. With ovconfig={"INFERENCE_PRECISION_HINT": "f32"} it works fine:

The sky appears blue to the human eye because of the way sunlight interacts with Earth's atmosphere. Sunlight is made up of various colors of light, which together we perceive as white light. When sunlight enters Earth's atmosphere, it collides with molecules and small particles in the air.

The blue color of the sky is the result of a phenomenon called Rayleigh scattering. This occurs when the shorter (blue) wavelengths of light are scattered more than the longer (red) wavelengths by the small particles in the atmosphere. This scattering process causes the blue light to spread out and become more visible in all directions, giving the sky its blue color.

At sunrise and sunset, the sun's light has to pass through a thicker layer of the atmosphere, which scatters the blue and green wavelengths even more, leaving the longer (red and orange) wavelengths to dominate the sky's color. This is why the sky appears red, orange, or pink during sunrise and sunset.

In summary, the blue color of the sky is due to the scattering of sun

fakezeta commented 3 weeks ago

I confirm that with ovconfig={"INFERENCE_PRECISION_HINT": "f32"} it also works on the good old iGPU, and this explains why it works with the CPU plugin.
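For anyone hitting the same error, the workaround is a one-line change to the loading call in the reproducer above:

# Force f32 execution on GPU to work around the fp16 overflow
model = OVModelForCausalLM.from_pretrained(model_name,
                                           compile=True,
                                           device="GPU",
                                           ov_config={"INFERENCE_PRECISION_HINT": "f32"})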

Thank you @vladimir-paramuzov, what is your suggested way forward?

MaximProshin commented 1 week ago

@fakezeta, it means there is an issue in the GPU plugin. @vladimir-paramuzov, is it a known problem? Do you have an issue for that? Otherwise I would forward this issue to https://github.com/openvinotoolkit/openvino/issues

MaximProshin commented 1 week ago

transferred to openvino, assigned to @vladimir-paramuzov