Open fakezeta opened 3 weeks ago
@alexsu52, please take a look at whether it's an NNCF issue or something in the runtime.
@fakezeta, could you provide an end-to-end reproducer and the specification of the hardware you used? Does the issue reproduce without `{"PERFORMANCE_HINT": "CUMULATIVE_THROUGHPUT", "GPU_DISABLE_WINOGRAD_CONVOLUTION": "YES"}`?
I've run the model https://huggingface.co/fakezeta/Phi-3-medium-4k-instruct-ov-int4 on gsm8k using lm_eval (https://github.com/EleutherAI/lm-evaluation-harness) and did not get any errors.
Hi @alexsu52,
I made some further tests after your feedback. My HW is an i5 12600 with Intel(R) UHD Graphics 770 (iGPU).
I found that this error occurs only if the inference device is GPU; with CPU everything works fine.
Tests were done both with and without `{"PERFORMANCE_HINT": "CUMULATIVE_THROUGHPUT", "GPU_DISABLE_WINOGRAD_CONVOLUTION": "YES"}`.
To check whether it's a HW limitation of the iGPU, I ran the following code on Intel Dev Cloud with the following HW: Intel(R) Xeon(R) Platinum 8480L and Intel(R) Data Center GPU Max 1100:
```python
from transformers import AutoTokenizer, TextIteratorStreamer
from optimum.intel import OVModelForCausalLM

model_name = "fakezeta/Phi-3-medium-4k-instruct-ov-int4"
chat = [
    {"role": "user", "content": "Why the sky is blue?"},
]
device = "GPU"

model = OVModelForCausalLM.from_pretrained(model_name, compile=True, device=device)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# NOTE: the streamer is created but not passed to generate() in this reproducer
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt")
generated = model.generate(
    **inputs,
    max_new_tokens=250,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
    use_cache=True,
)
generated_text = tokenizer.batch_decode(
    generated[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(generated_text)
```
Same behaviour here: working with CPU and error with GPU.
```
Traceback (most recent call last):
  File "/home/ua611b2bb184b16fb93a386bc0643807/test-phi3.py", line 18, in <module>
    generated = model.generate(**inputs,
  File "/home/ua611b2bb184b16fb93a386bc0643807/.local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ua611b2bb184b16fb93a386bc0643807/.local/lib/python3.9/site-packages/optimum/intel/openvino/modeling_decoder.py", line 659, in generate
    result = super().generate(
  File "/home/ua611b2bb184b16fb93a386bc0643807/.local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ua611b2bb184b16fb93a386bc0643807/.local/lib/python3.9/site-packages/transformers/generation/utils.py", line 1758, in generate
    result = self._sample(
  File "/home/ua611b2bb184b16fb93a386bc0643807/.local/lib/python3.9/site-packages/transformers/generation/utils.py", line 2437, in _sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
  File "/home/ua611b2bb184b16fb93a386bc0643807/.local/lib/python3.9/site-packages/nncf/torch/dynamic_graph/wrappers.py", line 81, in wrapped
    op1 = operator(*args, **kwargs)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
```
So it may be a problem in the GPU plugin.
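For context, the `RuntimeError` in the traceback above can be reproduced in isolation with plain PyTorch whenever the probability tensor contains NaN, e.g. after an fp16 overflow turns a logit into `inf`. This is a standalone sketch of the failure mode, not the plugin's actual code path:

```python
import torch

# fp16 max is 65504; doubling it overflows to inf
logits = torch.tensor([65504.0, 1.0], dtype=torch.float16) * 2
assert torch.isinf(logits[0])

# softmax over a vector containing inf yields NaN probabilities
probs = torch.softmax(logits.float(), dim=-1)

# sampling from NaN probabilities raises the exact error seen above:
# "probability tensor contains either `inf`, `nan` or element < 0"
try:
    torch.multinomial(probs, num_samples=1)
except RuntimeError as e:
    print(e)
```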
@vladimir-paramuzov could you take a look?
@fakezeta, @alexsu52 Seems like an fp16 overflow happens somewhere during inference. With `ov_config={"INFERENCE_PRECISION_HINT": "f32"}` it works fine:
The sky appears blue to the human eye because of the way sunlight interacts with Earth's atmosphere. Sunlight is made up of various colors of light, which together we perceive as white light. When sunlight enters Earth's atmosphere, it collides with molecules and small particles in the air.
The blue color of the sky is the result of a phenomenon called Rayleigh scattering. This occurs when the shorter (blue) wavelengths of light are scattered more than the longer (red) wavelengths by the small particles in the atmosphere. This scattering process causes the blue light to spread out and become more visible in all directions, giving the sky its blue color.
At sunrise and sunset, the sun's light has to pass through a thicker layer of the atmosphere, which scatters the blue and green wavelengths even more, leaving the longer (red and orange) wavelengths to dominate the sky's color. This is why the sky appears red, orange, or pink during sunrise and sunset.
In summary, the blue color of the sky is due to the scattering of sun
I confirm that with `ov_config={"INFERENCE_PRECISION_HINT": "f32"}` it also works on the good old iGPU, and this explains why it works with the CPU plugin.
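For anyone hitting the same issue, the workaround can be passed through the `ov_config` argument of optimum-intel's `from_pretrained` when loading the model. A minimal sketch using the model from this thread (the actual load is commented out since it downloads several GB of weights):

```python
# Force f32 execution precision in the OpenVINO GPU plugin
ov_config = {"INFERENCE_PRECISION_HINT": "f32"}

# Hypothetical usage (requires optimum-intel and the model weights):
# from optimum.intel import OVModelForCausalLM
# model = OVModelForCausalLM.from_pretrained(
#     "fakezeta/Phi-3-medium-4k-instruct-ov-int4",
#     device="GPU",
#     ov_config=ov_config,
# )
print(ov_config["INFERENCE_PRECISION_HINT"])
```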
Thank you @vladimir-paramuzov, what is your suggested way forward?
@fakezeta, it means there is an issue in the GPU plugin. @vladimir-paramuzov, is it a known problem? Do you have an issue tracked for that? Otherwise I would forward this issue to https://github.com/openvinotoolkit/openvino/issues
Transferred to openvino, assigned to @vladimir-paramuzov
🐛 Describe the bug
Hi,
Running Phi3 Medium on LocalAI with the OpenVINO backend, I found that while the int8 quantization works correctly, the int4 quant gives the following error after a few tokens are generated:
The models are
https://huggingface.co/fakezeta/Phi-3-medium-4k-instruct-ov-int4
and https://huggingface.co/fakezeta/Phi-3-medium-4k-instruct-ov-int8
Opening here since int8 is working.
Environment
Python 3.10.12, Docker image based on Ubuntu 22.04.4, on an i5 12600 with 48GB RAM, Proxmox VM
Minimal Reproducible Example
Model definition for LocalAI
relevant code: Model is loaded with
While inference is done with:
Are you going to submit a PR?