openvinotoolkit / openvino

OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference
https://docs.openvino.ai
Apache License 2.0

Issue when running chat_sample on Intel GPU #26057

Open liuxt670 opened 1 month ago

liuxt670 commented 1 month ago

Hi, we encountered some issues while running your sample on an Intel GPU. This is the model we are using: https://huggingface.co/internlm/internlm2-chat-1_8b/tree/main. We converted this model to the int4 OpenVINO format and are running it on an Intel GPU.

Here is the code we are running; we only made minor modifications to your sample code:

#include <functional>
#include <iostream>
#include <string>

#include "openvino/genai/llm_pipeline.hpp"

int main(int argc, char* argv[]) {
    std::string model_path = argv[1];  // path to the converted int4 model
    ov::genai::LLMPipeline pipe(model_path, "GPU");

    ov::genai::GenerationConfig config;
    config.max_new_tokens = 100;
    config.do_sample = false;  // greedy decoding: output should be deterministic
    std::function<bool(std::string)> streamer = [](std::string word) {
        std::cout << word << std::flush;
        // The return flag tells the pipeline whether to stop generation;
        // false means continue.
        return false;
    };

    for (int i = 0; i < 10; ++i) {
        pipe.start_chat();
        std::string prompt = "Hi! Can you tell me a story about little cat?";
        pipe.generate(prompt, config, streamer);
        std::cout << "\n----------\n";
        pipe.finish_chat();
    }
}

In the code above, we set config.do_sample = false and use pipe.finish_chat() to clear the KV cache after each round of generation. Ideally, we should get exactly the same result in every round; however, we found that in each run the result of the first round is always random, while the remaining rounds are identical to each other. It seems the do_sample = false setting does not take effect for the first round of generation. Here are some results:

[two screenshots: the first-round output differs between runs, while the later rounds match]

Do you have any suggestions on this issue? Thanks a lot!
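
For reference, here is a variant that collects each round's full output and compares the rounds programmatically instead of eyeballing the streamed text. This is only a sketch: it assumes that generate() called without a streamer returns decoded text convertible to std::string, as in the other GenAI samples.

#include <iostream>
#include <string>
#include <vector>

#include "openvino/genai/llm_pipeline.hpp"

int main(int argc, char* argv[]) {
    ov::genai::LLMPipeline pipe(argv[1], "GPU");

    ov::genai::GenerationConfig config;
    config.max_new_tokens = 100;
    config.do_sample = false;  // greedy decoding: rounds should be identical

    std::vector<std::string> outputs;
    for (int i = 0; i < 10; ++i) {
        pipe.start_chat();
        // Assumption: with no streamer, generate() returns the decoded text.
        std::string text = pipe.generate("Hi! Can you tell me a story about little cat?", config);
        pipe.finish_chat();  // drop the chat history / KV cache between rounds
        outputs.push_back(text);
    }

    // Under greedy decoding, every round should match round 0 exactly.
    for (size_t i = 1; i < outputs.size(); ++i)
        if (outputs[i] != outputs[0])
            std::cout << "Round " << i << " differs from round 0\n";
}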

Wovchena commented 1 month ago
  1. How do I convert your model? optimum-cli export openvino --trust-remote-code --model internlm/internlm2-chat-1_8b --weight-format int4 internlm2-chat-1_8b fails for optimum==1.21.2 and optimum-intel==1.18.1 with
      File "C:\Users\vzlobin\AppData\Roaming\Python\Python312\site-packages\optimum\exporters\openvino\model_patcher.py", line 1117, in _internlm2_attention_forward
        kv_seq_len += past_key_value[0].shape[-2]
                      ^^^^^^^^^^^^^^^^^^^^^^^
    AttributeError: 'tuple' object has no attribute 'shape'
  2. Is it integrated or discrete GPU?
  3. Does CPU have this issue? (See the sketch after this list for one way to check.)
  4. Try upgrading OpenVINO and GenAI to nightly versions: https://docs.openvino.ai/2024/get-started/install-openvino.html?PACKAGE=OPENVINO_GENAI&VERSION=NIGHTLY&OP_SYSTEM=WINDOWS&DISTRIBUTION=ARCHIVE
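
For point 3, here is a quick way to compare devices in one program. This is only a sketch reusing the API from your snippet, and it again assumes generate() without a streamer returns text convertible to std::string. CPU and GPU outputs need not match each other, but each device should be self-consistent across rounds:

#include <iostream>
#include <string>

#include "openvino/genai/llm_pipeline.hpp"

int main(int argc, char* argv[]) {
    for (const std::string device : {"CPU", "GPU"}) {
        ov::genai::LLMPipeline pipe(argv[1], device);

        ov::genai::GenerationConfig config;
        config.max_new_tokens = 100;
        config.do_sample = false;

        std::string first;
        for (int i = 0; i < 3; ++i) {
            pipe.start_chat();
            std::string text = pipe.generate("Hi! Can you tell me a story about little cat?", config);
            pipe.finish_chat();
            if (i == 0)
                first = text;  // reference output for this device
            else if (text != first)
                std::cout << device << ": round " << i << " differs from round 0\n";
        }
    }
}
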
liuxt670 commented 1 month ago

@Wovchena Thanks for the reply.

  1. We are using optimum==1.21.2 and optimum-intel==1.19.0.dev0+9ef6766, and this is the command we use to convert the model: optimum-cli export openvino --task text-generation-with-past --model internlm2-chat-1_8b/ --weight-format int4 --trust-remote-code internlm2_openvino
  2. It's an integrated GPU.
  3. The CPU doesn't have this issue.
  4. We have tried several nightly versions, from nightly-20240726 to nightly-202408012, and got the same issue with all of them.
Wovchena commented 1 month ago

pip install git+https://github.com/huggingface/optimum-intel.git@9ef6766 solved the conversion problem. Since the CPU is fine, do_sample=false is not to blame; it's a GPU problem. I transferred the issue to the main repo.