
Request to provide a RAG example? #7

Closed ChenYuYeh closed 5 months ago

ChenYuYeh commented 5 months ago

This is a cool project. I can make it run well on my Meteor Lake system.

By the way, would you kindly provide a RAG (Retrieval-Augmented Generation) example that can reference external documents? Sooner or later. Thanks.

Reference link: https://github.com/yas-sim/openvino-llm-chatbot-rag

alessandropalla commented 5 months ago

I think it is a great idea! I was working on some splashy demos like LoRA fine-tuning, but this is a much lower-hanging fruit. Given the nature of this library, I think we should aim for a smooth LangChain integration.
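For reference, a minimal sketch of what such a LangChain integration could look like, assuming the HuggingFacePipeline wrapper (import path varies across LangChain versions) and assuming the compiled model still drops into a standard transformers pipeline; the model id is a placeholder:

# Hedged sketch: wrap an NPU-compiled Hugging Face model as a LangChain LLM.
# Assumes langchain_community and transformers are installed; model id is a placeholder.
import torch
import intel_npu_acceleration_library
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain_community.llms import HuggingFacePipeline

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, use_cache=True).eval()
model = intel_npu_acceleration_library.compile(model, dtype=torch.int8)

# Standard text-generation pipeline around the compiled model
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=256)

# LangChain-compatible LLM that can later be plugged into a retrieval chain
llm = HuggingFacePipeline(pipeline=pipe)
print(llm.invoke("What does an NPU accelerate?"))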

ChenYuYeh commented 5 months ago

Exactly, going with LangChain is the most popular RAG solution! I hope I can help validate it soon! https://python.langchain.com/docs/expression_language/cookbook/retrieval

ChenYuYeh commented 5 months ago

@alessandropalla I tried to verify using the NPU as the device with this GitHub repo: https://github.com/yas-sim/openvino-llm-chatbot-rag. However, it errors out because the NPU cannot support dynamic shapes...

I tried both 'dolly2-3b' and 'TinyLlama-1.1B-Chat-v1.0', so it should not be a memory issue.

File "/opt/intel/openvino/python/openvino/runtime/ie_api.py", line 543, in compile_model super().compile_model(model, device_name, {} if config is None else config), RuntimeError: Exception from src/inference/src/core.cpp:113: [ GENERAL_ERROR ] Exception from src/vpux_plugin/src/plugin.cpp:579: get_shape was called on a descriptor::Tensor with dynamic shape

Therefore I checked the OpenVINO documentation for the NPU device; it mentions certain limitations: https://docs.openvino.ai/2023.3/openvino_docs_OV_UG_supported_plugins_NPU.html
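For context, the limitation that bites here is that the NPU plugin expects static shapes. A minimal sketch, assuming OpenVINO's Python API and a placeholder model.xml, of pinning the dynamic dimensions before compiling for the NPU device:

# Hedged sketch: the NPU plugin needs static shapes, so fix the dynamic
# dimensions (batch size, sequence length) before compiling. "model.xml"
# and the sizes below are placeholders.
from openvino.runtime import Core

core = Core()
model = core.read_model("model.xml")

# Pin the usual LLM inputs to a fixed batch size and sequence length
model.reshape({"input_ids": [1, 512], "attention_mask": [1, 512]})

compiled = core.compile_model(model, "NPU")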

Could you kindly verify whether this limitation is observed on your end? Thanks a lot.

output.log

CPU and iGPU work well.

alessandropalla commented 5 months ago

How did you enable NPU inference in that repo? I suggest editing line 59 of openvino-rag-server.py with the following (like the llama.py example):

model = AutoModelForCausalLM.from_pretrained(model_id, use_cache=True).eval()
model = intel_npu_acceleration_library.compile(model, dtype=torch.int8)

You should also import the proper libraries (note that torch is needed for torch.int8):

from transformers import AutoModelForCausalLM
import intel_npu_acceleration_library
import torch
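For completeness, a hedged sketch of how the compiled model from the two lines above could then be driven for generation (the prompt is a placeholder, and it is assumed the compiled model keeps the standard generate() interface):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("What is an NPU?", return_tensors="pt")

# Standard transformers generation; the heavy Linear layers run on the NPU
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))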
ChenYuYeh commented 5 months ago

Thanks for your suggestions! I can use the NPU for RAG inference now, although both the CPU and NPU show extremely high load. Do you know what the policy is for resource assignment between the CPU and NPU within your library?

alessandropalla commented 5 months ago

I'm trying to reverse-engineer the script you provided. From that one it seems that the embedding model runs on the CPU while the LLM runs on mixed NPU/CPU. Then it really depends on the model used. By default the library offloads torch.nn.Linear layers and a few others to the NPU, but we have ad-hoc optimizations for some networks. We want to add torch.nn.functional.scaled_dot_product_attention offload to the NPU, and I'm working on that. Also, the more you offload at the same time, the better, as explained very well here.
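To make that default policy concrete, a purely illustrative sketch (not the library's actual dispatch logic) that inspects how much of a model falls into the torch.nn.Linear layers that get offloaded by default:

# Hedged sketch: count which module types a model contains; nn.Linear layers
# are the default NPU offload candidates, the rest stays on CPU by default.
import torch
from collections import Counter
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

counts = Counter(type(m).__name__ for m in model.modules())
linear_params = sum(p.numel() for m in model.modules()
                    if isinstance(m, torch.nn.Linear) for p in m.parameters())
total_params = sum(p.numel() for p in model.parameters())

print(counts.most_common(5))
print(f"{linear_params / total_params:.1%} of parameters sit in Linear layers")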

The NNFactory class can be used to create custom graphs to offload to the NPU (example here for MLP) and is the next logical step in this library's performance journey:

torch.compile -> fx graph -> subgraph extraction -> NPU
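As a toy illustration of that pipeline (not the library's actual backend), torch.compile hands the traced FX graph to a custom backend callable, which is where NPU subgraph extraction would slot in:

import torch

def npu_sketch_backend(gm: torch.fx.GraphModule, example_inputs):
    # torch.compile delivers an FX graph here; a real backend would extract
    # the NPU-friendly subgraphs (Linear, SDPA, ...) and lower them to the NPU.
    print(gm.graph)          # inspect the captured ops
    return gm.forward        # fall back to eager execution in this sketch

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())
compiled = torch.compile(model, backend=npu_sketch_backend)
compiled(torch.randn(1, 64))  # first call triggers tracing and the backend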

External contributions are very welcome, by the way, if you want to implement an ad-hoc backend on the NPU.

ChenYuYeh commented 5 months ago

Not sure what I can contribute to this project; hopefully you can give me some more hints. By the way, I verified multiple times that the LLM-powered RAG performance ranks CPU (13 words/s) > GPU (8 words/s) > NPU (3 words/s). I wonder whether it is proper to offload more of the load to the NPU; it does not seem to perform as you expected.

alessandropalla commented 5 months ago

Yes, that is not expected. I'll dig deeper.

ChenYuYeh commented 5 months ago

Hi @alessandropalla, I would like to verify this library with the CPU and GPU as inference devices as well, for platforms without an NPU. How do I configure the compile API for that? I also wonder whether it can support AMD platforms. Thanks.

alessandropalla commented 5 months ago

Can you clarify which script you are using so I can debug performance? I'd also like to know more about your use cases for heterogeneous compute so I can adapt the compile API to users' needs.

Related to AMD support, it is not on our short-term roadmap, but I'm very sensitive to community needs, so if you have a strong use case I'd love to hear it.

ChenYuYeh commented 5 months ago

Hi @alessandropalla,

Kindly ignore the title's request for RAG support; let's focus on LLM performance first.

Therefore no new script/Python code is needed; I recommend just using llama.py to benchmark performance with the xPU options. So far there is no parameter to choose the device with the API. Or is this API/library meant to use the NPU exclusively?
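For what it's worth, a minimal hedged sketch of the kind of words/s measurement quoted above, timed around a plain transformers generate() call (model and tokenizer are assumed to be already loaded as in llama.py, with or without the NPU compile step):

import time

prompt = "Explain retrieval augmented generation in two sentences."
inputs = tokenizer(prompt, return_tensors="pt")

start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=128)
elapsed = time.perf_counter() - start

# Strip the prompt tokens and report generated words per second
text = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(f"{len(text.split()) / elapsed:.1f} words/s")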

No hurry for AMD support. Thanks. :)

alessandropalla commented 5 months ago

OK, I'll close the issue then. I'll keep you posted on the next releases.