opea-project / GenAIExamples

Generative AI Examples is a collection of GenAI examples such as ChatQnA, Copilot, which illustrate the pipeline capabilities of the Open Platform for Enterprise AI (OPEA) project.
https://opea.dev
Apache License 2.0

RAG is slow in ChatQnA demo on Xeon #584

Open NeoZhangJianyu opened 1 month ago

NeoZhangJianyu commented 1 month ago

I set up the demo based on ChatQnA (TGI) on Xeon (GNR) and tried RAG via the UI. After uploading a PDF file (2-5 MB), I ask a question; the answer takes 10-15 s.

When I upload a text file with only 3 lines instead, it takes 2-3 s.

The customer traced the slowdown to the embedding stage.
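One way to confirm that the embedding stage is the bottleneck is to time raw TEI `/embed` requests directly, outside the ChatQnA pipeline. Below is a minimal sketch, assuming TEI is reachable at the Xeon guide's default port 6006; the helper names (`time_embed`, `summarize`) are hypothetical and not part of this repo:

```python
import json
import statistics
import time
import urllib.request

# Assumed TEI endpoint based on the Xeon guide's defaults; adjust to your deployment.
TEI_EMBED_URL = "http://localhost:6006/embed"

def time_embed(text: str, url: str = TEI_EMBED_URL) -> float:
    """POST one /embed request to TEI and return its latency in seconds."""
    payload = json.dumps({"inputs": text}).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return time.perf_counter() - start

def summarize(samples: list[float]) -> dict:
    """Median and worst-case latency over repeated timings."""
    return {"p50": statistics.median(samples), "max": max(samples)}
```

Usage would look like `summarize([time_embed("sample query") for _ in range(10)])`; if the p50 here is already multiple seconds, the slowdown is in embedding serving rather than retrieval or generation.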

Zhenzhong1 commented 4 weeks ago

@NeoZhangJianyu Hi

Which version of the TEI did you use for embedding? Did you implement reranking? If so, was it done on CPU or HPU?

Additionally, could you provide details about all images you used and how you built your ChatQnA pipeline?

NeoZhangJianyu commented 3 weeks ago

I use the Docker images built from the Dockerfiles in this project (comment id: 3913c7bb3629b2964ff1ddf38a2e4b5359ea43bb).

I did not implement reranking. The Docker build follows the scripts/commands in the ChatQnA guide: https://github.com/opea-project/GenAIExamples/blob/main/ChatQnA/docker/xeon/README.md

Zhenzhong1 commented 3 weeks ago

@NeoZhangJianyu

Embedding should work well on GNR. Later we will apply Neural Speed (internal) to speed up embedding serving.

Also, we will investigate this issue.

Zhenzhong1 commented 1 week ago

@NeoZhangJianyu @lvliang-intel

Hello, we’ve evaluated the embedding performance.

Our testing data shows a significant performance gap between TEI 1.2 and TEI 1.5. Please verify which TEI version you are currently using.

In OPEA v0.9, TEI performance should not be an issue.
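To check the deployed TEI version without rebuilding anything, one option is to query the server's `/info` endpoint, which includes a `version` field in its JSON response. A minimal sketch, assuming the default port 6006 from the Xeon guide (the helper names are hypothetical):

```python
import json
import urllib.request

# Assumed default port from the ChatQnA Xeon guide; adjust to your deployment.
TEI_INFO_URL = "http://localhost:6006/info"

def tei_version(info: dict) -> str:
    """Pull the server version out of a TEI /info response body."""
    return info.get("version", "unknown")

def fetch_tei_version(url: str = TEI_INFO_URL) -> str:
    """Query a running TEI container and report its version string."""
    with urllib.request.urlopen(url) as resp:
        return tei_version(json.load(resp))
```

Comparing the reported version against 1.5 would tell you whether the performance gap above applies to your deployment.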

NeoZhangJianyu commented 2 days ago

It's great!