opea-project / GenAIExamples

Generative AI Examples is a collection of GenAI examples such as ChatQnA, Copilot, which illustrate the pipeline capabilities of the Open Platform for Enterprise AI (OPEA) project.
https://opea.dev
Apache License 2.0

ChatQnA on Xeon Docker Implementation - Embedding into Vector DB Failing #394

Open mandalrajiv opened 1 month ago

mandalrajiv commented 1 month ago

I am trying the ChatQnA GenAIExample with Docker on Xeon. I am uploading the document https://docs.aws.amazon.com/pdfs/whitepapers/latest/optimizing-postgresql-on-ec2-using-ebs/optimizing-postgresql-on-ec2-using-ebs.pdf?did=wp_card&trk=wp_card, a public whitepaper published by AWS. Embedding into the vector DB is failing. The docker logs for the dataprep-redis-server service show "Parsing document", but I do not see a document upload successful message.

Can you please check why this document upload is failing?

mandalrajiv commented 1 month ago

Any update on this issue? We have a customer engagement where we want to show a demo of OPEA ChatQnA. Until this issue is fixed, we are not able to show the demo to the customer.

lkk12014402 commented 1 month ago

Hi, could you share the error log here?

mandalrajiv commented 1 month ago

Do you need the docker logs? Or some other error logs as well?

lkk12014402 commented 1 month ago

The docker logs of the dataprep-redis-server service. I want to see what errors occurred.

mandalrajiv commented 1 month ago

Here are the docker logs of the dataprep-redis-server service. At the bottom of the log file, it says "Parsing document ./uploaded_files/optimizing-postgresql-on-ec2-using-ebs.pdf.". If the document upload is successful, we typically see something like an "upload successful" message.

dataprep-redis-server container logs:

Added as an attachment. retriever_log.txt

eero-t commented 1 month ago

The last warnings in the log, about things no longer being supported, look quite suspicious.

Btw. @mandalrajiv It's better to provide such (long) log files as attachments (instead of pasting them as inline comments), to keep the ticket readable.

mandalrajiv commented 1 month ago

Thank you. I have updated the comment to include the log as a file attachment.

eero-t commented 1 month ago

There's an odd warning about an invalid HTTP request, and I'm not sure how to interpret your log, as there seem to be multiple logs, interrupted in the middle?

WARNING: Invalid HTTP request received.
Using CPU. Note: This module is much faster with a GPU.
Downloading detection model, please wait. This may take several minutes depending upon your network connection.
files:UploadFile(filename='optimizing-postgresql-on-ec2-using-ebs.pdf', size=628163, headers=Headers({'content-disposition': 'form-data; name="files"; filename="optimizing-postgresql-on-ec2-using-ebs.pdf"', 'content-type': 'application/pdf'}))
link_list:None
Parsing document ./uploaded_files/optimizing-postgresql-on-ec2-using-ebs.pdf.
Progress: |███████████████████████████████  Downloading recognition model, please wait. This may take several minutes depending upon your network connection.
Progress: |████████████root@ip-172-31-26-186:/home/ubuntu/GenAIExamples/ChatQnA/docker/xeon# clear
root@ip-172-31-26-186:/home/ubuntu/GenAIExamples/ChatQnA/docker/xeon# docker logs dataprep-redis-server

The PDF doc itself doesn't seem large, just 616KiB / 37 pages.

I haven't tried the dataprep service myself (I'm not an OPEA dev), but is the service terminating abnormally during document upload, or is it stuck on the upload?

What about the recognition model, which one are you using?

mandalrajiv commented 1 month ago

On the web UI, after I select the file to upload, it does not show me a document upload successful message. Other PDFs of one or a few pages that I have uploaded do show the upload successful message.

I am using all the defaults mentioned in the OPEA ChatQnA example for Xeon. Please see below.

export EMBEDDING_MODEL_ID="BAAI/bge-base-en-v1.5"
export RERANK_MODEL_ID="BAAI/bge-reranker-base"
export LLM_MODEL_ID="Intel/neural-chat-7b-v3-3"

eero-t commented 1 month ago

I guess upload processing time is linearly related to the amount of text => a 37-page doc could take roughly 20x longer than a 2-page one. I.e. if a 2-page upload goes through in 2 mins, that doc could take about 40 mins.

How long have you waited? And do other logs show timeouts?
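
The linear guess above can be written out as a short calculation. This is only a sketch of that assumption (processing time proportional to page count); the function name and the reference numbers are illustrative and not measured from the dataprep service.

```python
def estimate_minutes(pages: int, ref_pages: int, ref_minutes: float) -> float:
    """Scale a reference upload time linearly by page count.

    Assumes processing time is proportional to the number of pages,
    which is a guess made in this thread, not a measured property.
    """
    if ref_pages <= 0:
        raise ValueError("ref_pages must be positive")
    return pages / ref_pages * ref_minutes


# Hypothetical reference point: a 2-page document that took 2 minutes.
estimate = estimate_minutes(pages=37, ref_pages=2, ref_minutes=2.0)
print(f"Estimated processing time: {estimate:.0f} minutes")
```

With the exact ratio of 37/2 rather than the rounded 20x, the estimate comes out slightly under 40 minutes.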

mandalrajiv commented 1 month ago

The entrypoint of the dataprep container is a Python program called "prepare_doc_redis.py". After I uploaded the document, I used the htop utility to see the resources (CPU, memory) used by the processes started by prepare_doc_redis.py. It takes fairly long for the Python program instances (prepare_doc_redis.py) to complete. I watched the CPU peak during ingestion and then drop to the minimum once no active process was running prepare_doc_redis.py. So, I am fairly certain that I have waited the appropriate amount of time.

What other logs do you need? I can dig up the logs if you can tell me what other logs are required?

lkk12014402 commented 1 month ago

Hi, I have looked at your logs and there are no errors. It seems the process is stuck at the "Parsing document" step.

I used your data https://docs.aws.amazon.com/pdfs/whitepapers/latest/optimizing-postgresql-on-ec2-using-ebs/optimizing-postgresql-on-ec2-using-ebs.pdf?did=wp_card&trk=wp_card to upload to Redis with dataprep-redis-server (prepare_doc_redis.py). I timed the "Parsing document" step: it is indeed a bit slow, but the data is uploaded successfully. On my side, the step takes ~5 mins. It relies on easyocr to parse the PDF file, which is time-consuming.

So can you measure the time or print some logs during the upload process by revising the source code https://github.com/opea-project/GenAIComps/blob/main/comps/dataprep/utils.py#L91? Once you revise the code, you need to rebuild the docker image.

Thanks~
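
One way to do the timing suggested above is a small context manager wrapped around the parsing call. This is a minimal sketch, not the actual code in utils.py: `parse_document` is a hypothetical stand-in for whatever function does the parsing, and the logger name is made up.

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("dataprep-timing")  # hypothetical logger name


@contextmanager
def timed(step: str):
    """Log when a step starts and how long it took, even if it raises."""
    start = time.perf_counter()
    logger.info("%s: started", step)
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        logger.info("%s: finished after %.1f s", step, elapsed)


def parse_document(path: str) -> str:
    # Hypothetical stand-in for the real (slow, OCR-based) parsing call.
    time.sleep(0.1)
    return f"text of {path}"


with timed("Parsing document"):
    text = parse_document("./uploaded_files/example.pdf")
```

Because the elapsed time is logged in the `finally` branch, the log still shows how long the step ran if parsing fails with an exception, which helps tell a slow parse from a crashed one.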

js333031 commented 1 month ago

Document upload and processing need to be made a background task so that the user experience is improved. The UI should indicate that the document has been uploaded and processing is taking place. The state of the document processing needs to be reflected in the UI for each uploaded document. Provide an estimated time to completion, etc. Currently, the user thinks something went wrong and tries a refresh or retries the document upload.
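
The background-task idea above could look roughly like the following. This is a sketch under stated assumptions, not the dataprep service's API: `submit_upload`, `get_status`, the in-memory `jobs` table, and `slow_ingest` are all hypothetical names, and a real service would need persistent state and an HTTP endpoint for the UI to poll.

```python
import time
import uuid
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2)
jobs = {}  # job_id -> {"filename": ..., "status": ..., "error": ...}


def slow_ingest(filename: str) -> None:
    # Hypothetical stand-in for parsing, embedding, and storing a document.
    time.sleep(0.2)


def _run(job_id: str, filename: str) -> None:
    """Worker: update the job's status as processing progresses."""
    jobs[job_id]["status"] = "processing"
    try:
        slow_ingest(filename)
        jobs[job_id]["status"] = "done"
    except Exception as exc:
        jobs[job_id]["status"] = "failed"
        jobs[job_id]["error"] = str(exc)


def submit_upload(filename: str) -> str:
    """Register the upload and return immediately with a job id."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"filename": filename, "status": "queued", "error": None}
    executor.submit(_run, job_id, filename)
    return job_id


def get_status(job_id: str) -> str:
    """What a UI could poll to show per-document state."""
    return jobs[job_id]["status"]


job_id = submit_upload("example.pdf")
while get_status(job_id) not in ("done", "failed"):
    time.sleep(0.05)  # a UI would show "processing" here instead of blocking
print(get_status(job_id))
```

The upload call returns as soon as the job is registered, so the UI can confirm receipt right away and then report "processing", "done", or "failed" per document instead of leaving the user to guess.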