infiniflow / ragflow

RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.
https://ragflow.io
Apache License 2.0

[Question]: Why are images not sent to the model even though the image-to-text model is set up #2585

Open · cyhasuka opened this issue 1 month ago

cyhasuka commented 1 month ago

Describe your problem

Why are images not sent to the model even though the image-to-text model is set up?

Currently, pictures are sent to the knowledge base and run through OCR, and the CV LLM is only called when OCR does not recognize enough text. This makes sense if the model selected on the Chat page is a plain LLM. But when the model selected on that page is a CV LLM, pictures should be sent directly to the model. [screenshot]
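
For reference, the routing described above could be summarized roughly as follows. This is only an illustrative sketch of the behavior reported in this thread, not RAGFlow's actual code; the function names and the text-length threshold are hypothetical.

```python
# Illustrative sketch of the current routing (hypothetical names, not RAGFlow internals).
def caption_image(image, ocr, vision_llm, min_ocr_chars=32):
    """Run OCR first; call the CV LLM only when OCR recognizes too little text."""
    text = ocr.extract_text(image)          # hypothetical OCR interface
    if len(text.strip()) >= min_ocr_chars:
        # Enough text recognized: the OCR output is what gets indexed and used in chat.
        return text
    # OCR found too little text: fall back to the multimodal (CV) LLM for a description.
    return vision_llm.describe(image)       # hypothetical CV LLM interface
```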

KevinHuSh commented 1 month ago

You need to set it up here: [screenshot]

cyhasuka commented 1 month ago

Yes, I have already set it up there, but the problem continues. Pictures are not being sent directly to the CV LLM.

KevinHuSh commented 1 month ago

In our evaluation, without OCR a multimodal LLM can't capture detailed information in most cases, especially when there is text in the image. So, could you elaborate on your use cases?

cyhasuka commented 1 month ago

For example, the Qwen2-VL demo on Hugging Face Spaces takes an image along with the prompt, and the result is output after processing.

According to our tests, today's image-to-text models (e.g., Qwen2-VL, GPT-4o) have much better image understanding than the dedicated OCR models in RAGFlow. Pictures that contain both text and geometric/visual features cannot be accurately understood by OCR alone.

Therefore, we believe that if a user specifies a CV LLM in a dialogue, the picture should be sent to the model along with the prompt.
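
A minimal sketch of what is being requested, assuming an OpenAI-compatible multimodal chat API (as used by GPT-4o and many Qwen2-VL deployments); the helper below and its parameters are illustrative, not part of RAGFlow:

```python
import base64

def build_user_message(prompt: str, image_bytes: bytes, model_is_vision: bool):
    """Attach the image directly to the prompt when the selected chat model is a CV LLM."""
    if not model_is_vision:
        # Plain LLM selected: keep the existing text-only path (OCR output, not shown here).
        return {"role": "user", "content": prompt}
    image_b64 = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            # OpenAI-style multimodal payload: the image travels with the prompt.
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }
```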

Jas0nxlee commented 1 month ago

I also have the same problem and hope to send images from the knowledge base to a multimodal model.

KevinHuSh commented 1 month ago

> I also have the same problem and hope to send images from the knowledge base to a multimodal model.

If OCR can't extract text from a picture, the picture will be sent to the multimodal model. Is that okay for you?

cyhasuka commented 1 month ago

> > I also have the same problem and hope to send images from the knowledge base to a multimodal model.
>
> If OCR can't extract text from a picture, the picture will be sent to the multimodal model. Is that okay for you?

No. If a user specifies a CV LLM in a dialogue, the picture should be sent to the model along with the prompt.

As an example scenario: suppose I send a photo with an advertising board in the background. OCR will only recognize the text on the board, which I don't need. I need the LLM to understand the image, not simply extract the text. Otherwise, deploying a CV LLM in chat is pointless for me.

cyhasuka commented 1 month ago

At the very least, the user could be given a choice on the front-end page.
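
As an illustration of what such a choice could look like, here is a hypothetical per-assistant setting; the field name below does not exist in RAGFlow today and is only meant to show the idea:

```python
# Hypothetical assistant configuration (the toggle below is NOT an existing RAGFlow option).
assistant_config = {
    "llm_id": "qwen2-vl-7b-instruct",        # the CV LLM chosen on the Chat page
    "send_images_to_vision_model": True,     # proposed toggle: bypass OCR, send images directly
}

def should_bypass_ocr(config: dict) -> bool:
    """Return True when the user opted to send images straight to the vision model."""
    return bool(config.get("send_images_to_vision_model", False))
```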

Jas0nxlee commented 1 month ago

> At the very least, the user could be given a choice on the front-end page.

It's a good point.