cyhasuka opened this issue 1 month ago
You need to set it up here: https://github.com/user-attachments/assets/07e6f0c6-1af5-4939-b441-e9d8e533f176
Yes, I have already set it up there, but the problem continues. Pictures are still not sent directly to the CV LLM.
In our evaluation, without OCR a multi-modal LLM can't capture detailed information in most cases, especially when there is text in the image. So, could you elaborate on your use cases?
For example, the Qwen2-VL demo on HF Spaces asks for an image to be entered along with the prompt, and the result is output after processing.
According to our tests, today's image-to-text models (e.g., Qwen2-VL, GPT-4o) have much better image-understanding performance than the dedicated OCR models in ragflow. Pictures that contain both text and graphical features cannot be accurately understood by OCR alone.
Therefore, we believe that if a user specifies a CV LLM in a dialogue, the picture should be sent to that model along with the prompt.
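For reference, this is roughly what "an image entered along with the prompt" looks like as a client call against an OpenAI-compatible vision endpoint (GPT-4o here; many Qwen2-VL deployments expose the same request shape). This is only an illustrative sketch, not ragflow code, and the file name and model name are placeholders:

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Encode the picture so it can travel inside the chat request.
with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",  # or a Qwen2-VL model served behind an OpenAI-compatible API
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Describe this photo, including the scene, not just the text on signs."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```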
I also have the same problem and hope to send images from the knowledge base to a multimodal model.
If OCR can't extract text from a picture, the picture will be sent to the multimodal model. Is that okay for you?
No. If a user specifies a CV LLM in a dialogue, the picture should be sent to that model along with the prompt.
As an example scenario: if I send a photo with an advertising board in the background, OCR will only recognize the text on the board, which is not what I need. I need the LLM to understand the image, not simply extract the text. Otherwise, there is no point in my deploying a CV LLM in chat.
At the very least, the user can be given a choice on the front-end page.
It's a good point.
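To make that front-end choice concrete, it could surface as a per-dialogue or per-message option. The request shape below is purely hypothetical (the `image_handling` field does not exist in ragflow today); it only illustrates what the user-facing choice might carry:

```python
import json

# Hypothetical request body illustrating the suggested front-end choice.
chat_request = {
    "conversation_id": "demo",
    "model": "qwen2-vl-7b-instruct",
    "message": "What is happening in this photo?",
    "image": "<base64-encoded picture>",
    # "direct"   = send the picture straight to the CV LLM with the prompt
    # "ocr_first" = keep the current OCR-then-fallback behaviour
    "image_handling": "direct",
}

print(json.dumps(chat_request, indent=2))
```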
Describe your problem
Why are images not sent to the model even though an image-to-text model has been set up?
Currently, pictures are sent to the knowledge base and run through OCR, and the CV LLM is only called when OCR does not recognise enough text. This makes sense when the model selected on the Chat page is a text LLM. But when the model selected on that page is a CV LLM, pictures should be sent directly to the model.
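A rough sketch of the dispatch being asked for, with every name invented for illustration (ragflow's real internals differ): if the model selected on the Chat page is a CV LLM, the picture goes to it directly together with the prompt; otherwise the current OCR-first path with a CV-LLM fallback applies.

```python
MIN_OCR_CHARS = 20  # hypothetical threshold for "OCR recognised enough text"

def run_ocr(image_bytes: bytes) -> str:
    """Placeholder for the OCR step."""
    return ""

def describe_with_cv_llm(image_bytes: bytes, prompt: str) -> str:
    """Placeholder for a call to the configured image-to-text (CV) model."""
    return "<CV LLM answer>"

def answer_with_image(image_bytes: bytes, prompt: str, chat_model_is_cv_llm: bool) -> str:
    if chat_model_is_cv_llm:
        # Requested behaviour: the picture is sent directly to the CV LLM
        # together with the prompt, with no OCR detour.
        return describe_with_cv_llm(image_bytes, prompt)

    # Current behaviour: OCR first; fall back to the CV LLM only when OCR
    # does not recognise enough text, otherwise answer from the extracted text.
    text = run_ocr(image_bytes)
    if len(text.strip()) < MIN_OCR_CHARS:
        return describe_with_cv_llm(image_bytes, prompt)
    return f"<text LLM answer based on OCR text {text!r} and prompt {prompt!r}>"
```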