infiniflow / ragflow

RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.
https://ragflow.io
Apache License 2.0

[Feature Request]: Requesting integration of widely used Tesseract OCR #1900

Open vishaldwdi opened 1 month ago

vishaldwdi commented 1 month ago

Is there an existing issue for the same feature request?

Is your feature request related to a problem?

I am aware that RAGFlow has DeepDoc, but I would like to request integration of the very popular and widely used Tesseract OCR, which supports more than 100 languages.

Easily applicable.

Describe the feature you'd like

I am requesting an easily implementable feature enhancement to achieve the workflow below, which would make life easier for researchers, investigators, officials, and PhD students.

(I'm writing this from a researcher's perspective. I'm aware that certain capabilities may already be implemented in some form, and that part of this was requested earlier. My goal is to make the end product much more cohesive ahead of the upcoming university season.)

# High-Level Workflow

Step 1: Add Document for RAG - The user uploads a document (e.g., PDF, image, or text file) to the system. RAGFlow processes the document and stores it in the corresponding database.

Step 2: RAGFlow Checks Whether the Document Requires OCR - RAGFlow analyzes the document to determine whether it requires OCR (Optical Character Recognition). If the document is an image or a scanned PDF, it likely requires OCR.

Step 3: OpenCV + Pillow Preprocessing Before OCR - If OCR is required, RAGFlow preprocesses the document with OpenCV and Pillow, then uses Tesseract OCR to extract the text. The extracted text is stored to improve the corresponding database.

(I have personally tested that OpenCV + Pillow preprocessing before Tesseract improves complex text recognition by 52% while supporting more than 100 languages.)

Step 4: Database Improvement - If OCR was required, the extracted text is used to improve the database. If OCR was not required, RAGFlow uses its built-in capabilities to improve the database with the uploaded document.

Step 5: User Enters Query - The user enters a query or question.

Step 6: Database Search and Web Search (if the database is insufficient) - RAGFlow searches the database to answer the user's query. If the database search yields insufficient results, RAGFlow uses a web search API (e.g., Google Custom Search or SearXNG) to fetch relevant results. The web search results are then stored in the database.

Step 7: RAGFlow Processing - RAGFlow processes the query using its LLMs, accessed through APIs. The LLMs generate a response based on the database search and web search results.

Step 8: Response Generation - RAGFlow returns the generated response to the user's query.

This workflow integrates OCR, web search, and LLM capabilities to provide accurate and up-to-date responses to user queries.
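To make Steps 2 and 3 more concrete, here is a minimal sketch of how the OCR check and the preprocessing could look. It assumes `opencv-python`, `Pillow`, and `pytesseract` are installed along with the Tesseract binary. The function names, the list of image suffixes, and the 20-character threshold are all hypothetical choices for illustration and are not part of RAGFlow.

```python
from pathlib import Path

# Hypothetical helpers for illustration only; not RAGFlow's actual API.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def needs_ocr(filename: str, embedded_text: str = "", min_chars: int = 20) -> bool:
    """Step 2: guess whether a document needs OCR.

    Images always need OCR. A PDF needs it when its text layer is (nearly)
    empty, which is typical of scans. The threshold is an assumption.
    """
    suffix = Path(filename).suffix.lower()
    if suffix in IMAGE_SUFFIXES:
        return True
    if suffix == ".pdf":
        return len(embedded_text.strip()) < min_chars
    return False


def ocr_image(path: str, lang: str = "eng") -> str:
    """Step 3: OpenCV + Pillow preprocessing, then Tesseract."""
    # Imported lazily so needs_ocr() works without the OCR stack installed.
    import cv2
    import pytesseract
    from PIL import Image

    image = cv2.imread(path)
    if image is None:  # cv2.imread returns None instead of raising
        raise FileNotFoundError(f"cannot read image: {path}")
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    gray = cv2.medianBlur(gray, 3)  # suppress salt-and-pepper noise
    # Otsu's method picks the binarization threshold automatically.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return pytesseract.image_to_string(Image.fromarray(binary), lang=lang)
```

`ocr_image` only handles image files. Pages of a scanned PDF would first have to be rendered to images, which is not shown here. The preprocessing steps (grayscale, median blur, Otsu threshold) are one common combination and not necessarily the one behind the 52% figure above.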

Reference:
https://github.com/ItzCrazyKns/Perplexica
https://github.com/tesseract-ocr/tesseract
https://pypi.org/project/opencv-python/
https://pillow.readthedocs.io/en/stable/
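For the query side of the workflow (Step 6), the fallback from database search to web search could be sketched roughly as follows. This is only an illustration: `retrieve_context`, the `min_hits` threshold, and the in-memory stand-ins for the database and the web search API are hypothetical, and a real version would call the actual retrieval backend and a search API such as SearXNG.

```python
from typing import Callable, List

SearchFn = Callable[[str], List[str]]


def retrieve_context(
    query: str,
    db_search: SearchFn,
    web_search: SearchFn,
    store: Callable[[List[str]], None],
    min_hits: int = 3,
) -> List[str]:
    """Search the database first; fall back to the web if results are thin."""
    hits = db_search(query)
    if len(hits) >= min_hits:
        return hits
    web_hits = web_search(query)
    store(web_hits)  # web results are stored so later queries can reuse them
    return hits + web_hits


# In-memory stand-ins for the database and the web search API.
database = ["tesseract supports many languages"]


def fake_db_search(query: str) -> List[str]:
    return [doc for doc in database if query.lower() in doc.lower()]


def fake_web_search(query: str) -> List[str]:
    return [f"web result about {query}"]


context = retrieve_context(
    "tesseract", fake_db_search, fake_web_search, database.extend, min_hits=2
)
```

After this call the web result has been added to `database`, so repeating the same query is answered from the database alone. The returned list would then be passed to the LLM as context in Step 7.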

Describe implementation you've considered

No response

Documentation, adoption, use case

No response

Additional information

No response

netandreus commented 3 weeks ago

+1 I'm also interested in this feature.