[Question]: How can I extract text automatically segmented as per the layout

infiniflow / ragflow

RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.

https://ragflow.io

Apache License 2.0

[Question]: How can I extract text automatically segmented as per the layout #257

Open shubhdotai opened 7 months ago

shubhdotai commented 7 months ago

Describe your problem

I have a PDF with two columns, headings, and tables. I want to extract the text/OCR result separately for each layout segment. How can I do this directly using only deepdoc?

KevinHuSh commented 7 months ago

You need to use the layout recognizer. Please look at the code in rag/app. Hope this helps.

Thanks for following

shubhdotai commented 7 months ago

The layout recognizer only returns the layout (the bounding box and its label); it doesn't return the text inside that box. Is there a direct function or code for that?

KevinHuSh commented 7 months ago

This function is for this purpose.
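
For readers who do not have the referenced function at hand, the general idea of pairing layout boxes with OCR text can be sketched as below. This is only an illustration under assumptions, not deepdoc's API: `Box`, `overlap_ratio`, and `group_text_by_layout` are hypothetical names, and the sketch assumes you already have layout regions (box plus label) and OCR lines (box plus text) in the same page coordinate space. Each OCR line is assigned to the layout region that contains most of its area.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical container: a rectangle plus a payload.

    For a layout region, `value` is the label (e.g. "title", "text").
    For an OCR line, `value` is the recognized text.
    """
    x0: float
    top: float
    x1: float
    bottom: float
    value: str


def overlap_ratio(inner: Box, outer: Box) -> float:
    """Fraction of `inner`'s area that lies inside `outer`."""
    w = min(inner.x1, outer.x1) - max(inner.x0, outer.x0)
    h = min(inner.bottom, outer.bottom) - max(inner.top, outer.top)
    if w <= 0 or h <= 0:
        return 0.0
    area = (inner.x1 - inner.x0) * (inner.bottom - inner.top)
    return (w * h) / area if area > 0 else 0.0


def group_text_by_layout(regions, ocr_lines, threshold=0.5):
    """Attach each OCR line to the region that covers it best.

    Lines whose best overlap is below `threshold` are dropped.
    Returns one dict per region, in the order the regions were given.
    """
    buckets = [[] for _ in regions]
    for line in ocr_lines:
        best_idx, best_ratio = None, 0.0
        for i, region in enumerate(regions):
            ratio = overlap_ratio(line, region)
            if ratio > best_ratio:
                best_idx, best_ratio = i, ratio
        if best_idx is not None and best_ratio >= threshold:
            buckets[best_idx].append(line)

    result = []
    for region, lines in zip(regions, buckets):
        # Within a region, read top to bottom, then left to right.
        lines.sort(key=lambda b: (b.top, b.x0))
        result.append({
            "label": region.value,
            "bbox": (region.x0, region.top, region.x1, region.bottom),
            "text": " ".join(b.value for b in lines),
        })
    return result


# Made-up two-column page: a title and one text block per column.
regions = [
    Box(0, 0, 200, 20, "title"),
    Box(0, 30, 95, 100, "text"),
    Box(105, 30, 200, 100, "text"),
]
ocr_lines = [
    Box(105, 32, 195, 42, "right one"),
    Box(0, 44, 90, 54, "left two"),
    Box(10, 2, 80, 18, "Report"),
    Box(0, 32, 90, 42, "left one"),
]
segments = group_text_by_layout(regions, ocr_lines)
for seg in segments:
    print(seg["label"], "|", seg["text"])
```

Assigning by the overlap fraction of the OCR line, rather than by the line's center point, tolerates boxes that spill slightly over a region border. The `threshold` value of 0.5 is an arbitrary starting point and would need tuning on real pages.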