pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

A custom sorting method is required #60

Closed yoke233 closed 4 months ago

yoke233 commented 4 months ago

The default sorting method is not suitable in all cases. Please add a parameter that accepts a custom sorting function.

JorjMcKie commented 4 months ago

Please be more specific: Sorting what / where?

yoke233 commented 4 months ago

[Image: bboxes from text]

The x-coordinate of the purple-outlined block is to the left of the green one, so when the blocks are sorted, the purple block is placed before the green block, which obviously violates the reading order.
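
The effect can be reproduced without any PDF. The following is a minimal sketch with invented coordinates. It assumes a sort that compares the left edge first, which is how the behavior is described above, not a confirmed detail of the library. Rectangles are plain `(x0, y0, x1, y1)` tuples with y growing downward.

```python
# Rectangles are (x0, y0, x1, y1) tuples; y grows downward on the page.
# All coordinates are invented for illustration only.
green = (60, 100, 300, 200)   # block that should be read first (higher on the page)
purple = (50, 220, 300, 320)  # block below it, starting slightly further left

blocks = [green, purple]

# Assumed ordering rule: compare the left edge first, then the top edge.
# The purple block wins because its x0 (50) is smaller than green's (60).
by_left_edge = sorted(blocks, key=lambda r: (r[0], r[1]))

# Alternative rule for this simple case: top edge first, then left edge.
# The green block comes first because it sits higher on the page.
by_top_edge = sorted(blocks, key=lambda r: (r[1], r[0]))

print(by_left_edge[0] == purple)  # purple is placed first
print(by_top_edge[0] == green)    # green is placed first
```

Neither rule is correct for every layout, which is the motivation for letting the caller choose.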

JorjMcKie commented 4 months ago

Ah ok. Thanks for the explanation.

Please try the latest version 0.0.6 - you did not mention the version you use. My goal in general however is to detect situations like the one you describe.

I also understand now that you are referring to sorting the sequence of the detected rectangles that finally represent page columns. Let me consider how a user-supplied callable could be invoked. Currently there is a simple lambda function that uses the corners of these rectangles.
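
One way a user-supplied callable could be wired in is sketched below. This is not the pymupdf4llm API: `sort_column_rects`, `default_key`, `top_down_key` and the `sort_key` parameter are hypothetical names, and the default key is only a guess at what "uses the corners of these rectangles" might mean.

```python
from typing import Callable, List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1), y grows downward


def default_key(rect: Rect) -> Tuple[float, float]:
    """Assumed built-in ordering: left edge first, then top edge."""
    return (rect[0], rect[1])


def sort_column_rects(
    rects: List[Rect],
    sort_key: Optional[Callable[[Rect], object]] = None,
) -> List[Rect]:
    """Return the rectangles in reading order.

    sort_key is a hypothetical hook: any callable that maps a rectangle
    to a sortable value. When omitted, default_key is used.
    """
    return sorted(rects, key=sort_key if sort_key is not None else default_key)


def top_down_key(rect: Rect) -> Tuple[float, float]:
    """Example user-supplied ordering: top edge first, then left edge."""
    return (rect[1], rect[0])


# Invented rectangles: the second one starts further left but lower down.
rects = [(60, 100, 300, 200), (50, 220, 300, 320)]

default_order = sort_column_rects(rects)
custom_order = sort_column_rects(rects, sort_key=top_down_key)
```

With the assumed default, the lower rectangle is returned first because its left edge is smaller. With `top_down_key`, the original top-to-bottom order is kept.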

If you can provide your example page I would check the behavior of the code too.

JorjMcKie commented 4 months ago

Closing for lack of feedback over an extended period of time. Please open another issue if you can provide a reproducing file.