Table is not extracted and some text order was wrong for this PDF

pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF

https://pymupdf.readthedocs.io/en/latest/pymupdf4llm

GNU Affero General Public License v3.0

Table is not extracted and some text order was wrong for this PDF #138

Closed: bbfrog closed this issue 2 months ago

bbfrog commented 2 months ago

Hi, I am testing the conversion of this PDF to Markdown text: https://www.jacionline.org/action/showPdf?pii=S0091-6749%2822%2901181-2
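For readers who want to reproduce the test, here is a minimal sketch of such a conversion. It assumes the `pymupdf4llm` package is installed and that the article has been saved locally; the file name `paper.pdf` and both helper functions are hypothetical, not taken from the thread. `pymupdf4llm.to_markdown` expects 0-based page numbers in its `pages` argument, so the helper converts the 1-based numbers a reader would use.

```python
def to_zero_based(pages_one_based):
    """Convert 1-based page numbers (as a reader counts them) to 0-based indices."""
    return [p - 1 for p in pages_one_based]


def convert_pages(pdf_path, pages_one_based):
    """Return Markdown text for the selected pages of a local PDF file."""
    # Imported lazily so this module still loads where pymupdf4llm is absent.
    import pymupdf4llm

    return pymupdf4llm.to_markdown(pdf_path, pages=to_zero_based(pages_one_based))


if __name__ == "__main__":
    # "paper.pdf" is a hypothetical local copy of the article linked above.
    print(convert_pages("paper.pdf", [4, 7]))
```

Restricting the run to the affected pages keeps the output short enough to compare against the PDF by eye.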

There are two problems; the second one is more important. Thanks!

1. None of the tables in this PDF (pages 4, 6, and 9) are extracted as Markdown tables.
2. The text order on page 4 is wrong: the right-column text (starting with "relationship. The multiple..") comes before the left-column text (starting with "with CSU was investigated..."). Page 7 has the same problem.
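The expected reading order for a two-column page can be sketched as follows. This is only an illustration of the problem, not the fix that was later shipped in the library: it assumes a strict two-column layout, and the block tuples `(x0, y0, x1, y1, text)`, their coordinates, and the page width are made up for the example. Full-width elements such as titles or tables would need separate handling.

```python
def two_column_order(blocks, page_width):
    """Order text blocks left column first, then right column, top to bottom.

    Each block is a tuple (x0, y0, x1, y1, text) in page coordinates,
    with y growing downwards.
    """
    middle = page_width / 2

    def in_left_column(block):
        # A block belongs to the left column if its horizontal center
        # lies left of the page middle.
        return (block[0] + block[2]) / 2 < middle

    def top_then_left(block):
        return (block[1], block[0])

    left = sorted((b for b in blocks if in_left_column(b)), key=top_then_left)
    right = sorted((b for b in blocks if not in_left_column(b)), key=top_then_left)
    return left + right


# Hypothetical coordinates; the text snippets are the ones quoted above.
blocks = [
    (310, 80, 560, 300, "relationship. The multiple.."),
    (40, 80, 290, 300, "with CSU was investigated..."),
]
ordered_texts = [b[4] for b in two_column_order(blocks, page_width=600)]
print(ordered_texts)
```

With this ordering the left-column block is emitted before the right-column one, which is the order a reader expects on pages 4 and 7.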

JorjMcKie commented 2 months ago

I cannot do anything about the tables: this type of table simply cannot be detected by the table finder. But the other issue - the one that matters more to you - has a fix, which will be part of the next version.

bbfrog commented 2 months ago

Great, thank you very much, JorjMcKie! Could you please let me know the release date of the next version?

JorjMcKie commented 2 months ago

Fixed in version 0.0.15.
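To check whether a local installation already includes this fix, something like the following sketch may help. It assumes the distribution is named `pymupdf4llm` and that its version strings are plain dotted numbers such as `0.0.15`; the helper names are hypothetical. Versions are compared as integer tuples because a plain string comparison would wrongly rank "0.0.9" above "0.0.15".

```python
from importlib.metadata import PackageNotFoundError, version

FIXED_IN = "0.0.15"  # version named in the comment above


def parse_version(text):
    """Turn a plain dotted version such as "0.0.15" into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def has_fix(installed, fixed_in=FIXED_IN):
    """Return True if the installed version is at least the fixed version."""
    return parse_version(installed) >= parse_version(fixed_in)


def installed_has_fix(package="pymupdf4llm"):
    """Return True or False for an installed package, None if it is missing."""
    try:
        return has_fix(version(package))
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(installed_has_fix())
```

If the check returns False, upgrading with `pip install -U pymupdf4llm` should pull in a release that contains the fix.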

bbfrog commented 2 months ago

Thanks very much!