pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

First column of table is repeated before the actual table #173

Closed by johnmara-pc14 2 weeks ago

johnmara-pc14 commented 3 weeks ago

Description of the bug

I have a PDF with the following structure: text, table, text.

When I call pymupdf4llm.to_markdown(), the resulting Markdown comes out in this order: text, first column of the table, text, table.

For example, the Markdown output for the attached PDF (sample_pdf.pdf) is:

This is a story about customer loans.

Col 1

Value 1

Another section goes here

|Col 1|Col 2|
|---|---|
|Value 1|Value 2|

-----

The first column is not repeated only when the table is the last element on the page.
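The symptom above can be checked mechanically. The following is a minimal sketch, not part of pymupdf4llm: `find_leaked_cells` is a hypothetical helper that looks only at the Markdown string and reports table cell texts that also show up as standalone lines before the first table. It assumes simple pipe tables like the one in this report.

```python
def find_leaked_cells(md_text: str) -> list[str]:
    """Return table cell texts that also appear as standalone lines before the first table.

    Hypothetical helper for illustration; it is not a pymupdf4llm function.
    """
    lines = [line.strip() for line in md_text.splitlines()]

    def is_table_row(line: str) -> bool:
        return line.startswith("|") and line.endswith("|")

    # Index of the first Markdown table row, if any.
    table_start = next((i for i, line in enumerate(lines) if is_table_row(line)), None)
    if table_start is None:
        return []

    # Collect the cell texts of the first table, skipping the |---|---| separator row.
    cells = set()
    for line in lines[table_start:]:
        if not is_table_row(line):
            break
        parts = [part.strip() for part in line.strip("|").split("|")]
        if all(set(part) <= set("-:") for part in parts):
            continue
        cells.update(part for part in parts if part)

    # A standalone line before the table that equals a cell text counts as a leak.
    return [line for line in lines[:table_start] if line in cells]


# The output reported in this issue, pasted as a string.
buggy_md = """This is a story about customer loans.

Col 1

Value 1

Another section goes here

|Col 1|Col 2|
|---|---|
|Value 1|Value 2|

-----
"""

print(find_leaked_cells(buggy_md))
```

A check like this can produce false positives when the body text legitimately repeats a cell value, so it is better suited to a regression test on a known PDF than to general validation.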

How to reproduce the bug

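To reproduce it, run the code below on the attached PDF:

import pymupdf4llm

# example_docs_dir and filename are my own variables pointing at the attached PDF
md_text = pymupdf4llm.to_markdown(
    doc=f"{example_docs_dir}/{filename}",
)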

PyMuPDF version

1.24.11

Operating system

MacOS

Python version

3.11

JorjMcKie commented 3 weeks ago

You should have reported this in pymupdf4llm because there is no issue in pymupdf. The problem version of pymupdf4llm is 0.0.17. It works ok in 0.0.16.
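For anyone who wants to guard against the affected release in their own scripts, here is a rough sketch using only the standard library. The names `AFFECTED_VERSIONS` and `is_affected` are made up for this example, and the set contains only the version named in this thread (0.0.17); nothing here says anything about other releases.

```python
from importlib.metadata import PackageNotFoundError, version

# Only the release named in this thread; 0.0.16 was reported to work.
AFFECTED_VERSIONS = {"0.0.17"}


def is_affected(version_string: str) -> bool:
    """Return True if the given pymupdf4llm version is the one reported as broken here."""
    return version_string in AFFECTED_VERSIONS


try:
    installed = version("pymupdf4llm")
except PackageNotFoundError:
    installed = None

if installed is None:
    print("pymupdf4llm is not installed")
elif is_affected(installed):
    # Pinning the earlier release, e.g. `pip install pymupdf4llm==0.0.16`,
    # is the workaround reported in this thread.
    print(f"pymupdf4llm {installed} is the version reported in this issue")
else:
    print(f"pymupdf4llm {installed} is not the version reported in this issue")
```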

Meaveryway commented 3 weeks ago

This seems related to issue #171. The fix suggested there gives this output:

This is a story about customer loans.

|Col 1|Col 2|
|---|---|
|Value 1|Value 2|

Another section goes here

-----

johnmara-pc14 commented 2 weeks ago

> You should have reported this in pymupdf4llm because there is no issue in pymupdf. The problem version of pymupdf4llm is 0.0.17. It works ok in 0.0.16.

Switching to the previous version (0.0.16) seems to do the trick. Thank you!