pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

Very long titles when converting to markdown #158

Open Fianax opened 1 month ago

Fianax commented 1 month ago

I have PDFs with titles that span two or more lines. When a PDF is converted to Markdown, those titles come out split (because the PDF itself breaks them across lines).

I attach the original pdf and the generated markdown file:

prueba_indices_enormes.pdf prueba_indices_enormes_new_markdown.md

The content of the PDF is made up; what matters is how the section titles come out.

You can see that when a section title is very long and the PDF wraps it across several lines, a small gap appears and the continuation line has no '#' to mark it as part of the section title.

Is this expected, or is it a bug in the Markdown conversion?


I'm using pdf4llm version 0.0.9:

import pdf4llm

# Convert the PDF to Markdown, using the full page area (no margins)
md_text = pdf4llm.to_markdown(
    doc='temp/prueba_indices_enormes.pdf',
    margins=0,
)

JorjMcKie commented 1 month ago

The current logic already detects when multiple lines with the same header-level font size follow each other.

But it does not yet always remove all line breaks when joining the header text fragments. The fix now ensures this.
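
To illustrate the idea described above, here is a minimal sketch of joining consecutive header fragments. It is a simplified illustration only, not the actual pymupdf4llm code: the function name `join_header_fragments`, the `(text, font_size)` input shape, and the `header_levels` mapping are all hypothetical names made up for this example.

```python
def join_header_fragments(lines, header_levels):
    """Merge consecutive lines that share a header level into one header.

    lines:         list of (text, font_size) tuples (hypothetical input shape)
    header_levels: dict mapping a font size to a Markdown prefix such as "##"
    """
    out = []
    pending_prefix = None   # header prefix of the fragments collected so far
    pending_parts = []      # text fragments belonging to the current header

    def flush():
        nonlocal pending_prefix, pending_parts
        if pending_parts:
            out.append(f"{pending_prefix} {' '.join(pending_parts)}")
        pending_prefix, pending_parts = None, []

    for text, size in lines:
        prefix = header_levels.get(size)
        # Collapse all whitespace, including embedded line breaks.
        cleaned = " ".join(text.split())
        if prefix is None:
            flush()                 # body text ends any open header
            out.append(cleaned)
        elif prefix == pending_prefix:
            pending_parts.append(cleaned)   # same level: continue the header
        else:
            flush()                 # different level: start a new header
            pending_prefix, pending_parts = prefix, [cleaned]
    flush()
    return out


example = [
    ("1. A very long title that\n", 16.0),
    ("wraps onto a second line", 16.0),
    ("Body text.", 10.0),
]
for line in join_header_fragments(example, {16.0: "##"}):
    print(line)
```

With this input, the two 16-point fragments end up in a single `##` line and the body text stays separate.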

JorjMcKie commented 1 month ago

Thanks for reporting this. This bug was present in your package version. In the future please make sure to confirm bugs with the current version.
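
One way to check which version is installed before reporting is sketched below. It uses only the standard library; the helper name `installed_versions` is made up for this example, and the two distribution names are the ones mentioned in this thread.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names=("pymupdf4llm", "pdf4llm")):
    """Return {distribution name: installed version, or None if not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


print(installed_versions())
```

The printed versions can then be compared with the latest release listed on PyPI.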

Fianax commented 1 month ago

> The current logic already detects when multiple lines with the same header-level font size follow each other.
>
> But it does not yet always remove all line breaks when joining the header text fragments. The fix now ensures this.

So it's already solved?

Thanks for the quick answer.

Fianax commented 1 month ago

> Thanks for reporting this. This bug was present in your package version. In the future please make sure to confirm bugs with the current version.

Isn't version 0.0.9 the latest version of pdf4llm?

I thought it was, because the pdf4llm page said it was the latest.

Sorry for the confusion.

JorjMcKie commented 1 month ago

Ah ok, I see. That other repo is just an alias of pymupdf4llm and is therefore automatically current.

BTW "fix developed" means that I have a fix locally. It is not yet published on PyPI.

JorjMcKie commented 1 month ago

That was a good point of yours though. I will make sure that the versions coincide in the future.

Fianax commented 1 month ago

> Ah ok, I see. That other repo is just an alias of pymupdf4llm and is therefore automatically current.
>
> BTW "fix developed" means that I have a fix locally. It is not yet published on PyPI.

Okay.

Thank you very much for the help and for explaining 'fix developed'.

I will wait for the fix.

Fianax commented 1 month ago

> That was a good point of yours though. I will make sure that the versions coincide in the future.

Thank you for keeping the package “alive”.