Open Fianax opened 1 month ago
The current logic already detects when multiple lines with the same header-level font size follow each other.
However, it does not yet always remove all line breaks when joining the header text fragments. This fix ensures that it does.
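To illustrate the idea described above, here is a minimal sketch, not the actual pymupdf4llm code. It assumes each line arrives as a `(text, font_size)` pair and that a hypothetical `HEADER_PREFIX` table maps font sizes to Markdown header prefixes; both the function name and the table are invented for this example.

```python
import re

# Hypothetical mapping from font size to Markdown header prefix.
# The real library derives header levels differently; this is an assumption.
HEADER_PREFIX = {20.0: "# ", 16.0: "## "}


def join_header_lines(lines):
    """Merge consecutive lines that share a header level into one header.

    lines: list of (text, font_size) tuples in reading order.
    Returns a list of Markdown lines.
    """
    out = []
    prev_prefix = None
    for text, size in lines:
        prefix = HEADER_PREFIX.get(size, "")
        # Collapse embedded line breaks and whitespace runs into single spaces.
        clean = re.sub(r"\s+", " ", text).strip()
        if prefix and prefix == prev_prefix:
            # Same header level as the previous line: continue that header.
            out[-1] = out[-1] + " " + clean
        else:
            out.append(prefix + clean)
        prev_prefix = prefix
    return out


wrapped = [
    ("A very long section\n", 16.0),
    ("title that wraps", 16.0),
    ("Body text.", 11.0),
]
print(join_header_lines(wrapped))
```

The key point is that the line break inside each fragment is removed before joining, so the wrapped title ends up on a single `##` line instead of leaving a continuation line without a `#`.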
Thanks for reporting this. This bug was present in your package version. In the future please make sure to confirm bugs with the current version.
> The current logic already detects when multiple lines with the same header-level font size follow each other.
> However, it does not yet always remove all line breaks when joining the header text fragments. This fix ensures that it does.
So it's already solved?
Thanks for the quick answer.
> Thanks for reporting this. This bug was present in your package version. In the future please make sure to confirm bugs with the current version.
Isn't version 0.0.9 the latest version of pdf4llm?
I thought it was, because the pdf4llm page said it was the latest.
Sorry for the confusion.
Ah OK, I see. That other repo is just an alias of pymupdf4llm and is therefore automatically current.
BTW, "fix developed" means that I have a fix locally. It is not yet published on PyPI.
That was a good point of yours, though. I will make sure that the versions coincide in the future.
> Ah OK, I see. That other repo is just an alias of pymupdf4llm and is therefore automatically current.
> BTW, "fix developed" means that I have a fix locally. It is not yet published on PyPI.
Okay.
Thank you very much for the help and for explaining what "fix developed" means.
I will wait for the fix.
> That was a good point of yours though. I will make sure that the versions coincide in the future.
Thank you for keeping the package "alive".
I have PDFs with titles that span two or more lines. When the PDF is converted to Markdown, the titles are split (because the PDF itself breaks them across lines).
I attach the original PDF and the generated Markdown file:
prueba_indices_enormes.pdf prueba_indices_enormes_new_markdown.md
The content of the PDF is invented; what matters is the result it gives for the section titles.
You can see that when a title is very long and the PDF itself breaks it into several lines, a small gap is inserted and the continuation line has no '#' to mark it as part of the section title.
Is this normal, or is it an error in the Markdown conversion?
I'm using pdf4llm==0.0.9.
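For anyone wanting to confirm which versions are installed before reporting, a small sketch using only the standard library may help. The helper name `installed_version` is made up for this example; it only reports what is installed locally and does not check PyPI for the newest release.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# pdf4llm is described in this thread as an alias of pymupdf4llm,
# so it is worth checking both distributions.
for name in ("pdf4llm", "pymupdf4llm"):
    print(name, installed_version(name))
```

Comparing the two printed versions shows whether the alias and the main package are in step in the local environment.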