madlogos closed this issue 4 months ago
Ironically, this happens because of the advanced line-resynthesis feature: footnote references are smaller than the surrounding text and are also raised above the baseline of their line.
The change to be published corrects this, so your examples will look like this:
The area of repigmentation continued to progress as treatment continued, although none of the
patients achieved complete repigmentation. Radakovic-Fijan et al. [77] used dexamethasone
minipulses of 10 mg daily on two consecutive days per week up to 24 weeks.
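To illustrate why raised footnote references can confuse line reconstruction, here is a minimal sketch. It is not pymupdf4llm's actual code: the `Span` class, both grouping functions, and the coordinates are hypothetical. It contrasts grouping spans by an identical bottom coordinate with grouping them by vertical overlap.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical text span: text plus a simplified bounding box."""
    text: str
    x0: float      # left edge
    top: float     # top edge (y grows downward, as in PDF page coordinates)
    bottom: float  # bottom edge


def group_by_exact_bottom(spans):
    """Naive grouping: spans share a line only if their bottoms coincide."""
    lines = {}
    for s in spans:
        lines.setdefault(round(s.bottom, 1), []).append(s)
    return [
        " ".join(s.text for s in sorted(group, key=lambda s: s.x0))
        for _, group in sorted(lines.items())
    ]


def group_by_overlap(spans, min_overlap=0.5):
    """Group spans whose vertical extents overlap by at least `min_overlap`
    of the smaller height, so a small raised span stays on its line."""
    lines = []  # each entry: {"top": float, "bottom": float, "spans": [...]}
    for s in sorted(spans, key=lambda s: (s.top, s.x0)):
        for line in lines:
            overlap = min(s.bottom, line["bottom"]) - max(s.top, line["top"])
            smaller = min(s.bottom - s.top, line["bottom"] - line["top"])
            if smaller > 0 and overlap / smaller >= min_overlap:
                line["spans"].append(s)
                line["top"] = min(line["top"], s.top)
                line["bottom"] = max(line["bottom"], s.bottom)
                break
        else:
            # No existing line overlaps enough: start a new one.
            lines.append({"top": s.top, "bottom": s.bottom, "spans": [s]})
    lines.sort(key=lambda line: line["top"])
    return [
        " ".join(s.text for s in sorted(line["spans"], key=lambda s: s.x0))
        for line in lines
    ]
```

With made-up coordinates in which "[77]" is smaller and sits higher than its neighbours, the naive grouping puts the reference on a line of its own, while the overlap-based grouping keeps it between "et al." and "used".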
That's a great improvement. Just to double-check: the extracted sentence currently reads
Disease activity was arrested in 88% of 2 weeks of treatment. Side-effects (weight gain, insomnia, agitation, acne, patients with progressive disease after 18 Æ menstrual disturbances and hypertrichosis) were observed in 69% of patients.
Would this fix also change it to the following?
Disease activity was arrested in 88% of patients with progressive disease after 18Æ2 weeks of treatment. Side-effects (weight gain, insomnia, agitation, acne, menstrual disturbances and hypertrichosis) were observed in 69% of patients.
The fragment "2 weeks of treatment. Side-effects (weight gain, insomnia, agitation, acne," was incorrectly inserted before "patients with progressive disease". I think this is more critical.
Many thanks.
Solved in version 0.0.6.
This is the full outcome: pg_7.md
Thank you for fixing the bug!
However, I think the sentence order is still incorrect.
Thanks for reporting this!
You are quite right. The problem is caused by using the text extraction flag TEXT_DEHYPHENATE. This confuses the logic that puts together lines in the same spans.
Everything (I hope) is fine if we give up automatic de-hyphenation. See this result:
pg_7.md
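For readers who want to try extraction without de-hyphenation in plain PyMuPDF, the sketch below may help. It does not show how the pymupdf4llm fix was implemented: `clear_flag` and `extract_without_dehyphenation` are hypothetical helper names, while `TEXTFLAGS_TEXT`, `TEXT_DEHYPHENATE`, and the `flags` argument of `page.get_text()` are PyMuPDF names. It assumes a PyMuPDF version that can be imported as `pymupdf`.

```python
def clear_flag(flags: int, flag: int) -> int:
    """Return `flags` with the bit(s) of `flag` switched off."""
    return flags & ~flag


def extract_without_dehyphenation(pdf_path: str, page_number: int = 0) -> str:
    """Extract plain text from one page with de-hyphenation explicitly off."""
    import pymupdf  # imported lazily so clear_flag works without PyMuPDF

    # Start from PyMuPDF's flag set for plain-text extraction and make sure
    # the de-hyphenation bit is off. Clearing it is harmless if it was unset.
    flags = clear_flag(pymupdf.TEXTFLAGS_TEXT, pymupdf.TEXT_DEHYPHENATE)
    with pymupdf.open(pdf_path) as doc:
        return doc[page_number].get_text("text", flags=flags)
```

Calling `extract_without_dehyphenation("pg_7.pdf")` would return the first page's text. Words split across lines then keep their hyphens, which is the trade-off mentioned above.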
Thanks to the authors who developed pymupdf4llm. It has made text extraction from PDFs much easier.
I found a layout bug when extracting text from an academic paper. The method .to_markdown() correctly identified most of the contents of the attached file, pg_7.pdf.
But the order of some sentences is incorrect.
Correct text (from PyMuPDF's own page.get_text()):
However, pymupdf4llm produces the text below (the wrong part is in bold):
Would there be a fix to address this? Thanks.
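A quick way to spot this kind of problem automatically is to check whether two extractions contain the same words in a different order. The helper below is only an illustrative sketch: `words` and `reordered` are hypothetical names, and the comparison ignores case, punctuation, and markdown markers such as `**`.

```python
import re
from collections import Counter


def words(text: str) -> list[str]:
    """Lower-cased word tokens, ignoring punctuation and markup characters."""
    return re.findall(r"\w+", text.lower())


def reordered(reference: str, candidate: str) -> bool:
    """True if both texts use exactly the same words, but in a different order."""
    ref, cand = words(reference), words(candidate)
    return Counter(ref) == Counter(cand) and ref != cand
```

Here `reference` would be the output of page.get_text() and `candidate` the markdown produced by pymupdf4llm. A `True` result suggests that fragments were moved rather than lost.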