Mistakes in orchestrating sentences

madlogos commented 5 months ago

Thanks to the authors who developed pymupdf4llm. It has made text extraction from PDFs much easier.

I found a layout bug when extracting text from an academic paper. The .to_markdown() method correctly identified most of the content in the attached file.

pg_7.pdf

But the order of some sentences is incorrect.

Correct text (from PyMuPDF's own page.get_text()):

dose of corticosteroids, the dose was increased to 7Æ5 mg daily and then reduced to 5 mg daily when disease progression was arrested. Within 1–3 months of treatment, 89% of patients with progressive disease stabilized, while within 2–4 months, repigmentation was observed in 80% of the patients. The area of repigmentation continued to progress as treatment contin- ued, although none of the patients achieved complete re- pigmentation. Radakovic-Fijan et al.77 used dexamethasone minipulses of 10 mg daily on two consecutive days per week up to 24 weeks. Disease activity was arrested in 88% of patients with progressive disease after 18Æ2 weeks of treat- ment. Side-effects (weight gain, insomnia, agitation, acne, menstrual disturbances and hypertrichosis) were observed in 69% of patients. Overall, OMP with either betamethasone or dexamethasone can arrest, without inducing repigmentation, the progression of vitiligo. During fast-spreading vitiligo, phototherapy is usually commenced after this intervention. However, there are no randomized clinical trials (RCTs) con- ﬁrming that either speed or magnitude of response to photo- therapy and photochemotherapy might be potentiated by concomitant administration of OMP.

However, pymupdf4llm produces the text below (the wrong part is in bold):

dose of corticosteroids, the dose was increased to 7 Æ 5 mg daily and then reduced to 5 mg daily when disease progression was arrested. Within 1–3 months of treatment, 89% of patients with progressive disease stabilized, while within 2–4 months, repigmentation was observed in 80% of the patients. The area

77 of repigmentation continued to progress as treatment continued, although none of the patients achieved complete repigmentation. Radakovic-Fijan et al. used dexamethasone minipulses of 10 mg daily on two consecutive days per week up to 24 weeks. Disease activity was arrested in 88% of 2 weeks of treatment. Side-effects (weight gain, insomnia, agitation, acne, patients with progressive disease after 18 Æ menstrual disturbances and hypertrichosis) were observed in 69% of patients. Overall, OMP with either betamethasone or dexamethasone can arrest, without inducing repigmentation, the progression of vitiligo. During fast-spreading vitiligo, phototherapy is usually commenced after this intervention. However, there are no randomized clinical trials (RCTs) confirming that either speed or magnitude of response to phototherapy and photochemotherapy might be potentiated by concomitant administration of OMP.

Would there be a fix to address this? Thanks.

JorjMcKie commented 5 months ago

This ironically happens because of the advanced line resynthesizing feature. Footnote references are smaller than the accompanying text and they also are elevated above the baseline coordinate of their respective line.

The upcoming change corrects this, so that your example will look like this:

The area of repigmentation continued to progress as treatment continued, although none of the
patients achieved complete repigmentation. Radakovic-Fijan et al. [77] used dexamethasone 
minipulses of 10 mg daily on two consecutive days per week up to 24 weeks.
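The effect described above can be reproduced with a toy model. The sketch below is not pymupdf4llm's actual implementation; the `Span` class, both function names, and the coordinates are made up for illustration. It assumes that a naive line builder groups spans by their bottom edge, so a small raised footnote reference ends up on a "line" of its own, while grouping by vertical overlap keeps it in place and lets it be rendered as [77].

```python
from dataclasses import dataclass

@dataclass
class Span:
    text: str
    x0: float    # left edge
    y0: float    # top edge (y grows downward)
    y1: float    # bottom edge
    size: float  # font size

def join_by_baseline(spans):
    """Naive grouping: spans share a line only if their bottom edges match."""
    lines = {}
    for s in spans:
        lines.setdefault(round(s.y1), []).append(s)
    return [
        " ".join(s.text for s in sorted(lines[y], key=lambda s: s.x0))
        for y in sorted(lines)
    ]

def join_by_overlap(spans):
    """Group spans whose vertical extents overlap; bracket small raised spans."""
    lines = []
    # Place tall (body text) spans first so they define each line's extent.
    for s in sorted(spans, key=lambda s: s.y0 - s.y1):
        for line in lines:
            top = min(t.y0 for t in line)
            bottom = max(t.y1 for t in line)
            if s.y0 < bottom and s.y1 > top:
                line.append(s)
                break
        else:
            lines.append([s])
    out = []
    for line in sorted(lines, key=lambda l: min(t.y0 for t in l)):
        body_size = max(t.size for t in line)
        bottom = max(t.y1 for t in line)
        parts = []
        for t in sorted(line, key=lambda t: t.x0):
            # Hypothetical rule: clearly smaller and ending above the line bottom.
            raised = t.size < 0.8 * body_size and t.y1 < bottom
            parts.append(f"[{t.text}]" if raised else t.text)
        out.append(" ".join(parts))
    return out

# Invented coordinates: two body lines and one raised footnote reference.
spans = [
    Span("patients achieved complete repigmentation.", 50, 100, 110, 10),
    Span("Radakovic-Fijan et al.", 50, 112, 122, 10),
    Span("77", 150, 111, 117, 6),
    Span("used dexamethasone", 162, 112, 122, 10),
]

print(join_by_baseline(spans))
print(join_by_overlap(spans))
```

With these invented coordinates, the baseline-based version emits "77" as a separate line ahead of the sentence it belongs to, which mirrors the misplaced "77" in the report. The overlap-based version keeps the reference after "et al.".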

madlogos commented 4 months ago

That's a great improvement. Just to double-check: would this fix also correct the sentence

Disease activity was arrested in 88% of **2 weeks of treatment. Side-effects (weight gain, insomnia, agitation, acne,** patients with progressive disease after 18 Æ menstrual disturbances and hypertrichosis) were observed in 69% of patients.

to

Disease activity was arrested in 88% of patients with progressive disease after 18Æ2 weeks of treatment. Side-effects (weight gain, insomnia, agitation, acne, menstrual disturbances and hypertrichosis) were observed in 69% of patients.

? The bold part was incorrectly inserted before "patients with progressive disease". I think this is more critical.

Many thanks.
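To check whether an extraction merely reordered text, rather than dropping or altering it, the two versions can be compared as bags of tokens. This is a minimal sketch using only the Python standard library; the helper names `tokens` and `only_reordered` are hypothetical and not part of PyMuPDF or pymupdf4llm. Digits and letters are tokenized separately so that "18Æ2" and the split-up "18 Æ" plus "2" still compare equal.

```python
import re
from collections import Counter

def tokens(text):
    # Runs of digits or runs of letters; punctuation and spacing are ignored.
    return re.findall(r"\d+|[^\W\d_]+", text)

def only_reordered(reference, extracted):
    # Same tokens with the same counts, but in a different order.
    ref, ext = tokens(reference), tokens(extracted)
    return Counter(ref) == Counter(ext) and ref != ext

expected = (
    "Disease activity was arrested in 88% of patients with progressive "
    "disease after 18Æ2 weeks of treatment. Side-effects (weight gain, "
    "insomnia, agitation, acne, menstrual disturbances and hypertrichosis) "
    "were observed in 69% of patients."
)
actual = (
    "Disease activity was arrested in 88% of 2 weeks of treatment. "
    "Side-effects (weight gain, insomnia, agitation, acne, patients with "
    "progressive disease after 18 Æ menstrual disturbances and "
    "hypertrichosis) were observed in 69% of patients."
)

print(only_reordered(expected, actual))
```

A check like this only says that the words were shuffled; it does not say where. For locating the moved fragment, a word-level diff (for example with `difflib`) would be the next step.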

JorjMcKie commented 4 months ago

This is the full outcome: pg_7.md

JorjMcKie commented 4 months ago

Solved in version 0.0.6.

madlogos commented 4 months ago

> This is the full outcome: pg_7.md

Thank you for fixing the bug!

However, I think the sentence order is still incorrect.

JorjMcKie commented 4 months ago

Thanks for reporting this! You are quite right. The problem is caused by using the text extraction flag TEXT_DEHYPHENATE. This confuses the logic that joins spans belonging to the same line. Everything (I hope) is fine if we give up automatic de-hyphenation. See this result: pg_7.md
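Without automatic de-hyphenation, words split across line ends (such as "contin-" / "ued" in the page.get_text() output above) stay split. If that matters downstream, one option is to rejoin them in a separate post-processing step. The following is only a sketch of such a step, not something pymupdf4llm does; the function name `dehyphenate` is hypothetical and the rule is a simple heuristic.

```python
import re

def dehyphenate(text: str) -> str:
    # Remove a hyphen at the end of a line when it sits between two ASCII
    # lowercase letters, and join the two word halves.
    # Caveat: this also merges genuine compounds that happen to break at
    # the hyphen (e.g. "fast-" / "spreading" becomes "fastspreading").
    return re.sub(r"(?<=[a-z])-[ \t]*\n[ \t]*(?=[a-z])", "", text)

sample = (
    "The area of repigmentation continued to progress as treatment contin-\n"
    "ued, although none of the patients achieved complete re-\n"
    "pigmentation. Radakovic-Fijan et al."
)

print(dehyphenate(sample))
```

Hyphens inside a line, as in "Radakovic-Fijan", are left untouched because the pattern requires a line break directly after the hyphen.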

pymupdf / RAG

Mistakes in orchestrating sentences #54