jzillmann / pdf-to-markdown

A PDF to Markdown converter
https://pdf2md.morethan.io
MIT License

Scrambles the contents of some outline/hierarchical PDF documents #77

Open steveisakson opened 2 months ago

steveisakson commented 2 months ago

Generate a PDF of this page: https://www.ecfr.gov/current/title-14/chapter-I/subchapter-G/part-139

Convert the PDF to Markdown (MD) with pdf-to-markdown.

Compare the PDF with the MD. In the MD, headings end up several lines before the paragraph text that directly follows them in the PDF. The differences are more pronounced toward the end of the document, so start there.
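For anyone who wants to quantify the displacement rather than eyeball it, here is a minimal sketch of one way to do it. It is not part of pdf-to-markdown. It assumes you already have the PDF's text as a list of lines in reading order (for example, copied from a PDF viewer), that headings in the MD use `#` syntax, and that lines match exactly after trimming whitespace. The function names `strip_heading` and `heading_displacements` are made up for illustration.

```python
import re
from typing import List, Optional, Tuple


def strip_heading(line: str) -> Optional[str]:
    """Return the heading text if `line` is a '#'-style Markdown heading, else None."""
    match = re.match(r"^#{1,6}\s+(.*\S)\s*$", line)
    return match.group(1) if match else None


def heading_displacements(
    reference_lines: List[str], markdown_lines: List[str]
) -> List[Tuple[str, Optional[int]]]:
    """For each Markdown heading, measure how far away its expected follower is.

    The "expected follower" is the non-blank line that comes right after the
    heading in the reference text. A gap of 1 means the order matches; a larger
    gap means other lines were wedged in between. None means the heading or its
    follower could not be located (exact matching is an assumption here).
    """
    ref = [l.strip() for l in reference_lines if l.strip()]
    md = [l.strip() for l in markdown_lines if l.strip()]
    # Same lines with heading markers removed, so they can be matched against ref.
    md_plain = [strip_heading(l) or l for l in md]

    report: List[Tuple[str, Optional[int]]] = []
    for i, line in enumerate(md):
        heading = strip_heading(line)
        if heading is None:
            continue
        if heading not in ref:
            report.append((heading, None))
            continue
        j = ref.index(heading)
        if j + 1 >= len(ref):
            continue  # heading is the last reference line; nothing follows it
        follower = ref[j + 1]
        try:
            k = md_plain.index(follower, i + 1)
        except ValueError:
            report.append((heading, None))
            continue
        report.append((heading, k - i))
    return report


if __name__ == "__main__":
    # Placeholder lines, not real eCFR content.
    reference = ["Heading A", "Text under A", "Heading B", "Text under B"]
    converted = ["## Heading A", "## Heading B", "", "Text under A", "Text under B"]
    for heading, gap in heading_displacements(reference, converted):
        print(heading, gap)
```

Any heading reported with a gap greater than 1 (or `None`) is a candidate for the scrambling described above. Repeated headings or lines that were split or merged during conversion would need fuzzier matching than this sketch does.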

I haven't examined the PDF's internals, so this might have more to do with the PDFs themselves or with how the doc-to-PDF conversion is configured on eCFR.gov. OTOH, the PDFs are generated automatically by a (presumably) commercial package, and eCFR has millions of users.

PS - It's not all bad. Your PDF parsing knocks the socks off a lot of other online tools. And the translation to MD is great — thanks!