opengovsg / pdf2md

A PDF to Markdown converter
https://www.npmjs.com/package/@opendocsg/pdf2md
MIT License
193 stars, 39 forks

Two column layout not parsed correctly #91

Open irian-codes opened 1 month ago

irian-codes commented 1 month ago

Describe the bug: When testing a PDF with a two-column layout, the text ends up misplaced in the final Markdown output.

To Reproduce: Parse the Root board game manual PDF or the Aliens AGDITC board game manual PDF.

Expected behavior: It should detect the columns and join their text in the correct order in the final output, with no stray characters in between. Right now, in either of the example PDFs, the end of a left column is not followed by its continuation at the top of the right column.

For example, in the Root Manual:

Example in the Aliens manual:


irian-codes commented 1 month ago

Okay, it seems the columns are switched: it parses the right column first and then the left one, which is incorrect for LTR documents.
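
To make the expected order concrete, here is a minimal sketch of the reading order I would expect for an LTR two-column page. It is not pdf2md's actual code: the `TextItem` shape, the `orderTwoColumns` helper, and the idea of splitting the columns at the page midpoint are all assumptions made for illustration, and `y` is assumed to be measured from the top of the page.

```typescript
// Hypothetical shape for a positioned text fragment; not pdf2md's internal type.
interface TextItem {
  text: string;
  x: number; // left edge of the fragment
  y: number; // distance from the top of the page (larger means lower)
}

// Order fragments for an LTR two-column page:
// the whole left column top to bottom, then the whole right column.
function orderTwoColumns(items: TextItem[], pageWidth: number): TextItem[] {
  const gutter = pageWidth / 2; // assumption: the columns split at the page midpoint
  const column = (item: TextItem): number => (item.x < gutter ? 0 : 1);
  // Sort a copy by column first, then top to bottom, then left to right.
  return [...items].sort(
    (a, b) => column(a) - column(b) || a.y - b.y || a.x - b.x
  );
}

// Fragments deliberately listed out of order, right column first.
const page: TextItem[] = [
  { text: "right-1", x: 320, y: 50 },
  { text: "left-2", x: 40, y: 80 },
  { text: "right-2", x: 320, y: 80 },
  { text: "left-1", x: 40, y: 50 },
];

const ordered = orderTwoColumns(page, 600).map((item) => item.text);
console.log(ordered.join(" ")); // left-1 left-2 right-1 right-2
```

A fixed midpoint would not work for pages with uneven columns or full-width headings, so a real fix would probably need to detect the gutter from the text positions instead.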