Poor PDF parsing results, especially in multi-column documents

Take this document as an example: https://www.stateninformatie.provincie-utrecht.nl/api/v1/meetings/8992/documents/23264

The text in this PDF is laid out in multiple columns.

Current Output

While the output might look acceptable, the parser has lost the order of the sentence fragments: each output line joins a line from the left column with a line from the right column, so reconstructing the actual paragraphs becomes difficult.

Ieder jaar laat Provincie Utrecht circa 10% van het    grens met Provincie Noord-Holland. Weer naar het zui-
landelijk gebied onderzoeken op flora en fauna. In        den vormt eerst de Angstel en later de A2 de grens tot

Desired Output

Ieder jaar laat Provincie Utrecht circa 10% van het
landelijk gebied onderzoeken op flora en fauna. In
...
grens met Provincie Noord-Holland. Weer naar het zui-
den vormt eerst de Angstel en later de A2 de grens tot
aan de bebouwde kom van Maarssen. Naar het oosten
toe vormt de bebouwde kom van Utrecht de grens tot
aan de Utrechtse Heuvelrug bij De Bilt en Bilthoven.

Suggested Solution

The pdftotext module accepts a 'raw' parameter that switches parsing from the default mode to raw mode, which emits text in the order in which it appears in the content stream.

https://github.com/openstate/open-raadsinformatie/blob/01a28593432d4038b9211877d54d75ef355d43de/ocd_backend/utils/file_parsing.py#L15

      for i, page in enumerate(pdftotext.PDF(f, raw=True), start=1):
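For context, a slightly fuller sketch of the same change. The function name `extract_pages` is hypothetical and is not taken from `file_parsing.py`; only `pdftotext.PDF` and its `raw` keyword come from the library. The import sits inside the function so that the snippet loads even where the optional dependency is not installed.

```python
def extract_pages(pdf_path, raw=True):
    """Return a list of (page_number, text) tuples for the PDF at pdf_path.

    With raw=True, pdftotext emits text in content-stream order instead
    of using its default mode.
    """
    import pdftotext  # third-party binding to poppler

    with open(pdf_path, "rb") as f:
        pdf = pdftotext.PDF(f, raw=raw)
        # Pages are numbered from 1, as in the loop shown above.
        return [(number, text) for number, text in enumerate(pdf, start=1)]
```

Whether raw mode yields the desired column order depends on how the PDF producer wrote the content stream, so it would be worth checking against a few multi-column documents before enabling it everywhere.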

It could be even better to convert all PDFs to XML, as that retains more structure and allows, for instance, the identification of headings, tables of contents, and tables.
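To illustrate what positional XML would make possible, here is a rough sketch that rebuilds column order from per-line coordinates. It assumes XML shaped like the output of poppler's `pdftohtml -xml` (`page` elements with a `width`, `text` elements with `top` and `left`). The function name `page_text_by_column`, the two-column midpoint heuristic, and the coordinates in `SAMPLE` are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: coordinates are invented, the text is from the example above.
SAMPLE = """<pdf2xml>
<page number="1" top="0" left="0" height="1262" width="892">
<text top="100" left="60" width="360" height="15" font="0">Ieder jaar laat Provincie Utrecht circa 10% van het</text>
<text top="100" left="470" width="360" height="15" font="0">grens met Provincie Noord-Holland. Weer naar het zui-</text>
<text top="118" left="60" width="360" height="15" font="0">landelijk gebied onderzoeken op flora en fauna. In</text>
<text top="118" left="470" width="360" height="15" font="0">den vormt eerst de Angstel en later de A2 de grens tot</text>
</page>
</pdf2xml>"""


def page_text_by_column(xml_string, page_number=1):
    """Return the lines of one page: left column first, then right column."""
    root = ET.fromstring(xml_string)
    page = root.find(f"./page[@number='{page_number}']")
    if page is None:
        return []
    middle = float(page.get("width")) / 2

    left, right = [], []
    for el in page.iter("text"):
        # itertext() also collects text nested in inline tags such as <b>.
        line = "".join(el.itertext()).strip()
        if not line:
            continue
        entry = (float(el.get("top")), line)
        # Assumption: a line belongs to the column in which it starts.
        if float(el.get("left")) < middle:
            left.append(entry)
        else:
            right.append(entry)

    # Read each column top to bottom, then concatenate the columns.
    by_top = lambda entry: entry[0]
    ordered = sorted(left, key=by_top) + sorted(right, key=by_top)
    return [line for _, line in ordered]


for line in page_text_by_column(SAMPLE):
    print(line)
```

A fixed midpoint only handles simple two-column pages; full-width headings, tables, or three-column layouts would need a proper clustering of the `left` values.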
