Open · tsaltena opened this issue 2 years ago
Thanks for pointing this out, and taking the time to write up a proposal, @tsaltena! Much appreciated.
Seems like a simple fix. I do need to check that this doesn't break existing pages.
@robvandijk this might be a good issue to look at once you are able to run imports; it would be a relevant piece of functionality to improve!
Let's take as an example: https://www.stateninformatie.provincie-utrecht.nl/api/v1/meetings/8992/documents/23264
In the PDF, we find a multi-column style like this:
Current Output
Whilst it might appear okay, the parser has lost the information relating to the order of these sentence parts, so reconstructing the actual paragraphs becomes difficult.
Desired Output
Suggested Solution
The pdftotext module accepts a 'raw' parameter that switches parsing from the default mode to 'raw' mode, which keeps the text in the order in which it appears in the PDF content stream.
https://github.com/openstate/open-raadsinformatie/blob/01a28593432d4038b9211877d54d75ef355d43de/ocd_backend/utils/file_parsing.py#L15
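A minimal sketch of what the change could look like, assuming the parser is built on the Python `pdftotext` package and its `PDF(file, raw=...)` constructor. The function name `extract_pages` and the `pdf_class` argument are hypothetical and only exist to make the sketch easy to try out; this is not the code in `file_parsing.py`.

```python
def extract_pages(path, raw=True, pdf_class=None):
    """Return the text of each page of the PDF at `path` as a list of strings.

    With raw=True the text is requested in content-stream order instead of
    the default layout-based order.
    """
    if pdf_class is None:
        # Imported lazily so the sketch can be exercised without poppler
        # installed. `pdftotext` is a third-party package.
        import pdftotext
        pdf_class = pdftotext.PDF

    with open(path, "rb") as f:
        # Assumption: the constructor accepts a `raw` keyword argument.
        pdf = pdf_class(f, raw=raw)
        return [page for page in pdf]
```

Keeping `raw` as an argument rather than hard-coding it would make it possible to compare both modes on existing documents before switching the default.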
It could be even better to parse all PDFs to XML, as that retains more of the document structure and would, for instance, allow headings, tables of contents and tables to be identified.
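To illustrate the XML idea, here is a rough sketch that picks out likely headings by font size. It assumes XML in the style produced by poppler's `pdftohtml -xml`, with `<fontspec id=... size=...>` and `<text font=...>` elements; the function name `find_headings` and the "larger than the most common font size" rule are assumptions for illustration, not an existing part of the codebase.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def find_headings(xml_text):
    """Return text fragments set in a font larger than the body font."""
    root = ET.fromstring(xml_text)

    # Map each font id to its size, as declared by <fontspec> elements.
    sizes = {}
    for spec in root.iter("fontspec"):
        sizes[spec.get("id")] = int(spec.get("size"))

    # Collect (font size, text) for every non-empty <text> element.
    lines = []
    for text in root.iter("text"):
        content = "".join(text.itertext()).strip()
        if content:
            lines.append((sizes.get(text.get("font"), 0), content))

    if not lines:
        return []

    # Heuristic: treat the most frequent font size as body text.
    body_size = Counter(size for size, _ in lines).most_common(1)[0][0]
    return [content for size, content in lines if size > body_size]
```

The same `<text>` elements also carry `top` and `left` coordinates, which could be used to group fragments into columns before joining them into paragraphs.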