Improve robustness of text extraction against unwanted text

transparentdemocracy / voting-data

Voting behavior data extracted from plenary reports of the Belgian federal government.


Improve robustness of text extraction against unwanted text #5

Closed by sandervh14 2 months ago

sandervh14 commented 2 months ago

For example:

Ignore headers like CRIV 55 PLEN 298 04/04/2024 CHAMBRE -6E SESSION DE LA 55E LÉGISLATURE 2023 2024 KAMER -6E ZITTING VAN DE 55E ZITTINGSPERIOD
Transform the vote extraction into a state machine? (E.g. expect "no votes" after "yes votes"; if this is not the case, fail.)
Do a general quality check against the input documents.
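A minimal sketch of the first two ideas combined, not the project's actual extraction code. It assumes the text has already been split into lines, that page headers start with `CRIV <n> PLEN <n>` as in the example above, and that each vote lists its sections under the markers `Ja`, `Nee` and `Onthoudingen`. Those markers, and the names `extract_vote`, `State` and `ExtractionError`, are hypothetical and chosen only for illustration.

```python
import re
from enum import Enum, auto

# Assumed shape of a running page header, based on the example in this issue.
HEADER_RE = re.compile(r"^CRIV \d+ PLEN \d+\b")

# Hypothetical section markers; the real reports may label these differently.
MARKERS = {"Ja": "yes", "Nee": "no", "Onthoudingen": "abstain"}


class State(Enum):
    EXPECT_YES = auto()
    IN_YES = auto()
    IN_NO = auto()
    IN_ABSTAIN = auto()


# For each state: the only section allowed next, and the state it leads to.
TRANSITIONS = {
    State.EXPECT_YES: ("yes", State.IN_YES),
    State.IN_YES: ("no", State.IN_NO),
    State.IN_NO: ("abstain", State.IN_ABSTAIN),
}


class ExtractionError(ValueError):
    """Raised when the input does not follow the expected section order."""


def extract_vote(lines):
    """Collect names per section, failing loudly on out-of-order sections."""
    state = State.EXPECT_YES
    votes = {"yes": [], "no": [], "abstain": []}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line or HEADER_RE.match(line):
            continue  # skip blank lines and page headers
        section = MARKERS.get(line)
        if section is not None:
            expected = TRANSITIONS.get(state)
            if expected is None or expected[0] != section:
                raise ExtractionError(
                    f"unexpected section {line!r} in state {state.name}"
                )
            state = expected[1]
            current = section
        elif current is None:
            raise ExtractionError(f"name {line!r} before any section marker")
        else:
            # Names are assumed to be comma-separated within a line.
            votes[current].extend(n.strip() for n in line.split(",") if n.strip())
    if state is not State.IN_ABSTAIN:
        raise ExtractionError(f"vote ended early in state {state.name}")
    return votes
```

Failing on an unexpected transition, rather than silently skipping it, also gives a cheap quality check: every report that raises is a candidate for a new test case.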

karel1980 commented 2 months ago

The plenary reports are available in both PDF and HTML. HTML might be easier to convert to text, without PDF typesetting artefacts.
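As a rough illustration of the HTML route, here is a sketch that uses only Python's standard-library `html.parser`. The set of block-level tags and the names `TextExtractor` and `html_to_text` are assumptions made for this example; the markup of the real reports may need different handling.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, inserting line breaks around block-level tags."""

    # Assumed set of tags that should start a new line in the output.
    BLOCK_TAGS = {"p", "div", "br", "tr", "li", "h1", "h2", "h3", "table"}
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self._parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self._parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self._parts.append(data)

    def text(self):
        # Collapse runs of whitespace within each line and drop empty lines.
        lines = (" ".join(l.split()) for l in "".join(self._parts).splitlines())
        return "\n".join(l for l in lines if l)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Because the output follows the document's logical structure rather than its page layout, running headers and footers from the PDF typesetting should not appear in it at all.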

sandervh14 commented 2 months ago

> The plenary reports are available in both PDF and HTML. HTML might be easier to convert to text, without PDF typesetting artefacts.

True, good spot! I noticed it too this past week, see https://github.com/transparentdemocracy/voting-data/issues/1. I didn't know about it when I started building the extraction.

I was thinking of continuing to build the back-end and front-end for a voting test prototype and only then coming back to make sure more and better votes would get extracted.

But we can discuss that, or contributions could allow fixing both at the same time. 🙂 It depends on which priorities people see. What do you think?

karel1980 commented 2 months ago

I see a lot of advantages in working with plain text:

PDF reading is super slow. You'd want an extensive test suite, and you don't want it to be slow.
Text files would make it easier to come up with tests for weird edge cases: editing text files is a lot easier than editing PDFs.

Overall, I would recommend converting to plain text ASAP in a separate command so you can make things nice and fast. I've made some progress on my HTML-based implementation to get the votes, and I'd like to keep working on it, but depending on priorities I can switch to different tasks.
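One possible shape for such a separate conversion step, as a sketch only: convert each source report to a `.txt` file once, and skip files whose text copy is already up to date. The function name `convert_all`, the directory layout and the `convert` callable (whatever PDF or HTML reader ends up being used) are all hypothetical.

```python
from pathlib import Path


def convert_all(source_dir, text_dir, convert, pattern="*.pdf"):
    """Convert every matching report once and return the files written.

    `convert` is any callable taking a Path and returning the extracted
    text; the slow reader is passed in so this step does not depend on it.
    """
    source_dir, text_dir = Path(source_dir), Path(text_dir)
    text_dir.mkdir(parents=True, exist_ok=True)
    converted = []
    for source in sorted(source_dir.glob(pattern)):
        target = text_dir / (source.stem + ".txt")
        # Skip reports whose text copy is at least as new as the source.
        if target.exists() and target.stat().st_mtime >= source.stat().st_mtime:
            continue
        target.write_text(convert(source), encoding="utf-8")
        converted.append(target)
    return converted
```

With this split, the extraction logic and its tests only ever read the plain-text files, so the slow conversion runs once per report instead of once per test run.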

sandervh14 commented 2 months ago

Perfectly fine! Thanks for the work!

sandervh14 commented 2 months ago

@karel1980 took care of this when he submitted PR #9, which is merged now.