transparentdemocracy / voting-data

Voting behavior data extracted from plenary reports of the Belgian federal government.

Improve robustness of text extraction against unwanted text #5

Closed: sandervh14 closed this issue 2 months ago

sandervh14 commented 2 months ago

For example:

karel1980 commented 2 months ago

The plenaries have both PDF and HTML. HTML might be easier to convert to text without PDF typesetting artefacts.

sandervh14 commented 2 months ago

> The plenaries have both PDF and HTML. HTML might be easier to convert to text without PDF typesetting artefacts.

True, good spot! I've noticed it too in the past week, see https://github.com/transparentdemocracy/voting-data/issues/1. I didn't know it when I started building the extraction.

I was thinking of continuing to build the back-end and front-end for a voting test prototype and only then coming back to make sure more and better votes would get extracted.

But we can discuss that, or contributions could allow fixing both at the same time. 🙂 It depends on which priorities people see. What do you think?

karel1980 commented 2 months ago

I see a lot of advantages in working with plain text:

Overall, I would recommend converting to plain text as soon as possible, in a separate command, so you can make things nice and fast. I've made some progress on my HTML-based implementation to get the votes and I'd like to keep working on it, but depending on priorities I can switch to different tasks.
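
As a rough illustration of an HTML-to-plain-text conversion run as its own command, here is a minimal sketch using only Python's standard library. The names `html_to_text` and `_TextExtractor`, the set of tags treated as line breaks, and the UTF-8 file encoding are all assumptions made for illustration; this is not the code from the repository or from any PR.

```python
import sys
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}
    # Hypothetical choice of tags that should start a new line in the output.
    BLOCK = {"p", "br", "div", "tr", "li", "h1", "h2", "h3", "table"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self._chunks.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace inside each line and drop empty lines.
        raw_lines = "".join(self._chunks).splitlines()
        lines = (" ".join(line.split()) for line in raw_lines)
        return "\n".join(line for line in lines if line)


def html_to_text(html):
    """Return the visible text of an HTML document, one block per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Used as a separate command: python html_to_text.py report.html report.txt
# (file names are placeholders; UTF-8 is an assumption about the input).
if __name__ == "__main__" and len(sys.argv) == 3:
    src, dst = Path(sys.argv[1]), Path(sys.argv[2])
    dst.write_text(html_to_text(src.read_text(encoding="utf-8")), encoding="utf-8")
```

Running the conversion once and storing the text files means later extraction steps only have to read plain text, which is what makes them quick to rerun.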

sandervh14 commented 2 months ago

Perfectly fine! Thanks for the work!

sandervh14 commented 2 months ago

@karel1980 took care of this when he submitted PR #9, which is merged now.