Not all input reports & motions (votings) are fetched currently

sandervh14 commented 2 months ago

Find out why some are skipped / not properly processed.

karel1980 commented 2 months ago

It's not perfect, but none of the files are currently failing. Are there any specific problems that we should address in scope of this issue?

sandervh14 commented 1 month ago

I created this issue when the HTML extractor didn't exist yet. As we both know, that one works better, because it produces fewer PDF processing artefacts.

I'll have a look in the coming days to see whether there are still unexpected results.

Note to self: check the following:

- Warnings on the terminal during extraction, like the following: WARNING:root:vote count (12) does not match voters ['Bury Katleen', 'Creyelman Steven', 'De Spiegeleer Pieter', 'Depoortere Ortwin', 'Dewulf Nathalie', 'Dillen Marijke', 'Gilissen Erik', 'Pas Barbara', 'Ponthier Annick', 'Ravyts Kurt', 'Samyn Ellen', 'Sneppe Dominiek', 'Troosters Frank', 'Van Grieken Tom', 'Van Langenhove Dries', 'Van Lommel Reccino', 'Vermeersch Wouter', 'Verreyt Hans']
- Was any logging.warning("Failed to process %s") logged?
- Do some spot checks: extracted info versus existing documents.
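
For reference, the warning above lists 18 names against a declared count of 12. A minimal sketch of the kind of consistency check that could produce such a message is shown below. The function name `check_vote_count` and its parameters are hypothetical and are not taken from the repository; only the message format is copied from the warning quoted above.

```python
import logging


def check_vote_count(vote_count, voters):
    """Hypothetical helper: compare a declared vote count with the extracted names.

    Returns True when they agree. Otherwise logs a warning in the same
    format as the one seen on the terminal and returns False.
    """
    if vote_count != len(voters):
        logging.warning("vote count (%d) does not match voters %s", vote_count, voters)
        return False
    return True


# Example: a declared count of 2 with three extracted names triggers the warning.
consistent = check_vote_count(2, ["Pas Barbara", "Samyn Ellen", "Verreyt Hans"])
```

Returning a boolean as well as logging would make it possible to count mismatches per report during a spot check, instead of only reading them off the terminal.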

sandervh14 commented 1 month ago

I'll close this issue. We have identified, and will keep identifying, reports that weren't processed the way we expected, and we have turned them (or will turn them) into additional unit tests. See test_extraction.py. So this is ongoing, but we don't need a separate issue for it anymore. We're on it as part of our other work.

transparentdemocracy / voting-data

Not all input reports & motions (votings) are fetched currently #4