transparentdemocracy / voting-data

Voting behavior data extracted from plenary reports of the Belgian federal government.

Added script to automate plenary reports #9

Closed karel1980 closed 2 months ago

karel1980 commented 2 months ago

Note: earlier experiments bumped into captchas while mass downloading from www.dekamer.be, but it seems to work for plenary reports.

sandervh14 commented 2 months ago

Nice contribution! Thanks!! :-)

karel1980 commented 2 months ago

I'll need some help with setting up the unit test correctly. I have a test for both the pdf and html parser running against ip298, but the assertions are different and I don't know which ones are correct. Is there a 1:1 relation between 'Motion' from model.py and 'Naamstemming: nnn' in the plenary report?

Unfortunately the html voting extractor only counts votes (based on Naamstemming: nnn sections at the end of the report). I don't yet know where or how to get the 'proposal' data for the Motion.

sandervh14 commented 2 months ago

I fetched the description of the motion from the Naamstemmingen part in the middle of the document, where the description and the vote counts appear; see https://github.com/transparentdemocracy/voting-data/blob/6c5f1ee6f30303dc87dcc935f6b33a296862b96d/src/voting_extractors.py#L50 and pages 38 and following of ip298.pdf. Or is your question less about where to find it and more "I think it's not in the HTML"? Maybe not all the info in the PDF is in its HTML counterpart; I should check.

I'll do that tomorrow evening, together with your unit test question.

As for the 1:1 relation: yes. In the medium term, Motion will need to be extended with the filename of the plenary report, the date of the plenary, or a link to a plenary class. Otherwise, after processing multiple plenaries, motions would get overwritten once this project submits the fetched data per report to the voting-service (work in progress in another repo) and, through it, to a database.

karel1980 commented 2 months ago

I'm assuming the HTML and the PDF contain the same information, but double-checking won't hurt. My question was mainly to make sure I understood the domain model completely, because the extraction results are wildly different (see the unit tests in my PR).

As for adding a plenary identifier: agreed. I'd prefer the number, not the date, though the date could be added as metadata.
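To illustrate why the plenary number matters, here is a minimal sketch, assuming motions are stored in a dict. The `Motion` fields and the `store_key` helper are hypothetical and not taken from model.py:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Motion:
    # Hypothetical fields, for illustration only.
    plenary_number: int  # e.g. 298 for ip298
    motion_number: int   # the nnn in "Naamstemming: nnn"
    description: str


def store_key(motion: Motion) -> tuple:
    """Key a motion by plenary number AND motion number."""
    return (motion.plenary_number, motion.motion_number)


# Two plenaries that both contain a motion numbered 1.
motions = [
    Motion(298, 1, "first motion of plenary 298"),
    Motion(299, 1, "first motion of plenary 299"),
]

# Keyed by motion number alone, the second motion overwrites the first.
by_number = {m.motion_number: m for m in motions}

# Keyed by (plenary number, motion number), both survive.
by_plenary_and_number = {store_key(m): m for m in motions}
```

The date could then live next to the key as plain metadata without taking part in identity.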

sandervh14 commented 2 months ago

I'll need to break my promise of working tonight, postponing to tomorrow. Sorry. Have a great evening!🙂

karel1980 commented 2 months ago

No worries, thanks for the heads up!

I've been making way too many changes on this branch, so we should probably go over it together anyway. You be the judge.

sandervh14 commented 2 months ago

Hey Karel,

I pulled your branch to be able to reply well.

Concerning the comparison of the PDF and HTML versions of the report: I compared the table of contents of ip298.pdf with the content of ip298x.html and can confirm their contents are identical. That's good news: we can store only the HTML and just link to the PDF on the dekamer.be server.

I had a look at the unit tests and reflected on the domain model.

First of all, it's good that you started an HTML processing class rather than continuing with the PDF one: getting rid of the PDF processing artefacts is easier than writing workarounds for them. Thanks!

Looking at the unit tests, I'd like us to land on a simple domain model. I'd like extract_all() to return a List[Plenary]. A Plenary can contain a list of motions, so we no longer need to return dicts mapping report names to lists of motions. It will also make it easier to serialize extracted plenaries to a simple, understandable JSON structure to feed a prototype application.

I think that with the Plenary class in place there will no longer be a need for the MotionId class, which makes it rather abstract to understand how the domain model classes are tied together.
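Something along these lines is what I have in mind. This is only a rough sketch: the field names and the `plenaries_to_json` helper are assumptions for illustration, not the current contents of model.py:

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Motion:
    # Hypothetical fields; the real Motion in model.py may differ.
    number: int
    description: str


@dataclass
class Plenary:
    # A plenary owns its motions, so no separate MotionId is needed.
    number: int
    motions: List[Motion] = field(default_factory=list)


def plenaries_to_json(plenaries: List[Plenary]) -> str:
    """Serialize plenaries (what extract_all() would return) to JSON."""
    # asdict() recurses into nested dataclasses and lists of them.
    return json.dumps([asdict(p) for p in plenaries], indent=2)


example = [Plenary(number=298, motions=[Motion(1, "example description")])]
example_json = plenaries_to_json(example)
```

Each plenary then becomes one JSON object with a nested `motions` array, which a prototype application can consume directly.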

I see your extract methods now return motions, not proposals. Probably related to your question of where I found the actual proposals and descriptions. No problem, you can finish your pull request without that, and I can write some code into your HTML extractor to extract the proposal descriptions, using the parts of the HTML extractor you've made already.

I like the added problem detection, like the vote count not matching the number of voters detected. Nice.
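For reference, the kind of check I mean looks roughly like this. It is a simplified sketch with made-up names (`VoteTally`, `find_count_mismatches`), not the code from the PR:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VoteTally:
    # Hypothetical structure: one vote type within one Naamstemming section.
    label: str               # e.g. "yes", "no", "abstention"
    declared_count: int      # the count stated in the report
    voter_names: List[str]   # the names actually parsed from the report


def find_count_mismatches(tallies: List[VoteTally]) -> List[str]:
    """Return one message per tally whose declared count differs from the parsed names."""
    problems = []
    for tally in tallies:
        parsed = len(tally.voter_names)
        if tally.declared_count != parsed:
            problems.append(
                f"{tally.label}: report declares {tally.declared_count} "
                f"votes but {parsed} names were parsed"
            )
    return problems
```

Collecting the problems in a list instead of raising on the first one lets a full run over all reports finish and still show which reports need a closer look.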

Apart from refactoring suggestions on mainly the domain model, I agree with your unit tests!

Let's park the investigation of the plenary reports that led to parsing problems (see your test_voting_extractors.py:54) for a follow-up GitHub issue and pull request.

Let me know how you want to proceed: can you finish your PR with these remarks? Would you like to finish it by pair programming? Or, since I remember you said you had a busy week, I could also merge your current changes and create a new PR of my own with the changes described in my comments above. :-)

karel1980 commented 2 months ago

All agreed. I was hesitant to introduce it in case we add other kinds of reports, but it's premature abstraction.

I'll introduce plenary and remove the PDF parsing bits.

I have an idea about parsing the rest of the html, but I may not be able to finish it before the weekend.

Ciao

sandervh14 commented 2 months ago

Good! Can you do the parsing of the rest of the HTML that you had in mind in a separate pull request? Then, once this one is finished, we can work in parallel again. :-)

karel1980 commented 2 months ago

Yes, I would like to merge this as soon as possible too. What would be needed to make this PR mergeable?

sandervh14 commented 2 months ago

Only as much as you want or can spend. :-) We could even merge now and I'll make the changes I had in mind, as I proposed in my last option. I just wanted to give you a choice. :-) Up to you!

karel1980 commented 2 months ago

In that case, please merge. IIRC the main flow wasn't changed yet so we can switch to html parsing whenever we feel it's ready.

sandervh14 commented 2 months ago

OK! Done. Thanks for your work!! :-)