transparentdemocracy / voting-data

Voting behavior data extracted from plenary reports of the Belgian federal government.

Motion descriptions: Many proposal descriptions may no longer be extracted correctly (regression?) #25

Closed sandervh14 closed 1 month ago

sandervh14 commented 1 month ago

Many proposal descriptions no longer look high-quality. This may be a regression I introduced this evening, or it may have been there for a couple of days already.

This impacts what the website team sees in their prototype: the prototype is only as clear as the descriptions of the proposals voted on.

Example:

{
                "id": "55_202_6",
                "number": 6,
                "plenary_id": "55_202",
                "description": "\n(Stemming/vote 5)\n\n(Stemming/vote 5)\n(Stemming/vote 5)\n(Stemming/vote 5)\n\n\n\nJa\n\n\n37\n\n\nOui\n\n\n\nJa\n\nJa\nJa\n\n\n37\n\n37\n37\n\n\nOui\n\nOui\nOui\n\n\n\nNee\n\n\n91\n\n\nNon\n\n\n\nNee\n\nNee\nNee\n\n\n91\n\n91\n91\n\n\nNon\n\nNon\nNon\n\n\n\nOnthoudingen\n\n\n1\n\n\nAbstentions\n\n\n\nOnthoudingen\n\nOnthoudingen\nOnthoudingen\n\n\n1\n\n1\n1\n\n\nAbstentions\n\nAbstentions\nAbstentions\n\n\n\nTotaal\n\n\n129\n\n\nTotal\n\n\n\nTotaal\n\nTotaal\nTotaal\n\n\n129\n\n129\n129\n\n\nTotal\n\nTotal\nTotal\n\n\u00a0\n\u00a0\n\nEn cons\u00e9quence, l'amendement est rejet\u00e9 et\nl'article 38 est adopt\u00e9.\nEn cons\u00e9quence, l'amendement est rejet\u00e9 et\nl'article 38 est adopt\u00e9.\nBijgevolg is het amendement verworpen en is\nartikel 38 aangenomen.\nBijgevolg is het amendement verworpen en is\nartikel 38 aangenomen.\n\u00a0\n\u00a0\n\n38 Ensemble du projet de loi\nportant dispositions diverses en mati\u00e8re d'\u00e9conomie (nouvel intitul\u00e9)\u00a0 (2742/7)\n\n38 Ensemble du projet de loi\nportant dispositions diverses en mati\u00e8re d'\u00e9conomie (nouvel intitul\u00e9)\u00a0 (2742/7)\n38 Ensemble du projet de loi\nportant dispositions diverses en mati\u00e8re d'\u00e9conomie (nouvel intitul\u00e9)\u00a0 (2742/7)\n38 Ensemble du projet de loi\nportant dispositions diverses en mati\u00e8re d'\u00e9conomie (nouvel intitul\u00e9)\u00a0 (2742/7)\n\u00a0 \n38 Geheel van het wetsontwerp\nhoudende diverse bepalingen inzake economie (nieuw opschrift)\u00a0 (2742/7)\n\n38 Geheel van het wetsontwerp\nhoudende diverse bepalingen inzake economie (nieuw opschrift)\u00a0 (2742/7)\n38 Geheel van het wetsontwerp\nhoudende diverse bepalingen inzake economie (nieuw opschrift)\u00a0 (2742/7)\n38 Geheel van het wetsontwerp\nhoudende diverse bepalingen inzake economie (nieuw opschrift)\u00a0 (2742/7)\n\u00a0 \n\u00a0\n\u00a0\n\nQuelqu'un demande-t-il la parole pour une\nd\u00e9claration avant le vote? 
(Non)\nQuelqu'un demande-t-il la parole pour une\nd\u00e9claration avant le vote? \n \n(Non)\n(Non)\n\n\nVraagt iemand het woord voor een\nstemverklaring? (Nee)\nVraagt iemand het woord voor een\nstemverklaring? (Nee)\n (Nee)\n\u00a0\n\u00a0\n\nBegin van de\nstemming / D\u00e9but du vote.\nBegin van de\nstemming / D\u00e9but du vote.\n\nHeeft\niedereen gestemd en zijn stem nagekeken? / Tout le monde a-t-il vot\u00e9 et v\u00e9rifi\u00e9\nson vote?\nHeeft\niedereen gestemd en zijn stem nagekeken? / Tout le monde a-t-il vot\u00e9 et v\u00e9rifi\u00e9\nson vote?\n\nEinde van de stemming\n/ Fin du vote.\nEinde van de stemming\n/ Fin du vote.\n\nUitslag van de\nstemming / R\u00e9sultat du vote.\nUitslag van de\nstemming / R\u00e9sultat du vote.\n\n\u00a0\n\u00a0\n"
            },
karel1980 commented 1 month ago

Not sure this was ever very good. Fine-tuning it is definitely still high on the list.

sandervh14 commented 1 month ago

Maybe, I wasn't sure. :D Good we are aligned. :-)

I'll take this one up first.

sandervh14 commented 1 month ago

+ request from Guido: separate NL and FR

karel1980 commented 1 month ago

Completely separating them doesn't make sense because text isn't always available in both languages. It may be possible to derive a lot of information from the HTML structure, but I don't want to become too dependent on that.

I'd keep the HTML elements while extracting. This way we keep the classes or lang attributes that give information about the language.
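A minimal sketch of that idea, using only Python's standard library: instead of flattening the report to plain text, track the nearest enclosing `lang` attribute and bucket each text fragment under it. The names `LangTextCollector` and `split_by_language` are hypothetical and not part of the repository. Text without any `lang` ancestor ends up under the key `None`, so nothing is lost when only one language is present. The sketch assumes reasonably well-formed HTML; unclosed tags would skew the language stack.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag, so they must not touch the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class LangTextCollector(HTMLParser):
    """Buckets text fragments by the nearest enclosing lang attribute."""

    def __init__(self):
        super().__init__()
        self.lang_stack = []  # effective language of each open element
        self.texts = {}       # language code (or None) -> text fragments

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        inherited = self.lang_stack[-1] if self.lang_stack else None
        lang = dict(attrs).get("lang")
        self.lang_stack.append(lang.upper() if lang else inherited)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.lang_stack:
            self.lang_stack.pop()

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace and newlines
        if text:
            lang = self.lang_stack[-1] if self.lang_stack else None
            self.texts.setdefault(lang, []).append(text)


def split_by_language(html):
    """Returns {language code or None: concatenated text}."""
    collector = LangTextCollector()
    collector.feed(html)
    collector.close()
    return {lang: " ".join(parts) for lang, parts in collector.texts.items()}
```

Because the language is only a label on each fragment, a consumer can still show FR and NL separately where both exist, and fall back to whatever is available otherwise.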

sandervh14 commented 1 month ago

OK. I'll see what I can do about the NL/FR thing. I'll keep your thoughts above in mind.

sandervh14 commented 1 month ago

Preparation for the rework needed to detect the correct proposal descriptions and link them to the voted motions.

Investigated on ip298x.html.

Motions

Motions follow after two h1 tags, the first h1 tag containing "Naamstemmingen", the second h1 tag containing "Votes nominatifs":

image

Consider the following example:

image

Each motion can be recognized as two consecutive h2 tags: the first always contains the title of the motion in French, within the span[lang="FR"] tag, and the second always contains the title in Dutch, within the span[lang="NL"] tag.
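The structure described so far (two h1 tags opening the section, then FR/NL h2 pairs) could be detected roughly as follows. This is a sketch under assumptions: it uses the standard-library `HTMLParser` rather than whatever the repository uses, `HeadingCollector` and `extract_motion_titles` are hypothetical names, and it takes the first `span[lang]` inside a heading as that heading's language.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Records h1/h2 headings in document order with their language."""

    def __init__(self):
        super().__init__()
        self.headings = []   # dicts: {"tag": ..., "lang": ..., "text": ...}
        self.current = None  # heading currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2"):
            self.current = {"tag": tag, "lang": None, "text": []}
        elif tag == "span" and self.current is not None:
            lang = dict(attrs).get("lang")
            if lang and self.current["lang"] is None:
                self.current["lang"] = lang.upper()

    def handle_data(self, data):
        if self.current is not None:
            self.current["text"].append(data)

    def handle_endtag(self, tag):
        if self.current is not None and tag == self.current["tag"]:
            joined = "".join(self.current["text"])
            self.current["text"] = " ".join(joined.split())
            self.headings.append(self.current)
            self.current = None


def extract_motion_titles(html):
    """Returns [{"title_fr": ..., "title_nl": ...}] for the motions section."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    headings = collector.headings

    # Locate the "Naamstemmingen" h1 directly followed by "Votes nominatifs".
    start = None
    for i in range(len(headings) - 1):
        first, second = headings[i], headings[i + 1]
        if (first["tag"] == "h1" and second["tag"] == "h1"
                and "Naamstemmingen" in first["text"]
                and "Votes nominatifs" in second["text"]):
            start = i + 2
            break
    if start is None:
        return []

    # Pair consecutive h2 headings (FR first, NL second) until the next h1.
    motions = []
    i = start
    while i < len(headings) - 1 and headings[i]["tag"] == "h2":
        fr, nl = headings[i], headings[i + 1]
        if nl["tag"] == "h2" and fr["lang"] == "FR" and nl["lang"] == "NL":
            motions.append({"title_fr": fr["text"], "title_nl": nl["text"]})
            i += 2
        else:
            i += 1  # skip a heading that does not fit the FR/NL pattern
    return motions
```

Unpaired or mislabelled headings are skipped rather than raising, which fits reports where the pattern is not followed consistently.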

Note: Contrary to what I first thought, the number in a box is not a reference to find the corresponding proposal text. It is just a sequence number indicating a new motion to be voted on.

At the end of the motion title, a reference in parentheses follows, pointing to the proposal document. For the above screenshot, that is 3849/4.
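Pulling that trailing reference out of a title can be sketched with a regular expression. This assumes the reference always has the shape `(dossier/document)` with plain digits, as in 3849/4 and 2742/7; other shapes, if they occur in the reports, are not handled. `parse_document_reference` is a hypothetical helper name.

```python
import re

# A trailing "(digits/digits)" such as "(3849/4)", allowing trailing whitespace.
REFERENCE_AT_END = re.compile(r"\((\d+)/(\d+)\)\s*$")


def parse_document_reference(title):
    """Returns (dossier_number, document_number), or None if absent."""
    # Titles in the reports contain non-breaking spaces; treat them as spaces.
    match = REFERENCE_AT_END.search(title.replace("\u00a0", " "))
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

For the title in the example at the top of this issue, `parse_document_reference("38 Ensemble du projet de loi ... (nouvel intitulé)\u00a0 (2742/7)")` returns `(2742, 7)`; the earlier parenthesized "(nouvel intitulé)" is ignored because it is neither numeric nor at the end.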

Proposals

The proposal document reference can be used to find the discussion of the proposal in the plenary report itself, a first indication of what the proposal is about. It can be found after the sequence of the 2 h1 tags with text "Wetsontwerpen" and "Projets de loi" consecutively.

Each combination of 2 consecutive h2 tags that follows in this section, indicates a proposal that is discussed in the parliament. The reference at the end of the h2 tag can be used to link the proposal discussion and the motion.

All text that comes after the h2 tags of a proposal and before the h2 tags of the next proposal, and that follows the text "Discussion des articles" and "Bespreking van de artikelen", can be considered the description of the proposal as discussed in the plenary:

image
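The rule above can be sketched on a simplified model of the page: a flat list of `(tag, text)` tuples in document order. Both the model and the helper names (`find_next_until`, `proposal_description`) are assumptions for illustration, not the code in the repository.

```python
ARTICLE_MARKERS = ("Discussion des articles", "Bespreking van de artikelen")


def find_next_until(elements, start, stop_tags):
    """Elements after index `start`, up to the next element in stop_tags."""
    collected = []
    for tag, text in elements[start + 1:]:
        if tag in stop_tags:
            break
        collected.append((tag, text))
    return collected


def proposal_description(elements, last_title_index):
    """Text following the article-discussion markers of one proposal.

    `last_title_index` is the index of the second (Dutch) h2 of the proposal.
    Returns an empty list when no marker is found.
    """
    body = find_next_until(elements, last_title_index, stop_tags={"h1", "h2"})
    marker_positions = [
        i for i, (_, text) in enumerate(body)
        if any(marker in text for marker in ARTICLE_MARKERS)
    ]
    if not marker_positions:
        return []
    # Keep everything after the last marker (the FR and NL markers are adjacent).
    return [text for _, text in body[marker_positions[-1] + 1:]]
```

Returning an empty list instead of raising when a marker is missing keeps one odd report from stopping a full run.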

Additionally, we can opt to store the discussion of the proposal in the plenary. This is all text between "Discussion générale" / "Algemene bespreking" and "Discussion des articles" / "Bespreking van de artikelen":

image
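A sketch of that slicing, assuming the paragraphs of one proposal are already available as a list of strings in document order (`general_discussion` is a hypothetical name):

```python
GENERAL_MARKERS = ("Discussion générale", "Algemene bespreking")
ARTICLE_MARKERS = ("Discussion des articles", "Bespreking van de artikelen")


def general_discussion(paragraphs):
    """Paragraphs between the general-discussion and article-discussion markers."""
    def contains_any(text, markers):
        return any(marker in text for marker in markers)

    start = end = None
    for i, text in enumerate(paragraphs):
        if contains_any(text, GENERAL_MARKERS):
            start = i  # the FR and NL markers are adjacent; keep the last one
        elif contains_any(text, ARTICLE_MARKERS) and start is not None:
            end = i
            break
    if start is None or end is None:
        return []  # best effort: no discussion found for this proposal
    return paragraphs[start + 1:end]
```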

Additionally, the formal documents explaining the proposal in full detail can be fetched using this reference, from this page: https://www.dekamer.be/kvvcr/showpage.cfm?section=/flwb&language=nl&cfm=ListDocument.cfm. For example, the 3849 reference above points to https://www.dekamer.be/kvvcr/showpage.cfm?section=/flwb&language=nl&cfm=/site/wwwcfm/flwb/flwbn.cfm?legislat=55&dossierID=3849, which is the following:

image
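Building that dossier URL from a reference could look like this. The URL template is copied from the example above; that the same pattern holds for every dossier number and legislature is an assumption that still needs checking, and `dossier_url` is a hypothetical helper.

```python
DOSSIER_URL = (
    "https://www.dekamer.be/kvvcr/showpage.cfm?section=/flwb&language=nl"
    "&cfm=/site/wwwcfm/flwb/flwbn.cfm?legislat={legislature}&dossierID={dossier}"
)


def dossier_url(reference, legislature=55):
    """Builds the dossier page URL for a reference such as "3849/4"."""
    # Only the part before the slash identifies the dossier.
    dossier = reference.strip("() ").split("/")[0]
    if not dossier.isdigit():
        raise ValueError(f"Unexpected document reference: {reference!r}")
    return DOSSIER_URL.format(legislature=legislature, dossier=dossier)
```

`dossier_url("3849/4")` reproduces the example URL above for legislature 55.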

Amendments

The above page points to the document with the initial proposal, but also to the formulated amendments. The example reference 3849/4 from before points to the 004 amendment (see the bottom of the above screenshot).

Note that from the proposal & amendments page, we can also extract interesting metadata, such as which politicians authored these proposals and amendments. This way, we can also investigate which politicians are very active proposal writers.

More on amendments:

image

Every motion is a vote on one proposal. Either just one vote on the proposal, or multiple votes on amendments, such as here:

image

Amendments are announced with the words "Vote sur l'amendement n". Again, for each amendment, at the end of the "Vote sur l'amendement..." sentence, a reference is mentioned to the proposal text, which contains the proposed amendment.
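Detecting these announcements could be sketched as below. The exact characters that follow "n" in the reports are not shown in this issue, so the pattern tolerates up to three non-digit characters before the amendment number; that, and the assumption that the sentence and its reference sit on one line after whitespace normalization, are guesses to verify against real reports. `find_amendment_votes` is a hypothetical name.

```python
import re

# "Vote sur l'amendement n<marker> <number> ... (<dossier>/<document>)"
# Accepts both the straight and the typographic apostrophe.
AMENDMENT_VOTE = re.compile(
    r"Vote sur l['\u2019]amendement n\D{0,3}(\d+).*?\((\d+)/(\d+)\)"
)


def find_amendment_votes(text):
    """Returns one dict per announced amendment vote found in the text."""
    return [
        {"amendment": int(number), "dossier": int(dossier), "document": int(doc)}
        for number, dossier, doc in AMENDMENT_VOTE.findall(text)
    ]
```

Since `.` does not match newlines here, an announcement without a reference on the same line is skipped instead of borrowing the reference of a later one.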

sandervh14 commented 1 month ago

Nearly finished. See https://github.com/transparentdemocracy/voting-data/tree/feature/issue-%2325-proposal-descriptions. I still need to get my unit test fully working and run it on the full set of plenary reports. But you can already have a look and apply the same thinking, and the find_next_siblings_before_tag() function, to the extraction of other info in other issues. (@karel1980 I think you said you were interested in such a find_next_siblings_before_tag() function.)

sandervh14 commented 1 month ago

Working unit test. Updated code: https://github.com/transparentdemocracy/voting-data/tree/feature/issue-%2325-proposal-descriptions I'll now test against all plenary reports and regenerate the full plenaries.json.

karel1980 commented 1 month ago

Thanks for your analysis. I've started work on an implementation based on many of these ideas.

The inconsistencies are still a PITA to deal with :)

e.g. ip125x.html

karel1980 commented 1 month ago

In ip162x.html there is only one vote, announced by "Naamstemming over de ordemotie". Etc.

It's going to be a matter of having a parsing strategy that makes 'best effort' attempts at finding structure, using rules that we can extend and improve as we go.

sandervh14 commented 1 month ago

Yes. Best effort. I'd consider it a success if our code can process all plenary reports without runtime errors, even though it doesn't extract the info from some of them correctly. If it works "rather decently" for the reports of the last months I think we already have a good start... We can make it better over time.
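One way to structure such a best-effort approach is sketched below: an ordered list of rule functions is tried per report, exceptions are recorded instead of propagated, and a full run reports which files no rule could handle. The report representation and the function names are hypothetical.

```python
def extract_best_effort(report, rules):
    """Tries each rule in order; returns (first non-empty result, errors)."""
    errors = []
    for rule in rules:
        try:
            result = rule(report)
        except Exception as exc:  # deliberately broad: one report must not stop the run
            errors.append(f"{rule.__name__}: {exc}")
            continue
        if result:
            return result, errors
    return None, errors


def process_all(reports, rules):
    """Processes every report and separates successes from failures."""
    results, failures = {}, {}
    for name, report in reports.items():
        result, errors = extract_best_effort(report, rules)
        if result is None:
            failures[name] = errors
        else:
            results[name] = result
    return results, failures
```

New special cases, such as a vote announced by "Naamstemming over de ordemotie", then become one more rule in the list, and the `failures` mapping shows which reports to look at next.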

sandervh14 commented 1 month ago

Updated code & updated plenaries json & markdown in main branch ( @Karel Vervaeke this is a merge of our merge branch back to main). Shows the website team our updated model:

image

To display "proposal discussions" in the plenaries, use the title of the first proposal of every proposal discussion. That's the main proposal to be discussed. I still need to improve my proposal processing: only 10% of the plenary reports are processed successfully. There's one error that occurs often, so I should be able to improve this considerably in the coming days.

sandervh14 commented 1 month ago

This is done by now, in main branch. 90% of plenary reports are processed without running into exceptions on proposal discussion parsing. The remaining 10% are special cases to look at later.

sandervh14 commented 1 month ago

(Motion extraction is still underway, but that was not the subject of this issue. See #22.)