Closed sandervh14 closed 1 month ago
Not sure if this was ever very good, finetuning this is definitely still high on the list
Maybe, I wasn't sure. :D Good we are aligned. :-)
I'll take this one up as first thing.
+ request from Guido: separate NL and FR
Completely separating them doesn't make sense because text isn't always available in both languages. It may be possible to derive a lot of information from the HTML structure, but I don't want to become too dependent on that.
I'd keep the HTML elements while extracting. This way we keep the classes or lang attributes that give information about the language.
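A minimal sketch of what keeping the lang attributes could buy us, using only the standard library. The class name `LangTextCollector` is hypothetical, and it assumes well-nested HTML where the language is set through a `lang` attribute on some enclosing tag. Text without any enclosing `lang` attribute ends up under `None` instead of being dropped, which fits the case where only one language is available.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not touch the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class LangTextCollector(HTMLParser):
    """Groups text fragments by the lang attribute of the nearest enclosing tag."""

    def __init__(self):
        super().__init__()
        self.lang_stack = []  # language in effect inside each open tag
        self.texts = {}       # language (or None) -> list of text fragments

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        lang = dict(attrs).get("lang")
        inherited = self.lang_stack[-1] if self.lang_stack else None
        # A tag without its own lang attribute inherits from its parent.
        self.lang_stack.append(lang.upper() if lang else inherited)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.lang_stack:
            self.lang_stack.pop()

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            lang = self.lang_stack[-1] if self.lang_stack else None
            self.texts.setdefault(lang, []).append(text)


collector = LangTextCollector()
collector.feed('<p><span lang="FR">Bonjour</span> <span lang="NL">Goedendag</span></p>')
```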
OK. I'll see what I can do about the NL/FR thing. I'll keep your thoughts above in mind.
Preparation for the rework needed to detect the correct proposal descriptions, linked to the voted motions.
Investigated on ip298x.html.
Motions follow after two h1 tags, the first h1 tag containing "Naamstemmingen", the second h1 tag containing "Votes nominatifs":
Consider the following example:
Each motion can be recognized as two consecutive h2 tags, the first h2 tag always containing the title of the motion in French, within the span[lang="FR"] tag, the second h2 tag always containing the title of the motion in Dutch, within the span[lang="NL"] tag.
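The recognition rule above could be sketched as follows. This is not the code from the branch: `HeadingCollector` and `extract_motion_titles` are hypothetical names, it uses only the standard library's `html.parser`, and it assumes the first `span` with a `lang` attribute inside an h2 tells us the language of the whole title.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Records every h1/h2 as (tag, lang, text), in document order."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._current = None  # [tag, lang, text parts] while inside an h1/h2

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2"):
            self._current = [tag, None, []]
        elif tag == "span" and self._current is not None:
            lang = dict(attrs).get("lang")
            if lang and self._current[1] is None:
                self._current[1] = lang.upper()

    def handle_data(self, data):
        if self._current is not None:
            self._current[2].append(data)

    def handle_endtag(self, tag):
        if self._current is not None and tag == self._current[0]:
            name, lang, parts = self._current
            self.headings.append((name, lang, " ".join("".join(parts).split())))
            self._current = None


def extract_motion_titles(html):
    """Returns [{'fr': ..., 'nl': ...}] for the h2 pairs in the votes section."""
    collector = HeadingCollector()
    collector.feed(html)
    headings = collector.headings

    # The votes section starts after h1 "Naamstemmingen" + h1 "Votes nominatifs".
    start = None
    for i in range(len(headings) - 1):
        first, second = headings[i], headings[i + 1]
        if (first[0] == second[0] == "h1"
                and "Naamstemmingen" in first[2]
                and "Votes nominatifs" in second[2]):
            start = i + 2
            break
    if start is None:
        return []

    motions = []
    i = start
    while i < len(headings) - 1:
        first, second = headings[i], headings[i + 1]
        if first[0] == "h1":
            break  # a new top-level section begins
        if first[:2] == ("h2", "FR") and second[:2] == ("h2", "NL"):
            motions.append({"fr": first[2], "nl": second[2]})
            i += 2
        else:
            i += 1  # best effort: skip headings that do not fit the pattern
    return motions
```

Headings that do not fit the FR/NL pair pattern are skipped rather than raising, so an irregular report yields fewer motions instead of a crash.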
Note: Contrary to what I first thought, the number in a box is not a reference to find the corresponding proposal text. It is just a sequence number indicating a new motion to be voted on.
At the end of the motion title, a reference in parentheses follows, pointing to the proposal document. For the above screenshot, that is 3849/4.
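A small sketch of pulling that reference out of a title with a regular expression. The function name is hypothetical, and the pattern assumes the reference always has the shape `(dossier/document)` with plain digits at the very end of the title. Other shapes would need extra rules.

```python
import re

# "(3849/4)" at the end of a title: dossier number, slash, document number.
REFERENCE_AT_END = re.compile(r"\((\d+)/(\d+)\)\s*$")


def extract_document_reference(title):
    """Returns (dossier, document) as strings, or None if no reference is found."""
    match = REFERENCE_AT_END.search(title.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)
```

Returning `None` instead of raising keeps the caller free to decide what to do with titles that carry no reference.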
The proposal document reference can be used to find the discussion of the proposal in the plenary report itself, a first indication of what the proposal is about. It can be found after the sequence of the 2 h1 tags with text "Wetsontwerpen" and "Projets de loi" consecutively.
Each pair of 2 consecutive h2 tags that follows in this section indicates a proposal that is discussed in parliament. The reference at the end of the h2 tag can be used to link the proposal discussion to the motion.
All text that comes after the h2 tags of a proposal, before the h2 tags of the next proposal, and after the markers "Discussion des articles" and "Bespreking van de artikelen", can be considered the description of the proposal as given in the plenary:
Additionally, we can opt to store the discussion of the proposal in the plenary. This is all text between "Discussion générale" / "Algemene bespreking" and "Discussion des articles" / "Bespreking van de artikelen":
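As a rough illustration of the two marker-based rules, assuming the text of one proposal has already been extracted into a list of paragraph strings (that input format and the function names are assumptions, not what the repository does):

```python
GENERAL_MARKERS = ("Discussion générale", "Algemene bespreking")
ARTICLE_MARKERS = ("Discussion des articles", "Bespreking van de artikelen")


def _marker_indexes(paragraphs, markers):
    """Indexes of the paragraphs that contain any of the given markers."""
    return [i for i, text in enumerate(paragraphs) if any(m in text for m in markers)]


def split_proposal_text(paragraphs):
    """Splits one proposal's paragraphs into 'discussion' and 'description'.

    Either value is None when its markers are missing (best effort).
    """
    general = _marker_indexes(paragraphs, GENERAL_MARKERS)
    articles = _marker_indexes(paragraphs, ARTICLE_MARKERS)
    result = {"discussion": None, "description": None}
    if articles:
        # Description: everything after the last "articles" marker.
        result["description"] = "\n".join(paragraphs[articles[-1] + 1:])
        if general and general[-1] < articles[0]:
            # Discussion: between the general markers and the articles markers.
            result["discussion"] = "\n".join(paragraphs[general[-1] + 1:articles[0]])
    return result
```

The markers are matched as case-sensitive substrings, so a speaker who mentions "de algemene bespreking" in lowercase does not trigger a split.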
Additionally, the formal documents explaining the proposal in full detail can be fetched using this reference, from this page: https://www.dekamer.be/kvvcr/showpage.cfm?section=/flwb&language=nl&cfm=ListDocument.cfm. For example, the 3849 reference above points to https://www.dekamer.be/kvvcr/showpage.cfm?section=/flwb&language=nl&cfm=/site/wwwcfm/flwb/flwbn.cfm?legislat=55&dossierID=3849, which is the following:
The above page points to the document with the initial proposal, but also to the formulated amendments. The example reference 3849/4 from before points to the 004 amendment (see the bottom of the above screenshot).
Note that from the proposal & amendments page, we can also extract interesting metadata, such as which politicians authored these proposals and amendments. This way, we can also investigate which politicians are very active proposal writers.
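Building the dossier link from a reference could look like this. The URL pattern is copied from the 3849 example; that `legislat` and `language` can be varied in the same way is an assumption I have not verified, and both function names are hypothetical.

```python
def dossier_url(dossier_id, legislature=55, language="nl"):
    """Builds the dossier page URL, following the pattern of the 3849 example."""
    return (
        "https://www.dekamer.be/kvvcr/showpage.cfm"
        f"?section=/flwb&language={language}"
        f"&cfm=/site/wwwcfm/flwb/flwbn.cfm?legislat={legislature}&dossierID={dossier_id}"
    )


def document_label(document_number):
    """Zero-pads the document number: the '4' in 3849/4 is listed as '004'."""
    return str(document_number).zfill(3)
```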
More on amendments:
Every motion is a vote on one proposal. Either just one vote on the proposal, or multiple votes on amendments, such as here:
Amendments are announced with the words "Vote sur l'amendement n". Again, for each amendment, at the end of the "Vote sur l'amendement..." sentence, a reference is mentioned to the proposal text, which contains the proposed amendment.
Nearly finished. See https://github.com/transparentdemocracy/voting-data/tree/feature/issue-%2325-proposal-descriptions. I still need to make my unit test work fully, and run it on the full set of plenary reports. But you can already have a look and apply the same thinking and the find_next_siblings_before_tag() function to the extraction of other info in other issues. (@karel1980 I think you said you were interested in such a find_next_siblings_before_tag() function.)
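For readers who have not opened the branch: the idea behind such a function is "take the following siblings until a given tag shows up". The stand-in below only illustrates that idea on a flat list of `(tag, text)` tuples. It is not the signature or implementation of find_next_siblings_before_tag() in the repository, and the name `siblings_before_tag` is made up for this example.

```python
from itertools import takewhile


def siblings_before_tag(siblings, stop_tags):
    """Returns the leading (tag, text) items up to, but not including,
    the first item whose tag is in stop_tags."""
    return list(takewhile(lambda item: item[0] not in stop_tags, siblings))


following = [("p", "first paragraph"), ("p", "second paragraph"),
             ("h2", "next proposal"), ("p", "belongs to the next proposal")]
description = siblings_before_tag(following, {"h2"})
```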
Working unit test. Updated code: https://github.com/transparentdemocracy/voting-data/tree/feature/issue-%2325-proposal-descriptions I'll now test against all plenary reports and regenerate the full plenaries.json.
Thanks for your analysis. I've started work on an implementation based on many of these ideas.
The inconsistencies are still a PITA to deal with :)
e.g. ip125x.html
In ip162x.html there is only one vote; it's announced by "Naamstemming over de ordemotie". Etc.
It's going to be a matter of having a parsing strategy that makes 'best effort' attempts at finding structure, using rules that we can extend and improve as we go.
Yes. Best effort. I'd consider it a success if our code can process all plenary reports without runtime errors, even though it doesn't extract the info from some of them correctly. If it works "rather decently" for the reports of the last months I think we already have a good start... We can make it better over time.
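One way to structure such a best-effort approach, as a sketch only: an ordered list of rule functions that each return a result or `None`, plus a driver that never lets one bad report stop the run. All names here are hypothetical, and the two example rules are simplified to plain substring checks.

```python
def apply_rules(text, rules):
    """Tries each rule in order; the first non-None result wins."""
    for rule in rules:
        result = rule(text)
        if result is not None:
            return result
    return None


def process_reports(reports, parse):
    """Parses every report; collects failures instead of stopping on them."""
    parsed, failed = {}, {}
    for name, content in reports.items():
        try:
            parsed[name] = parse(content)
        except Exception as error:  # deliberately broad: best effort
            failed[name] = error
    return parsed, failed


# Two simplified example rules, based on the announcements seen so far.
def regular_votes(text):
    return "votes" if "Naamstemmingen" in text else None


def order_motion_vote(text):
    return "order motion" if "Naamstemming over de ordemotie" in text else None


RULES = [regular_votes, order_motion_vote]
```

New special cases then become one more function appended to `RULES`, and the ratio `len(parsed) / len(reports)` gives a figure to track while improving the rules.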
Updated code & updated plenaries json & markdown in main branch ( @Karel Vervaeke this is a merge of our merge branch back to main). Shows the website team our updated model:
To display "proposal discussions" in the plenaries, use the title of the first proposal of every proposal discussion. That's the main proposal to be discussed. I still need to improve my proposal processing. Only 10% of the plenary reports are successfully processed. There's one error that occurs often, so I should be able to improve this substantially in the coming days.
This is done by now, in main branch. 90% of plenary reports are processed without running into exceptions on proposal discussion parsing. The remaining 10% are special cases to look at later.
(motions extraction is still underway, but that was not the subject of this issue. See #22.)
Many proposal descriptions no longer look high-quality. It may be a regression I introduced this evening, or it may have been there for a couple of days already.
This impacts what the website team gets to see on their prototype. Our prototype is only as clear as the descriptions of the proposals voted for.
Example: