mysociety / theyworkforyou

Keeping tabs on the UK's parliaments and assemblies
http://www.theyworkforyou.com/

Better parsing of nested sections #1750

Open ajparsons opened 6 months ago

ajparsons commented 6 months ago

Related to https://github.com/mysociety/parlparse/issues/171 - but I think this can be improved just in display.

So there's something a bit off about how TWFY is parsing some complicated debates:

The navigation structure assumes: header starts, header ends, header starts, header ends.

But in practice, headers are sometimes nested:

e.g. https://www.theyworkforyou.com/debates/?id=2020-06-30d.191.3

logically contains all the votes in the following 'debates' - but these are separated off because of the new header.

Parliament's own Hansard site, by contrast, brings them all together on one page: https://hansard.parliament.uk/Commons/2020-06-30/debates/581DFFF9-B3ED-4B76-9F51-A1F2325334A6/ImmigrationAndSocialSecurityCo-Ordination(EUWithdrawal)Bill

In practice, the problem I have is making the linking clearer between a vote and the debate.

Currently there isn't a good link in the tree, because the parent debate contains just the text of the amendment (which is useful) but not the discussion - while the top-level debate (which I guess we could link to instead) does not contain the vote itself.
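
To make the linking problem concrete, here is a minimal sketch of resolving each vote to the headings it sits under by walking a flat list of items in document order. The `Item` type, the `link_divisions` helper, the element kinds other than `minor-heading`, and all the ids are assumptions for illustration, not TWFY's actual data model.

```python
from typing import NamedTuple


class Item(NamedTuple):
    kind: str  # assumed kinds: "major-heading", "minor-heading", "speech", "division"
    id: str    # hypothetical ids, not real TWFY gids


def link_divisions(items):
    """Map each division id to the most recent major and minor heading ids."""
    links = {}
    major = None
    minor = None
    for item in items:
        if item.kind == "major-heading":
            # A new top-level section resets the minor heading.
            major, minor = item.id, None
        elif item.kind == "minor-heading":
            minor = item.id
        elif item.kind == "division":
            links[item.id] = {"major": major, "minor": minor}
    return links


items = [
    Item("major-heading", "major-1"),
    Item("speech", "speech-1"),
    Item("minor-heading", "minor-1"),
    Item("division", "division-1"),
    Item("minor-heading", "minor-2"),
    Item("division", "division-2"),
]
links = link_divisions(items)
```

With a mapping like this, a vote page could link both to the amendment text (the minor heading) and to the top-level debate where the discussion happened.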

dracos commented 6 months ago

The issue is how the parsing code detects (or doesn't detect) headings, which has always been a problem, see e.g. https://github.com/mysociety/parlparse/issues/53 . I think Parliament's output is bad in the other direction, in that "New Clause 7" (the "heading" of the second vote on that page) is output as pure body text, with no real way of noticing it's something new.

If you look at the source https://www.theyworkforyou.com/pwdata/scrapedxml/debates/debates2020-06-30d.xml you'll see we have it as:

<minor-heading id="uk.org.publicwhip/debate/2020-06-30d.269.0" nospeaker="true" colnum="269" time="18:00:00" url=""> New Clause 7 </minor-heading>
<minor-heading id="uk.org.publicwhip/debate/2020-06-30d.269.1" nospeaker="true" colnum="269" time="18:00:00" url=""> Time limit on immigration detention for EEA and Swiss nationals </minor-heading>

I thought there was code to combine two minor-headings like that together on import if it found them, but presumably there's not or it's not working in some way. I see why it might be nice to have them all on one page, but that does make large debates even more unwieldy. But you'd have to introduce more structure to the output if you wanted to do anything with this, I think, and it's never been worth the effort involved.
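
For reference, combining two adjacent minor-headings could look roughly like the sketch below. This is not the existing import code; the `merge_adjacent_minor_headings` helper, the `": "` separator, the `publicwhip` wrapper element, and the trailing `speech` element are all assumptions made so the example is self-contained.

```python
import xml.etree.ElementTree as ET

# The two minor-headings are copied from the scraped XML; the wrapper
# element and the speech are hypothetical, added for illustration only.
SAMPLE = """<publicwhip>
<minor-heading id="uk.org.publicwhip/debate/2020-06-30d.269.0" nospeaker="true" colnum="269" time="18:00:00" url=""> New Clause 7 </minor-heading>
<minor-heading id="uk.org.publicwhip/debate/2020-06-30d.269.1" nospeaker="true" colnum="269" time="18:00:00" url=""> Time limit on immigration detention for EEA and Swiss nationals </minor-heading>
<speech id="hypothetical-speech">Example speech text</speech>
</publicwhip>"""


def merge_adjacent_minor_headings(root, sep=": "):
    """Fold each minor-heading into the minor-heading directly before it."""
    previous = None
    for child in list(root):  # iterate over a copy, since we remove children
        if (
            child.tag == "minor-heading"
            and previous is not None
            and previous.tag == "minor-heading"
        ):
            # Keep the first heading's id and attributes, join the texts.
            previous.text = (previous.text or "").strip() + sep + (child.text or "").strip()
            previous.tail = child.tail
            root.remove(child)
        else:
            previous = child
    return root


root = merge_adjacent_minor_headings(ET.fromstring(SAMPLE))
headings = [h.text for h in root.findall("minor-heading")]
```

Keeping the first heading's id is a design choice in this sketch; existing links to the second heading's id would need redirecting if something like this were done for real.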

ajparsons commented 6 months ago

Yeah, I was specifically looking for debates with multiple votes to test a motion extractor - and that flushed out ones like this where things are more spread out than I expected.

If we sketched out (and funded) a project around clearer understanding of amendments and legislative process - a good approach to this would fit into it.