transparentdemocracy / voting-data

Voting behavior data extracted from plenary reports of the Belgian federal government.

Extracted proposal ID and motion ID should be different (motion extraction reworking) #22

Closed sandervh14 closed 1 month ago

sandervh14 commented 1 month ago

For the past week or so, a proposal and its motion have been assigned one and the same ID, but in the plenary reports the proposal and the motion each have their own ID.

The current implementation stores proposal and motion as a 1-1 relationship. This means that if a proposal gets several motions (e.g. for amendments), we create duplicate proposals in the JSON output we deliver to the website team.

Related to this, and nice to cover at the same time: several motions can be held on one proposal. If we capture this, we will store amendment votes as well.
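A minimal sketch of the one-to-many shape described above, assuming hypothetical `Proposal` and `Motion` dataclasses (the names, fields, and example IDs are illustrative only, not the ones used in extraction.py). Each motion points at its proposal by ID, so a proposal with several amendment motions is stored once instead of being duplicated:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    id: str     # hypothetical ID, separate from any motion ID
    title: str


@dataclass(frozen=True)
class Motion:
    id: str           # hypothetical ID, separate from the proposal ID
    proposal_id: str  # several motions may point at the same proposal
    description: str


def group_motions_by_proposal(motions):
    """Map each proposal ID to the list of motions held on that proposal."""
    grouped = {}
    for motion in motions:
        grouped.setdefault(motion.proposal_id, []).append(motion)
    return grouped


# Illustrative data: one proposal with two motions on it.
proposal = Proposal(id="p0", title="Example bill")
motions = [
    Motion(id="m0", proposal_id="p0", description="Amendment 1"),
    Motion(id="m1", proposal_id="p0", description="Vote on the whole bill"),
]
grouped = group_motions_by_proposal(motions)
```

With this shape the JSON output could list each proposal once and let motions reference it, which would also leave room for amendment votes.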

Also nice to have, and it may even follow logically from the work above: pulling apart the creation of motions and proposals. They are currently heavily intertwined, created in the same method in extraction.py.

karel1980 commented 1 month ago

I'm thinking about doing something like this to improve parsing. I'll think about how to bring this together.

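import os

# parse_html and PLENARY_HTML_INPUT_PATH come from the project and are not shown here.


def extraction_experiment():
    html = parse_html(os.path.join(PLENARY_HTML_INPUT_PATH, "ip298x.html"))

    def has_border(s):
        # the style attribute may be missing (None), so guard before the substring test
        return bool(s) and "border:solid" in s

    # the numeric section markers are spans drawn with a solid border
    spans = html.find_all("span", attrs={"style": has_border})

    # End marker for the last section. string= matches on the span's text
    # (a bare second argument would match on CSS class instead); this assumes
    # the heading is the span's only content.
    detail_nominatifs = html.find("span", string="DETAIL DES VOTES NOMINATIFS")

    for a, b in zip(spans, spans[1:] + [detail_nominatifs]):
        print("------------------------------")
        print("found numeric marker", a.text)
        for e in elements_between(a, b):
            print(e.text)


def elements_between(element1, element2):
    """Collect the elements after element1, up to but not including element2."""
    elements = []
    current_element = element1.find_next()

    # Compare by identity: == on tags compares their content, so two
    # identical-looking tags would be treated as the same element.
    while current_element is not None and current_element is not element2:
        # avoid copying script tags, that could be bad
        if current_element.name != "script":
            elements.append(current_element)
        current_element = current_element.find_next()

    return elements

karel1980 commented 1 month ago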

Before continuing and changing the model I'd like to make sure I understand things better.

The lexicon explains some terms like motion and proposal, but which is which in the plenary reports is still unclear to me.

The numbers with a solid black border indicate different sections, but they're not the key to understanding the document structure. E.g. the section "Naamstemmingen" in plenary 298 starts between 09.01 and 09.02. The numbers usually indicate a change of speaker, but sometimes the speaker is announced beforehand (particularly the chairman).

I'm going to sleep on it, but perhaps a short sparring session would also be productive.

karel1980 commented 1 month ago

More observations: Plenary 298, item 10 mentions 3 motions, but only 1 is voted on because it has legal priority (sorry if this is a naive translation). The other 2 motions are 'vervallen / caduques' (lapsed).

Item 14 has many amendments. Many have their own vote, but others simply reuse the result of a previous vote (e.g. stemming/vote 6). The text makes this clear. For parsing, the actual vote is in a table with a border around it, while a 'reused' vote just contains a reference to the earlier vote.
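A small sketch of how the two cases could be represented, assuming a hypothetical `VoteEntry` record (the names and fields are made up for illustration, and detecting the reference in the report text is not covered here). An entry either carries its own vote number, when a bordered vote table is present, or the number of the earlier vote it reuses:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VoteEntry:
    label: str                          # e.g. the amendment being decided
    vote_number: Optional[int] = None   # set when an actual vote table is present
    same_as_vote: Optional[int] = None  # set when the text refers to an earlier vote


def resolve_vote_number(entry: VoteEntry) -> Optional[int]:
    """Return the number of the vote whose result applies to this entry."""
    if entry.vote_number is not None:
        return entry.vote_number
    return entry.same_as_vote


# Illustrative data: one real vote and one entry that reuses its result.
entries = [
    VoteEntry(label="amendment A", vote_number=6),
    VoteEntry(label="amendment B", same_as_vote=6),
]
resolved = [resolve_vote_number(e) for e in entries]
```

Keeping the reference explicit, rather than copying the earlier result, would preserve the fact that no separate vote took place.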

sandervh14 commented 1 month ago

Interesting, I had not taken the time to read that lexicon yet, but it's worth it. We can base our next work on the better understanding it gives us.

Thanks for your pointers. I used to think that the numbers "in borders" in the vote section matched the numbers in the section with the propositions. I see now that they do not always match.

On the other hand, now that I've looked at the text more closely too, I think finding referenced documents will be "easier" than I thought. :-)

Yes, let's do the sparring session. Ideally I'd even do it Sherlock-style, with printouts and a red marker to highlight things and draw links, but it's a lot of pages. Let's just discuss tomorrow and put our understanding together.

sandervh14 commented 1 month ago

We're currently working on this. The implementation of motion extraction is being re-worked, see main branch.

sandervh14 commented 1 month ago

This is implemented now. See https://github.com/transparentdemocracy/voting-data/blob/main/data/output/plenary/json/plenaries.json. Motions will have an ID like 55_160_mg_19_m0, proposals will have an ID like 55_160_d13_p0.
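A quick sketch for telling the two kinds of ID apart, based only on the two example IDs above. It assumes the trailing `_m<number>` and `_p<number>` suffixes are what distinguish motions from proposals; the meaning of the other parts is not inferred, and `id_kind` is a hypothetical helper, not part of the repository:

```python
import re

# Assumption: a trailing "_m<number>" marks a motion and a trailing
# "_p<number>" marks a proposal; everything before it is left uninterpreted.
MOTION_ID = re.compile(r"^\w+_m\d+$")
PROPOSAL_ID = re.compile(r"^\w+_p\d+$")


def id_kind(identifier: str) -> str:
    """Classify an ID as 'motion', 'proposal', or 'unknown'."""
    if MOTION_ID.match(identifier):
        return "motion"
    if PROPOSAL_ID.match(identifier):
        return "proposal"
    return "unknown"


kinds = [id_kind("55_160_mg_19_m0"), id_kind("55_160_d13_p0")]
```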