globalgov / manydata

The portal for global governance data
https://manydata.ch
GNU Affero General Public License v3.0

Check if existing functions are ready to update memberships database in manyenviron and manytrade #258

Closed (jaeltan closed this issue 1 year ago)

jaeltan commented 1 year ago

We (potentially) need computational tools to:

henriquesposito commented 1 year ago

Just as a note, some of the issues with updating information on agreements (titles, spelling, and removing duplicated rows in the memberships databases) have been addressed for both manyenviron and manytrade. However, we are now encountering some disparities between IDs in the agreements and memberships databases.

jaeltan commented 1 year ago

code_agreements() has been updated to generate treatyIDs from titles more accurately, so the manyIDs and treatyIDs in the agreements database need to be re-generated to match the IDs in the memberships database.

henriquesposito commented 1 year ago

Thank you @jaeltan for the help!

So the best course of action, in my opinion, is to re-standardise titles (without translating the few that are not in English, since we removed the translation portion of standardise_titles() long ago), re-code treaties, and re-condense "many" IDs for all the databases in {manytrade} and {manyenviron}. This way, all the work done lately in {manypkgs} to improve matching and avoid unnecessary duplication is integrated. I have just pushed some more changes to {manypkgs} so that this can all be done with export_data(); you just have to interactively choose to update IDs in all datasets in the database.

In any case, I think {manyenviron} was the only package where we were not updating manyIDs consistently, and that was due to translation issues. However, it appears that only ECOLEX has some treaty titles not in English (even though we use the "english" title column). So what I propose, for transparency and reproducibility, is adding an extra column in HUGGO for translated titles (we can use deeplr for that) and using titles coming from other datasets as they are.

@jhollway please let us know what you think or if you have any issues with this approach. Thank you.

jaeltan commented 1 year ago

The treatyIDs and manyIDs in the HUGGO and HUGGO_MEM datasets are inconsistent for some agreements with the same Title, Beg, Signature, and Force dates. A list of these agreements is in the attached csv file: manyenviron.csv
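
For illustration, a check of this kind could look roughly like the sketch below. It is written in Python with the standard library only, and it is not the code used to produce the attached file. It assumes each dataset has been read into a list of row dictionaries carrying the Title, Beg, Signature, Force, treatyID, and manyID columns; the function name `find_id_mismatches` and the example rows (including the ID values) are made up for the example.

```python
from collections import defaultdict

# Columns that should jointly identify one agreement (taken from the comment above).
KEY_COLUMNS = ("Title", "Beg", "Signature", "Force")
# Columns whose values are expected to agree across the two datasets.
ID_COLUMNS = ("treatyID", "manyID")


def find_id_mismatches(agreements, memberships):
    """Return agreements present in both tables whose ID pairs differ.

    Both arguments are lists of dicts (one dict per row). The result maps
    each shared (Title, Beg, Signature, Force) key to the ID pairs seen in
    each table. This is a hypothetical helper, not part of {manypkgs}.
    """
    def ids_by_key(rows):
        index = defaultdict(set)
        for row in rows:
            key = tuple(row[col] for col in KEY_COLUMNS)
            index[key].add(tuple(row[col] for col in ID_COLUMNS))
        return index

    agr = ids_by_key(agreements)
    mem = ids_by_key(memberships)

    mismatches = {}
    for key in agr.keys() & mem.keys():
        if agr[key] != mem[key]:
            mismatches[key] = {
                "agreements": sorted(agr[key]),
                "memberships": sorted(mem[key]),
            }
    return mismatches


# Made-up rows; the ID strings are placeholders, not real treatyIDs or manyIDs.
example_agreements = [
    {"Title": "Example Fisheries Agreement", "Beg": "1990-01-01",
     "Signature": "1990-01-01", "Force": "1991-06-01",
     "treatyID": "EXAMPLE_ID_1", "manyID": "EXAMPLE_ID_1"},
    {"Title": "Example Water Agreement", "Beg": "1985-03-02",
     "Signature": "1985-03-02", "Force": "1986-01-01",
     "treatyID": "EXAMPLE_ID_3", "manyID": "EXAMPLE_ID_3"},
]
example_memberships = [
    {"Title": "Example Fisheries Agreement", "Beg": "1990-01-01",
     "Signature": "1990-01-01", "Force": "1991-06-01",
     "treatyID": "EXAMPLE_ID_2", "manyID": "EXAMPLE_ID_2"},
    {"Title": "Example Water Agreement", "Beg": "1985-03-02",
     "Signature": "1985-03-02", "Force": "1986-01-01",
     "treatyID": "EXAMPLE_ID_3", "manyID": "EXAMPLE_ID_3"},
]

for key, ids in find_id_mismatches(example_agreements, example_memberships).items():
    print(key[0], ids)
```

Only agreements that appear in both tables are compared; rows that exist in just one of them would need a separate check.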

henriquesposito commented 1 year ago

The issues mentioned above have mostly been solved. We now have 2 smaller scale issues (thank you @jaeltan for the help and for identifying these additional issues):

1 - manyIDs have different activities generated for the same bilateral agreement in the agreements and memberships databases (this is likely caused by fuzzy matching). One solution could be to remove "activity" from the fuzzy matching?

2 - Sometimes there is a linkage for the manyIDs in the agreements database but not in the memberships database. I am not sure why this is happening.

jhollway commented 1 year ago

If 'activity' is unreliable for fuzzy matching, then it would make sense to drop it. Can we do a sensitivity analysis to see what the effect of dropping it would be?
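
One rough way to frame such a sensitivity analysis is sketched below in Python (standard library only). It rests on loud assumptions: that the activity appears in an ID as a bracketed segment such as `[FSD]`, that the ID strings shown are invented, and that exact set matching is an acceptable stand-in for the actual fuzzy matcher in {manypkgs}. The helper names are hypothetical. The idea is to count how many IDs match across the two databases with and without the activity segment, and how many previously distinct IDs collapse into one once it is dropped.

```python
import re

# Assumption: the activity is a bracketed segment inside the ID, e.g. "[FSD]".
ACTIVITY_SEGMENT = re.compile(r"\[[^\]]*\]")


def strip_activity(many_id):
    """Remove the bracketed activity segment from an ID string."""
    return ACTIVITY_SEGMENT.sub("", many_id)


def activity_sensitivity(agreement_ids, membership_ids):
    """Compare cross-database matches with and without the activity segment.

    Exact matching is used here as a simple stand-in for fuzzy matching.
    """
    agr, mem = set(agreement_ids), set(membership_ids)
    agr_stripped = {strip_activity(i) for i in agr}
    mem_stripped = {strip_activity(i) for i in mem}
    return {
        # IDs shared by both databases when activity is kept.
        "matched_with_activity": len(agr & mem),
        # IDs shared by both databases when activity is dropped.
        "matched_without_activity": len(agr_stripped & mem_stripped),
        # Distinct agreement IDs that become identical once activity is dropped.
        "collapsed_in_agreements": len(agr) - len(agr_stripped),
        "collapsed_in_memberships": len(mem) - len(mem_stripped),
    }


# Invented IDs for illustration only.
example_agreement_ids = [
    "ARG-CHL[FSD]_1990A",
    "ARG-CHL[WTR]_1990A",
    "BRA-PER[TRD]_1985A",
]
example_membership_ids = [
    "ARG-CHL[FSH]_1990A",
    "BRA-PER[TRD]_1985A",
]

print(activity_sensitivity(example_agreement_ids, example_membership_ids))
```

More matches without activity would suggest that activity is causing the mismatches, while a high collapse count would suggest that dropping it merges agreements that should stay separate.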