Check if existing functions are ready to update memberships database in manyenviron and manytrade

jaeltan commented 1 year ago

We (potentially) need computational tools to:

- source signature, ratification, entry-into-force (EIF), and end dates computationally from treaty texts or websites (refer to agreements$HUGGO) in the first instance
- update information on agreements (correct titles, remove duplicates) in memberships databases using verified information in agreements$HUGGO
- resolve conflicts in dates within rows
- correct spelling and formatting (could this be done with the countryregex data and functions like manypkgs::standardise_titles()?)
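
As a rough illustration of the "resolve conflicts in dates within rows" item, here is a minimal sketch in Python (stdlib only, not {manypkgs} code). It only flags rows for review and does not decide which date is right. The column names and the ordering rule (signature, then ratification, then entry into force, then end) are assumptions for the example; late accessions, for instance, may legitimately break that order.

```python
from datetime import date

# Assumed chronological order of the date columns (hypothetical names).
DATE_ORDER = ["signature", "ratification", "force", "end"]


def find_date_conflicts(row):
    """Return pairs of column names whose dates violate the assumed order."""
    # Missing dates (None) are skipped rather than treated as conflicts.
    present = [(name, row[name]) for name in DATE_ORDER if row.get(name) is not None]
    conflicts = []
    for i, (earlier_name, earlier) in enumerate(present):
        for later_name, later in present[i + 1:]:
            if earlier > later:
                conflicts.append((earlier_name, later_name))
    return conflicts


# Invented example rows, for illustration only.
rows = [
    {"title": "Treaty A", "signature": date(1990, 5, 1),
     "ratification": date(1991, 2, 1), "force": date(1992, 1, 1), "end": None},
    {"title": "Treaty B", "signature": date(1995, 6, 1),
     "ratification": date(1994, 1, 1), "force": date(1996, 1, 1), "end": None},
]

flagged = {r["title"]: find_date_conflicts(r) for r in rows}
```

Rows with a non-empty list in `flagged` would then be checked by hand against the treaty text.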

henriquesposito commented 1 year ago

Just as a note, some of the issues with updating information on agreements (titles, spelling, and duplicated rows) in the memberships databases have been addressed for both manyenviron and manytrade. However, we are now encountering some disparities between the IDs in the agreements and memberships databases.

jaeltan commented 1 year ago

code_agreements() has been updated to generate treatyIDs from titles more accurately, so the manyIDs and treatyIDs in the agreements database need to be re-generated to match the IDs in the memberships database.

henriquesposito commented 1 year ago

Thank you @jaeltan for the help!

So the best course of action, in my opinion, is to re-standardise titles (without translating the few that are not in English, since we removed the translating portion of standardise_titles() long ago), re-code treaties, and re-condense "many" IDs for all the databases in {manytrade} and {manyenviron}. This way, all the work done lately in {manypkgs} to improve matching and avoid unnecessary duplication is integrated. I have just pushed some more changes to {manypkgs} so that this can all be done with export_data(); you just have to interactively choose to update IDs in all datasets in the database.

In any case, I think {manyenviron} was the only package where we were not updating manyIDs consistently, due to translation issues. However, it appears that only ECOLEX has some treaty titles not in English (even though we use the "english" title column). So what I propose, for transparency and reproducibility, is to add an extra column in HUGGO for translated titles (we can use deeplr for that) and to use titles coming from other datasets as they are.

@jhollway please let us know what you think or if you have any issues with this approach. Thank you.

jaeltan commented 1 year ago

The treatyIDs and manyIDs in the HUGGO and HUGGO_MEM datasets are inconsistent for some agreements with the same Title, Beg, Signature, and Force dates. A list of these agreements is in the attached csv file: manyenviron.csv
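
A check of this kind could be sketched as follows, in Python (stdlib only) for illustration and not as the code used to produce the attached file. The column names Title, Beg, Signature, Force, treatyID, and manyID are taken from the comment above; the stateID column, the example rows, and the ID values are invented placeholders and do not follow the real ID format.

```python
from collections import defaultdict

KEY_COLUMNS = ("Title", "Beg", "Signature", "Force")
ID_COLUMNS = ("treatyID", "manyID")


def ids_by_key(rows):
    """Map each (Title, Beg, Signature, Force) key to the set of ID pairs seen."""
    index = defaultdict(set)
    for row in rows:
        key = tuple(row[c] for c in KEY_COLUMNS)
        index[key].add(tuple(row[c] for c in ID_COLUMNS))
    return index


def inconsistent_ids(agreements, memberships):
    """Report keys present in both tables whose ID pairs differ."""
    agr = ids_by_key(agreements)
    mem = ids_by_key(memberships)
    report = {}
    for key in agr.keys() & mem.keys():
        if agr[key] != mem[key]:
            report[key] = {
                "agreements": sorted(agr[key]),
                "memberships": sorted(mem[key]),
            }
    return report


# Invented rows; memberships has one row per member state.
fish = {"Title": "Example Fisheries Agreement", "Beg": "1990-05-01",
        "Signature": "1990-05-01", "Force": "1992-01-01"}
trade = {"Title": "Example Trade Agreement", "Beg": "1985-03-01",
         "Signature": "1985-03-01", "Force": "1986-01-01"}

agreements = [
    {**fish, "treatyID": "EX01_1990A", "manyID": "EX01_1990A"},
    {**trade, "treatyID": "EX02_1985A", "manyID": "EX02_1985A"},
]
memberships = [
    {**fish, "stateID": "AAA", "treatyID": "EX09_1990A", "manyID": "EX09_1990A"},
    {**fish, "stateID": "BBB", "treatyID": "EX09_1990A", "manyID": "EX09_1990A"},
    {**trade, "stateID": "AAA", "treatyID": "EX02_1985A", "manyID": "EX02_1985A"},
]

report = inconsistent_ids(agreements, memberships)
```

Using sets of ID pairs per key means the repeated membership rows for one agreement collapse to a single entry before the comparison.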

henriquesposito commented 1 year ago

The issues mentioned above have mostly been solved. We now have two smaller-scale issues (thank you @jaeltan for the help and for identifying these additional issues):

1 - manyIDs have different activities generated for the same bilateral agreement in the agreements and memberships databases (this is likely caused by fuzzy matching). One solution could be to remove "activity" from the fuzzy matching?

2 - Sometimes there is a linkage for the manyIDs in the agreements database but not in the memberships database. I am not sure why this is happening.

jhollway commented 1 year ago

If 'activity' is unreliable for fuzzy matching, then it would make sense to drop it. Can we do a sensitivity analysis to see what the effect of dropping it would be?
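
One possible shape for such a sensitivity analysis, sketched in Python (stdlib only) and not based on the actual {manypkgs} matching code: run the fuzzy match twice, once with and once without the activity component, and list the records whose match is gained, lost, or changed. The record fields (row, parties, activity, year), the similarity measure (difflib.SequenceMatcher), the 0.9 threshold, and the example records are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def key_string(record, use_activity):
    """Build the string that is fuzzy matched, optionally without activity."""
    parts = [record["parties"], record["year"]]
    if use_activity:
        parts.insert(1, record["activity"])
    return "_".join(parts)


def best_matches(left, right, use_activity, threshold=0.9):
    """Match each left record to its highest-scoring right record, if any."""
    matches = {}
    for l in left:
        scored = [
            (SequenceMatcher(None, key_string(l, use_activity),
                             key_string(r, use_activity)).ratio(), r["row"])
            for r in right
        ]
        score, row = max(scored)  # ties are broken by the larger row label
        if score >= threshold:
            matches[l["row"]] = row
    return matches


def sensitivity(left, right, threshold=0.9):
    """Compare matches made with and without the activity component."""
    with_act = best_matches(left, right, True, threshold)
    without_act = best_matches(left, right, False, threshold)
    return {
        "gained": sorted(k for k in without_act if k not in with_act),
        "lost": sorted(k for k in with_act if k not in without_act),
        "changed": sorted(k for k in with_act
                          if k in without_act and with_act[k] != without_act[k]),
    }


# Invented records: a1/m1 differ only in activity; m2 and m3 share parties and year.
agreements = [
    {"row": "a1", "parties": "ARG-CHL", "activity": "FSH", "year": "1990"},
    {"row": "a2", "parties": "BRA-PER", "activity": "TRD", "year": "1985"},
]
memberships = [
    {"row": "m1", "parties": "ARG-CHL", "activity": "TRD", "year": "1990"},
    {"row": "m2", "parties": "BRA-PER", "activity": "TRD", "year": "1985"},
    {"row": "m3", "parties": "BRA-PER", "activity": "ENV", "year": "1985"},
]

result = sensitivity(agreements, memberships)
```

In this toy data, dropping activity recovers the match for a1, but it also makes a2 ambiguous between two agreements signed by the same parties in the same year. Counting both kinds of cases on the real databases would show whether dropping activity helps more than it hurts.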

globalgov / manydata

Check if existing functions are ready to update memberships database in manyenviron and manytrade #258