Closed by jaeltan 1 year ago
Just as a note, some of the issues with updating information on agreements (titles, spelling) and with removing duplicated rows in the memberships databases have been addressed for both manyenviron and manytrade. However, we are now encountering some disparities between IDs in the agreements and memberships databases.
code_agreements() has been updated to generate treatyIDs from titles more accurately, so the manyIDs and treatyIDs in the agreements database need to be re-generated to match the IDs in the memberships database.
Thank you @jaeltan for the help!
So the best course of action, in my opinion, is to re-standardise titles (without translating the few that are not in English, since we removed the translating portion of standardise_titles() long ago), re-code treaties, and re-condense "many" IDs for all the databases in {manytrade} and {manyenviron}. This way, all the work done lately in {manypkgs} to improve matching and avoid unnecessary duplication is integrated. I have just pushed some more changes to {manypkgs} so that this can all be done with export_data(); you just have to interactively choose to update IDs in all datasets in the database.
In any case, I think {manyenviron} was the only package where we were not updating manyIDs consistently, due to translation issues. However, it appears that only ECOLEX has some treaty titles not in English (even though we use the "english" title column). So what I propose, for transparency and reproducibility, is to add an extra column in HUGGO for translated titles (we can use deeplr for that) and to use titles coming from other datasets as they are.
@jhollway please let us know what you think or if you have any issues with this approach. Thank you.
The treatyIDs and manyIDs in the HUGGO and HUGGO_MEM datasets are inconsistent for some agreements with the same Title, Beg, Signature, and Force dates. A list of these agreements is in the attached csv file: manyenviron.csv
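For anyone wanting to reproduce this kind of check, here is a minimal sketch of the comparison logic. It is written in Python purely for illustration (the packages themselves are R), it is not how the attached csv was produced, and the rows below are made-up toy data. The column names Title, Beg, Signature, and Force come from the description above; the singular treatyID and manyID column names are an assumption.

```python
from collections import defaultdict

# Columns that should identify the same agreement in both datasets.
KEY_COLS = ("Title", "Beg", "Signature", "Force")
# ID columns expected to agree (names assumed for this sketch).
ID_COLS = ("treatyID", "manyID")


def index_ids(rows):
    """Map each (Title, Beg, Signature, Force) key to the set of ID pairs seen."""
    out = defaultdict(set)
    for row in rows:
        key = tuple(row[c] for c in KEY_COLS)
        out[key].add(tuple(row[c] for c in ID_COLS))
    return out


def find_id_mismatches(agreements, memberships):
    """Return keys present in both datasets whose ID pairs differ."""
    agr, mem = index_ids(agreements), index_ids(memberships)
    return {
        key: {"agreements": agr[key], "memberships": mem[key]}
        for key in agr.keys() & mem.keys()
        if agr[key] != mem[key]
    }


# Toy rows, invented for illustration only.
agreements = [
    {"Title": "Treaty A", "Beg": "1990-01-01", "Signature": "1990-01-01",
     "Force": "1991-01-01", "treatyID": "TA_1990A", "manyID": "TA_1990A"},
    {"Title": "Treaty B", "Beg": "1985-05-05", "Signature": "1985-05-05",
     "Force": "1986-01-01", "treatyID": "TB_1985A", "manyID": "TB_1985A"},
]
memberships = [
    {"Title": "Treaty A", "Beg": "1990-01-01", "Signature": "1990-01-01",
     "Force": "1991-01-01", "treatyID": "TA_1990P", "manyID": "TA_1990P"},
    {"Title": "Treaty B", "Beg": "1985-05-05", "Signature": "1985-05-05",
     "Force": "1986-01-01", "treatyID": "TB_1985A", "manyID": "TB_1985A"},
]

mismatches = find_id_mismatches(agreements, memberships)
```

Because memberships data has one row per party, the sketch compares sets of ID pairs per agreement rather than comparing rows one to one.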
The issues mentioned above have mostly been solved. We now have two smaller-scale issues (thank you @jaeltan for the help and for identifying these additional issues):
1 - manyIDs have different activities generated for the same bilateral agreement in the agreements and memberships databases (this is likely caused by fuzzy matching). One solution could be to remove "activity" from the fuzzy matching?
2 - Sometimes there is a linkage for the manyIDs in the agreements database but not in the memberships database. I am not sure why this is happening here.
If 'activity' is unreliable for fuzzy matching, then it would make sense to drop it. Can we do a sensitivity analysis to see what the effect of dropping it would be?
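One way such a sensitivity analysis could be framed is sketched below, in Python for illustration only. It uses exact matching on a set of fields as a stand-in for the real fuzzy matching in {manypkgs}, which it does not reproduce. The field names parties, year, and activity are hypothetical, and the rows are toy data. The idea is to count cross-database matches with and without the dropped field, and list the pairs gained so they can be reviewed by hand.

```python
def match_pairs(agreements, memberships, fields):
    """Index pairs (i, j) whose rows agree on every field in `fields`.

    Exact equality is a stand-in here for the package's fuzzy matching.
    """
    return {
        (i, j)
        for i, a in enumerate(agreements)
        for j, m in enumerate(memberships)
        if all(a[f] == m[f] for f in fields)
    }


def sensitivity(agreements, memberships, fields, drop):
    """Compare matches made with all `fields` against matches without `drop`."""
    with_field = match_pairs(agreements, memberships, fields)
    reduced = [f for f in fields if f != drop]
    without_field = match_pairs(agreements, memberships, reduced)
    return {
        "matches_with": len(with_field),
        "matches_without": len(without_field),
        # Pairs linked only once the field is dropped: candidates for review.
        "gained": sorted(without_field - with_field),
    }


# Toy rows with hypothetical field names, invented for illustration.
agreements = [
    {"parties": "ARG-CHL", "year": 1990, "activity": "A"},
    {"parties": "BRA-PER", "year": 2001, "activity": "A"},
]
memberships = [
    {"parties": "ARG-CHL", "year": 1990, "activity": "P"},
    {"parties": "BRA-PER", "year": 2001, "activity": "A"},
]

report = sensitivity(
    agreements, memberships, fields=["parties", "year", "activity"], drop="activity"
)
```

With exact matching, dropping a field can only add matches. With a real fuzzy matcher the change could go in either direction, so both gained and lost pairs would be worth inspecting.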
We (potentially) need computational tools to: