globalgov / manyenviron

Many data on environmental agreements
https://globalgov.github.io/manyenviron/
GNU Affero General Public License v3.0

Duplicate entries of agreements in different languages #76

Closed: jaeltan closed this issue 6 months ago

jaeltan commented 1 year ago

While updating the manyIDs, we decided to keep the original agreement titles in the ECOLEX and IEADB datasets instead of translating them to English. As a result, some agreements are now listed twice in the database, once under the original-language title and once under the English title, because the English versions were already present in the GNEVAR datasets that we used as the base for HUGGO. I will add the original titles to the HUGGO dataset for the sake of consistency, but we have to decide what to do with the 'duplicate' English entries. I see two options: we could either add the original agreement title as a separate row, or create a new column for translated English titles and combine the two rows into one. I think I prefer the second option, since both rows refer to the same agreement and keeping them separate would produce 'duplicates'. What do you think, @jhollway @henriquesposito?

henriquesposito commented 1 year ago

Thank you very much @jaeltan for looking into this. Indeed, we had so many issues with the translation (and the APIs) that we dropped the feature when standardising titles. If I remember correctly, this affected only a few observations that were not translated even though we always used the "english" title variable in ECOLEX, for example. One option is to "explicitly" translate the title in the preparation scripts but keep the original title as an extra variable in the datasets that do have non-English titles.

jaeltan commented 1 year ago

Thanks @henriquesposito for your suggestion! Yes, only a few observations are affected. I think adding an extra variable for titles that are not in English is a viable option. We could also add the variable to HUGGO so that we can still identify the treaties that had titles in different languages without adding them as separate observations.

jhollway commented 1 year ago

OrigTitle or similar would be a good variable (i.e. extra column) to add where it doesn't already exist (I think we had something similar in the GENG database?).
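
For illustration, here is a minimal sketch of the "combine the two rows into one" option discussed above. It is written in Python rather than the package's R preparation scripts and is not taken from them. The `Lang` field, the manyID values, and the titles are all invented for the example; only `manyID`, `Title`, and `OrigTitle` echo names used in this thread.

```python
from collections import defaultdict


def merge_translated_titles(rows):
    """Collapse rows that share a manyID into a single row.

    Assumes each row is a dict with "manyID", "Title" and a hypothetical
    "Lang" field ("en" for English titles). The English title is kept in
    "Title" and the first non-English title is moved to "OrigTitle".
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["manyID"]].append(row)

    merged = []
    for many_id, group in grouped.items():
        english = [r["Title"] for r in group if r["Lang"] == "en"]
        original = [r["Title"] for r in group if r["Lang"] != "en"]
        merged.append({
            "manyID": many_id,
            # Fall back to the original title if no English row exists.
            "Title": english[0] if english else original[0],
            # None when the agreement only ever had an English title.
            "OrigTitle": original[0] if original else None,
        })
    return merged


# Invented example rows: the first two describe the same agreement.
rows = [
    {"manyID": "EXMPL_1990A", "Title": "Convention sur l'exemple", "Lang": "fr"},
    {"manyID": "EXMPL_1990A", "Title": "Convention on the Example", "Lang": "en"},
    {"manyID": "OTHER_2001A", "Title": "Agreement on Something Else", "Lang": "en"},
]

merged = merge_translated_titles(rows)
for row in merged:
    print(row)
```

With this shape, each agreement stays a single observation, and the treaties that had a non-English title can still be identified by a non-empty `OrigTitle`.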