ropensci / c14bazAAR

R Package - Download and Prepare C14 Dates from Different Source Databases
https://docs.ropensci.org/c14bazAAR
GNU General Public License v2.0
30 stars 12 forks source link

Country thesaurus includes entries that aren't countries #117

Closed joeroe closed 2 years ago

joeroe commented 4 years ago

I noticed that fix_database_country_name() currently returns some country names that aren't actually (or currently) countries:

This might not be a bug exactly—these labels do contain useful geographic information—but I wouldn't say it's expected behaviour since the name of the function implies a sovereign country will be returned. So if they remain in the thesaurus, maybe that should be documented somewhere?

I also spotted some inconsistencies in names for the same country:

And some mistakes in the corrected names:

nevrome commented 4 years ago

Thank you very much for pointing out these issues!

fix_database_country_name() uses two methods to determine "correct" country names:

  1. Comparison with a manually curated thesaurus table
  2. Fuzzy matching to countrycode::codelist[["country.name.en"]]

The following issues were wrong entries in the thesaurus and just fixed in b31ef977a28def0dd55b49246fa17bae8cb79001:

Corsica Crete Czechia/Czech Republic Lybia (Libya?) Tunesia (Tunisia?)

The other mentioned entities (Channel Islands, Sardinia, Yugoslavia) do exist in countrycode::codelist[["country.name.en"]] and as this is our major point of reference I think it's better to keep them as they are.

Yugoslavia is a clear case, because this former country covers multiple modern countries and it would not be clear to which one the respective date should be attributed. I think for the purpose of this function we should keep sunken empires alive :smile:. I'm less sure about Sardinia and Channel Islands, though.

joeroe commented 4 years ago

Thanks! Definitely agree with leaving Yugoslavia. The Channel Islands are weird, and Sardinia has been part of Italy for a long time...

I spotted these when trying to join the full database to Natural Earth (which uses ISO/FIPS names) based on country names. Had I read the fix_database_country_name() documentation properly, I would have spotted that I could just have standardised them with countrycode::countryname(), so yes consistency is definitely the winner!

nevrome commented 4 years ago

Hm - we could omit some rows of countrycode::codelist for our lookup to get rid of Sardinia and other less useful entries. What do you think, @dirkseidensticker ?

nevrome commented 3 years ago

I had a shower thought regarding this issue: We could very well get rid of countrycode::codelist and instead extent our own country thesaurus. We would probably not loose much, but gain full control over the country list. Would also make our code more simple.

nevrome commented 3 years ago

We decided to indeed implement this suggested change. @yesdavid offered to implement a first draft :+1:

nevrome commented 2 years ago

This is not relevant any more, as we deprecated fix_database_country_name() in v.3.0 without a replacement.