sshaw / normalize_country

Convert country names and codes to a standard.

Problems with punctuation and order #8

Open wu-lee opened 3 years ago

wu-lee commented 3 years ago

I've a dataset which has some problematic names. Specifically, "Palestine, State of" and "Côte d’Ivoire" (with a non-ASCII apostrophe) are not matched.

The en.yml data file contains these relevant entries:

PS:
  aliases:
  - Palestinian Territories
  - Palestinian Territory
  alpha2: PS
  alpha3: PSE
  fifa: PLE
  ioc: PLE
  iso_name: Palestinian Territory, Occupied
  numeric: "275"
  official: State of Palestine
  short: Palestine
  emoji: "\U0001F1F5\U0001F1F8"
  shortcode: ":flag-ps:"
CI:
  alpha2: CI
  alpha3: CIV
  fifa: CIV
  ioc: CIV
  iso_name: Côte D'Ivoire
  numeric: "384"
  official: Republic of Côte D'Ivoire
  short: Ivory Coast
  emoji: "\U0001F1E8\U0001F1EE"
  shortcode: ":flag-ci:"

So it's a "close but no cigar" situation in both cases. I'm not sure how to solve this.

I'm wondering if the library should erase punctuation and flatten to ASCII when comparing? This would handle the different choice of apostrophe and any missing/altered accents in Côte D'Ivoire, but perhaps that goes too far. I can't currently think of country names it would break, but that's not to say there aren't any. And come to think of it, the official name is also a bit weird, mixing "Republic of" (English) with D'Ivoire (French).
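To make the idea concrete, here is a minimal sketch of such a comparison key using only Ruby's standard library. `fold_name` is a hypothetical helper and not part of normalize_country. It replaces punctuation with a space rather than deleting it, on the assumption that "Guinea-Bissau" and "Guinea Bissau" should compare equal.

```ruby
# Hypothetical helper, not part of the normalize_country API.
# Builds a key for comparing names while ignoring accents,
# punctuation, case and extra whitespace.
def fold_name(name)
  name
    .unicode_normalize(:nfd)  # "ô" becomes "o" + a combining circumflex
    .gsub(/\p{Mn}/, "")       # drop the combining marks
    .gsub(/\p{P}/, " ")       # any Unicode punctuation becomes a space
    .downcase
    .split.join(" ")          # collapse runs of whitespace
end

fold_name("Côte d’Ivoire")  # => "cote d ivoire"
fold_name("Cote D'Ivoire")  # => "cote d ivoire"
fold_name("Åland Islands")  # => "aland islands"
```

Both the dataset's spelling and a plain ASCII spelling fold to the same key, so a lookup table keyed on `fold_name` would match either.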

There are other names with an apostrophe. These are going to be problematic, considering the general populace's facility with using punctuation. Likewise punctuation as in Bosnia-Herzegovina, Guinea-Bissau or accents as in Åland Islands, and just alternative spellings like Faeroes.

Palestine, State of does what some of the other names do, putting the main name first and any qualifiers like "State of" after a comma. But it doesn't match in this case. I think this is harder; removing punctuation is one thing, re-arranging word order is another.
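For illustration, the simplest form of that re-arrangement could look like the sketch below. `unqualify` is a hypothetical name, and the code assumes the name contains at most one comma separating the main name from its qualifier.

```ruby
# Hypothetical helper, not part of the normalize_country API.
# Moves a trailing qualifier to the front:
# "Palestine, State of" -> "State of Palestine".
def unqualify(name)
  main, qualifier = name.split(/\s*,\s*/, 2)
  return name if qualifier.nil? || qualifier.empty?

  "#{qualifier} #{main}"
end

unqualify("Palestine, State of")              # => "State of Palestine"
unqualify("Palestinian Territory, Occupied")  # => "Occupied Palestinian Territory"
unqualify("Ivory Coast")                      # => "Ivory Coast"
```

This goes wrong for names where the comma separates list items rather than a qualifier, such as "Bonaire, Sint Eustatius and Saba", which is part of why word re-ordering is the harder problem.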

I see elsewhere in en.yml there are aliases. Perhaps that's a better solution, adding a lot of aliases?

sshaw commented 3 years ago

Hi, thanks for bringing this to my attention.

> I'm wondering if the library should erase punctuation and flatten to ASCII when comparing?

I think this will require a library to address the tricky cases, for example ß to ss. iconv can do this, but one nice thing about this gem is that it has no dependencies.
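As a rough illustration of why ß is a tricky case: it has no Unicode decomposition, so normalizing and stripping combining marks leaves it untouched. A dependency-free approach needs an explicit table for such letters. The sketch below is hypothetical (`SPECIAL` and `to_ascii` are not part of this gem) and the table is deliberately incomplete.

```ruby
# Hypothetical and deliberately incomplete: letters that have no
# canonical decomposition and so need an explicit mapping.
SPECIAL = {
  "ß" => "ss", "ø" => "o", "Ø" => "O",
  "æ" => "ae", "Æ" => "AE", "đ" => "d", "ł" => "l"
}.freeze

def to_ascii(text)
  text
    .gsub(Regexp.union(SPECIAL.keys), SPECIAL)  # table lookup per match
    .unicode_normalize(:nfd)                    # split off accents
    .gsub(/\p{Mn}/, "")                         # drop the accents
end

to_ascii("Straße")     # => "Strasse"
to_ascii("Malmö")      # => "Malmo"
to_ascii("São Paulo")  # => "Sao Paulo"
```

The cost of avoiding a dependency is that the table has to be maintained by hand.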

> And come to think of it, the official name is also a bit weird, mixing "Republic of" (English) with D'Ivoire (French).

In the US this is common for names. For example, Gary Dell'Abate uses the ' (Italian) or Pedro Muñoz uses the ñ (Spanish). I also see US papers using São Paulo, Malmö, etc...

> I see elsewhere in en.yml there are aliases. Perhaps that's a better solution, adding a lot of aliases?

That seems to be the best option. I can see this becoming unmanageable, but it seems we're far away from that.

Given your examples and similar existing cases, we should have aliases with the non-ASCII apostrophe and "Palestine, State of" variants. But one question here: is this name part of a standard somewhere? I'm not sure how to apply this to the others. We have some already and not others. For example: State of Israel but not Israel, State of.
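If the apostrophe aliases were generated rather than typed by hand, a sketch might look like this. `APOSTROPHES` and `apostrophe_aliases` are hypothetical names, and only the ASCII apostrophe and U+2019 (the right single quotation mark that word processors tend to insert) are assumed to matter.

```ruby
# Hypothetical helper, not part of the normalize_country API.
APOSTROPHES = ["'", "\u2019"].freeze  # ASCII apostrophe, right single quote

# Returns the spellings of +name+ that differ only in which
# apostrophe character is used, excluding +name+ itself.
def apostrophe_aliases(name)
  pattern = Regexp.union(APOSTROPHES)
  return [] unless name.match?(pattern)

  APOSTROPHES.map { |a| name.gsub(pattern, a) }.uniq - [name]
end

apostrophe_aliases("Côte D'Ivoire")  # => ["Côte D’Ivoire"]
apostrophe_aliases("Ivory Coast")    # => []
```

The output could be merged into the `aliases` list of each entry in en.yml, which keeps the lookup itself unchanged.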

> Likewise punctuation as in Bosnia-Herzegovina, Guinea-Bissau

I don't think an em dash or en dash is appropriate here. Are there other Unicode dashes that should be covered?

Is there a use case for the name without a dash? I can see it both ways and don't have an issue having an alias without it.
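On the question of other dashes: Unicode groups them under the general category Pd (dash punctuation), which Ruby regular expressions can match as `\p{Pd}`. That category includes the ASCII hyphen-minus, the hyphen U+2010, the non-breaking hyphen U+2011, and the en and em dashes. A hypothetical sketch that maps them all to the ASCII hyphen before lookup:

```ruby
# Hypothetical helper, not part of the normalize_country API.
# Replaces every Unicode dash (general category Pd) with "-".
def normalize_dashes(name)
  name.gsub(/\p{Pd}/, "-")
end

normalize_dashes("Guinea\u2010Bissau")  # => "Guinea-Bissau"
normalize_dashes("Guinea\u2013Bissau")  # => "Guinea-Bissau"
```

The minus sign U+2212 is a math symbol rather than a dash, so it is not covered by `\p{Pd}` and would need separate handling if it turns up in data.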

> or accents as in Åland Islands

Here I think it's fine to add an ASCII alias too

> and just alternative spellings like Faeroes.

Yeah this should be an alias too.

sshaw commented 3 years ago

Check out master for some updates to this.

Is Palestine, State of part of a standard somewhere?

wu-lee commented 3 years ago

Thanks, will check, maybe I can remove some hacks!

The Carmen gem mentioned in the Readme for this project uses the Debian ISO-3166-1 data as a source, and I notice that data includes "Palestine, State of". I just happen to know where to find that. I've not gone to the ISO standard itself to check, which is presumably the most definitive.

My current use-case is to create a SKOS vocabulary of terms (in RDF) for the International Coop Association's database of members' locations and/or territories. Their data is notionally based on the ISO-3166-1 country code system, but they currently use English-language labels instead of IDs in their database, which we need to convert to country codes. Their particular set of labels includes "Palestine, State of" and "Côte d’Ivoire" with the non-ASCII apostrophe. I'm not sure where these labels come from originally. I would hazard a guess that the apostrophe may have been automatically inserted by Word or Excel or something similar.

sshaw commented 3 years ago

> The Carmen gem mentioned in the Readme for this project uses the Debian ISO-3166-1 data as a source, and I notice that data includes "Palestine, State of".

Thanks. At some point I will check that data to make sure it's included.

sshaw commented 3 years ago

Note to self: https://github.com/sshaw/normalize_country/pull/9#issuecomment-807647593