openaq / openaq-ingestor

location name romanization #9

Open russbiggs opened 1 year ago

It would be nice to add a romanized version of location names for locations whose names contain non-Roman characters (CJK, Arabic, Hebrew, Cyrillic, etc.). The goal is not to translate names, since many of these are proper nouns and are not suited to translation.

e.g. 아산시청 -> asan-si cheong

The functionality would be additive only and would not replace the original name, so some consideration of how the romanized name is stored in the DB is also needed.

In the ingestion process, I propose a two-step process:

  1. Identify whether a name has non-Roman characters
  2. Romanize the characters

For the first step, as long as the names come in as Unicode, it seems we can scan for matches across the character ranges of the different scripts (e.g. https://stackoverflow.com/a/50434862) and then identify the general character set.
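A rough sketch of what that range scan could look like, assuming Python and a handful of Unicode block ranges. The ranges are illustrative and not exhaustive, and the names `SCRIPT_RANGES`, `script_of`, and `dominant_script` are made up for this example; nothing here exists in the ingestor today.

```python
from collections import Counter

# Illustrative, non-exhaustive Unicode block ranges (assumption: these
# cover the scripts we care about first; extension blocks are omitted).
SCRIPT_RANGES = {
    "cyrillic": [(0x0400, 0x052F)],                   # Cyrillic + Supplement
    "hebrew": [(0x0590, 0x05FF)],
    "arabic": [(0x0600, 0x06FF), (0x0750, 0x077F)],   # Arabic + Supplement
    "hangul": [(0x1100, 0x11FF), (0xAC00, 0xD7AF)],   # Jamo + Syllables
    "kana": [(0x3040, 0x30FF)],                       # Hiragana + Katakana
    "han": [(0x4E00, 0x9FFF)],                        # CJK Unified Ideographs
}


def script_of(ch):
    """Return the script label for one character, or None if unlisted."""
    cp = ord(ch)
    for script, ranges in SCRIPT_RANGES.items():
        if any(lo <= cp <= hi for lo, hi in ranges):
            return script
    return None


def dominant_script(name):
    """Return the most common listed script in a name, or None if the
    name contains no characters from the ranges above."""
    counts = Counter(s for s in map(script_of, name) if s is not None)
    return counts.most_common(1)[0][0] if counts else None
```

A name where `dominant_script` returns `None` would skip romanization entirely; anything else would be routed to the matching romanizer in step 2. Mixed-script names (e.g. Han plus kana in Japanese) would need more care than a simple majority vote.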

For the romanization, it seems like using individual libraries for each language/character group may be the best approach. I can't find a one-size-fits-all library. One issue we will need to consider: for character sets where the characters have different sounds per language, we may need to identify the language, beyond just the character set used. For example, Persian uses Arabic characters but may have different romanized outputs, and the same applies to any of the many languages that use Cyrillic (Russian, Ukrainian, Bulgarian, Mongolian).
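To make the per-language problem concrete, here is a minimal sketch, assuming Python, with tiny hand-written tables for three Cyrillic-script languages. The tables are deliberately incomplete and only meant to show the same character producing different output. The function name `romanize_cyrillic` and the language codes are hypothetical, and a real implementation would more likely wrap existing per-language libraries.

```python
# Letters that romanize the same way in all three languages (partial list).
SHARED = {
    "а": "a", "в": "v", "д": "d", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t",
}

# Letters whose conventional romanization depends on the language.
LANGUAGE_SPECIFIC = {
    "ru": {"г": "g", "и": "i", "щ": "shch"},
    "uk": {"г": "h", "и": "y", "щ": "shch"},
    "bg": {"г": "g", "и": "i", "щ": "sht"},
}


def romanize_cyrillic(text, lang):
    """Romanize text using the table for the given language code.

    Characters missing from the (incomplete) tables pass through unchanged.
    """
    if lang not in LANGUAGE_SPECIFIC:
        raise ValueError(f"no romanization table for language: {lang!r}")
    table = {**SHARED, **LANGUAGE_SPECIFIC[lang]}
    out = []
    for ch in text:
        roman = table.get(ch.lower())
        if roman is None:
            out.append(ch)                  # unmapped: keep the original
        elif ch.isupper():
            out.append(roman.capitalize())  # preserve leading capital
        else:
            out.append(roman)
    return "".join(out)
```

With these tables, "Гора" comes out as "Gora" for `ru` and `bg` but "Hora" for `uk`, which is why detecting the character set alone is not enough to pick the output.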

Any thoughts? @caparker @majesticio