ngageoint / hootenanny

Hootenanny conflates multiple maps into a single seamless map.
GNU General Public License v3.0

To English Tag Translation #2330

Closed bwitham closed 3 years ago

bwitham commented 6 years ago

I didn't actually realize this until now, but the translator is fed off of an internal word mappings file that we maintain, and there's not much in it. It's an update-as-needed type of setup. It would be nice to have some more extensive translation capabilities available to get more mileage out of the language translator. How to do that effectively, both performance-wise and maintenance-wise, I'm not sure about at this point.

Use cases:

1) Translation from any supported single language to English - e.g. shoebox data or untranslated OpenStreetMap for name tags only for a configurable list of tags

2) Translation from any supported multiple set of languages simultaneously to English - e.g. when creating the implicit tag database the data is the whole planet, so we need to try more than one language to get a translation ** This one may not be feasible.

bwitham commented 6 years ago

Having done a little research, there are some open source statistical machine translation capabilities available. Joshua looks nice b/c they have published several English language packs, so you wouldn't have to train your own models. I'm going to guess that the overall translation quality won't be that of Google Translate, etc. However, we're really only interested in translating names, so that seems simpler translation-wise to me than having to translate entire texts full of longer phrases... I could definitely be wrong about that, though.

The first question that would have to be asked is if it makes any sense to integrate something like that into hoot. You could make the argument for no, and just pre-translate all your data. That would, obviously, involve an extra step and you'd have to write glue code to handle the OSM format as input to the translator (translating names only).
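For what it's worth, the glue code for the pre-translate route could be fairly small. Here's a rough sketch in Python, assuming OSM XML as input. The `translate` argument is a placeholder for whatever external translator ends up being used, and `translate_osm_names` and `TRANSLATABLE_KEYS` are names made up for illustration, not anything that exists in hoot:

```python
import xml.etree.ElementTree as ET

# Hypothetical configurable list of tag keys whose values should be translated.
TRANSLATABLE_KEYS = {"name", "alt_name"}


def translate_osm_names(osm_xml, translate, keys=TRANSLATABLE_KEYS):
    """Return OSM XML with the values of the selected tags run through translate().

    translate is any callable taking a string and returning a string; it stands
    in for the external translation tool (an assumption, not a real API).
    """
    root = ET.fromstring(osm_xml)
    # OSM XML stores tags as <tag k="..." v="..."/> children of nodes, ways and relations.
    for tag in root.iter("tag"):
        if tag.get("k") in keys:
            tag.set("v", translate(tag.get("v", "")))
    return ET.tostring(root, encoding="unicode")
```

You'd call it with something like `translate_osm_names(xml_text, my_translator)`, and every tag not in the key list passes through untouched. A real planet-sized file would need a streaming parser rather than loading the whole document into memory like this does.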

If it was integrated into hoot, it does not seem feasible to do without declaring the language you want to translate from individually for each dataset. If you don't do that, then you'd have to run a translation over every language pack available, which would be very time consuming. Joshua lets you run multi-threaded on a local server, where you could have each thread handle a different language... that seems a little crazy as an approach to me, though. Currently, Joshua has over 50 languages supported, so I'm not sure it's feasible. Also, each language pack is up to a few gigs of data, which is more than many would want to mess with downloading as part of hoot... maybe. Specifying a single from language would work well for a lot of datasets, but would not work as well for the implicit tagging use case where we're trying to translate names over the entire world.

Another option might be to get a list of top English words (excerpt of one by Google) and use an external batch translation tool to build a simple word mapped text file against that subset of top English words for every language you want to support. During translation, first check a name against the excerpt list to see if all the words are already English. If they are not, then you could use all of those mappings within hoot's translator, which is already set up for that translation dictionary format. It still could be computationally expensive, but it wouldn't involve an extra app to deploy or a bunch of large language packs to download, and the code would be simpler overall. The Google list on github is 10k words x 50+ languages = a lookup table of half a million entries or more. For a small number of translated words that's maybe not that expensive, but it probably would add up for a larger number of names... but maybe not if most of your names were already in English (words already in English would only be looked up against the 10k word set). It does get around the problem of specifying a single language, since you're putting all the languages into one dictionary file. It would not be a foolproof to-English translation solution, though, as you're only supporting a subset of the English language.
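To make the dictionary idea a bit more concrete, here's a minimal sketch in Python of the two-step lookup: check the name against the English list first, and only fall back to the merged all-languages mapping for words that aren't already English. The function names and data shapes here are assumptions for illustration; hoot's actual dictionary translator may do its lookups differently.

```python
def is_all_english(name, english_words):
    """True if every whitespace-separated word of name is in the English list."""
    words = name.lower().split()
    return bool(words) and all(w in english_words for w in words)


def to_english(name, english_words, mappings):
    """Translate name word by word.

    english_words: set of lowercase top English words (the ~10k excerpt).
    mappings: dict of lowercase foreign word -> English word, with every
              supported language merged into the one table (an assumption).
    """
    if is_all_english(name, english_words):
        return name  # nothing to do; only the small English set was consulted
    translated = []
    for word in name.split():
        key = word.lower()
        if key in english_words:
            translated.append(word)
        else:
            # Words with no mapping are left as-is rather than guessed at.
            translated.append(mappings.get(key, word))
    return " ".join(translated)
```

With a hash-based set and dict like this, each lookup takes roughly constant time regardless of table size, so the half-million-entry table would mostly cost memory rather than time per name. One thing the merged table doesn't handle is the same spelling meaning different things in two languages; whichever entry is loaded last wins.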