skip names we suspect were sourced from machine transliteration

missinglink commented 2 years ago

this PR skips importing name:* properties for records we determine to be of low quality due to machine transliteration.

note: this issue is discussed in more detail in https://github.com/whosonfirst-data/whosonfirst-data/issues/799

the main issue with these names is that they fill the index with a lot of garbage. One good example is Сан-Франциск (San Francisco), which has 15,806 occurrences in the database.

so while Сан-Франциск is a valid transliteration for the SF in CA... it has then been copied to every other place named 'San Francisco', of which there are many in Spanish-speaking countries such as the Philippines and South American nations.
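To make the duplication concrete, here is a rough sketch of how one might count how often a single name string is repeated across records. It is only an illustration, not code from this PR: the record layout, the `name:rus_x_preferred` key and the `count_name_occurrences` helper are all assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, List


def count_name_occurrences(
    records: Iterable[Dict[str, List[str]]], name_key: str
) -> Counter:
    """Count how many records carry each name string under `name_key`.

    `records` is a hypothetical iterable of property dicts in which each
    name key maps to a list of strings.
    """
    counts: Counter = Counter()
    for record in records:
        # Use a set so one record listing a name twice is counted once.
        for name in set(record.get(name_key, [])):
            counts[name] += 1
    return counts
```

A name that shows up under thousands of unrelated records, as Сан-Франциск does, would stand out at the top of `counts.most_common()`.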

the heuristic here is to first look for a translation into Volapük. This is an obscure constructed language spoken by ~20 people, but it has a very active transliteration bot, so the presence of a Volapük name is a very strong indicator of machine transliteration.

this alone is a fairly good indication that the record has sourced most of its names from Wikidata. However, some of these names may in fact be valid translations (as in the case of SF CA, which has a vol name), so we add a simple population check: the names are only skipped when the population is less than 2000 (tunable).
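The heuristic described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the `name:vol` key prefix, the `population` field, the helper names, and the choice to treat a missing population as 0 are all hypothetical.

```python
from typing import Any, Dict

# Tunable threshold, taken from the description above.
POPULATION_THRESHOLD = 2000


def has_volapuk_name(properties: Dict[str, Any]) -> bool:
    """True if any name key looks like a Volapük name.

    Assumption: name keys are shaped like 'name:vol_x_preferred'.
    """
    return any(key.startswith("name:vol") for key in properties)


def should_skip_names(
    properties: Dict[str, Any], threshold: int = POPULATION_THRESHOLD
) -> bool:
    """Decide whether to skip importing name:* properties for a record."""
    if not has_volapuk_name(properties):
        return False
    # Assumption: a missing or empty population is treated as 0.
    population = properties.get("population") or 0
    return population < threshold


def importable_names(
    properties: Dict[str, Any], threshold: int = POPULATION_THRESHOLD
) -> Dict[str, Any]:
    """Return the name:* properties to import, or {} if they are skipped."""
    if should_skip_names(properties, threshold):
        return {}
    return {k: v for k, v in properties.items() if k.startswith("name:")}
```

With this sketch a large city that has a vol name keeps all of its names, while a small village with a vol name has its name:* properties dropped.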

my general thesis here (which may or may not be universally true) is that it's increasingly rare to find exonyms for places as you walk further down the administrative hierarchy.

for instance, at the country level you'll see things like the endonym de:Deutschland having multiple exonyms such as fr:Allemagne and en:Germany. As you get further down you'll see some exonyms for old major cities such as Londres, Cologne etc., often due to historic trade in the area. As you get down to towns and neighbourhoods it's more common to see local colloquialisms, abbreviations and slang than new foreign words for a place.

I feel like this is mostly true for translations (i.e. "translating words or text from one language into another") but not true for transliterations (i.e. "transferring a word from the alphabet of one language to another").

In testing I found no regressions in the test suites, a ~10% reduction in database size, and much improved performance for worst-case slow queries.

missinglink commented 2 years ago

This change is applied at import time so it requires rebuilding the index in order to test it.

orangejulius commented 2 years ago

Cool, this looks like a very robust and effective change. Really excited to test out the performance improvements :)

orangejulius commented 2 years ago

Ok, I ran this through our full testing suite and overall it looks fine. There were a few cities that now come up differently in some tests, but I don't think it's anything major and we can probably fix them if we'd like.

Some examples are:

- the locality for the Vatican no longer has the Vatican City translation in English (it's marked as a megacity so maybe we can fix that one easily)
- ~Lucerne no longer has the English Lucerne translation (which I think is an endonym but either way)~. I read this one wrong, it was actually an improvement :)

Maybe we can convince the WOF team to remove the vol_ translation from the Lucerne record?

Either way as far as this PR is concerned I assume we are good to go. Performance testing still remains, stay tuned.

pelias / placeholder

skip names we suspect were sourced from machine transliteration #214