Closed missinglink closed 2 years ago
This change is applied at import time so it requires rebuilding the index in order to test it.
Cool, this looks like a very robust and effective change. Really excited to test out the performance improvements :)
Ok, I ran this through our full testing suite and overall it looks fine. There were a few cities that now come up differently in some tests, but I don't think it's anything major and we can probably fix them if we'd like.
Some examples are:
`Vatican City` translation in English (it's marked as a megacity, so maybe we can fix that one easily)
`Lucerne` translation (which I think is an endonym, but either way). I read this one wrong; it was actually an improvement :)
Maybe we can convince the WOF team to remove the `vol_` translation from the Lucerne record?
Either way as far as this PR is concerned I assume we are good to go. Performance testing still remains, stay tuned.
This PR skips importing `name:*` properties for records we determine to be of low quality due to machine transliteration.
The main issue with these names is that they fill the index with a lot of garbage. One good example is `Сан-Франциск` (San Francisco), which has 15,806 occurrences in the database.
So while `Сан-Франциск` is a valid transliteration for the San Francisco in California, it has since been copied to every other place named 'San Francisco', of which there are many in Spanish-speaking countries such as the Philippines and South American nations.
The heuristic here is to first look for a translation into Volapük. This is an obscure constructed language spoken by ~20 people, but it has a very active transliteration bot, so its presence is a very strong indicator of machine transliteration.
This alone is a fairly good indication that the record has sourced most of its names from Wikidata. However, some of these names may in fact be valid translations (as in the case of SF, CA, which has a `vol` name), so we add a simple population check: the heuristic only applies when the population is less than 2000 (tunable).
My general thesis here (which may or may not be universally true) is that it's increasingly rare to find exonyms for places as you walk further down the administrative hierarchy.
For instance, at the country level you'll see things like the endonym `de:Deutschland` having multiple exonyms such as `fr:Allemagne` and `en:Germany`. As you get further down you'll see some exonyms for old major cities such as `Londres`, `Cologne`, etc., often due to historic trade in the area. As you get down to towns and neighbourhoods, it's more common to see local colloquialisms, abbreviations and slang than new foreign words for a place.
I feel like this is mostly true for translations (i.e. "translating words or text from one language into another") but not true for transliteration (i.e. "transferring a word from the alphabet of one language to another").
In testing I found no regressions to the test suites and a reduction of the database size of ~10% and much improved performance of worst-case slow queries.
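For readers who want the heuristic in one place, here is a minimal sketch of the two checks described above (a Volapük name plus a low population). It is not the code from this PR: the function names, the `population` key, the `name:vol` key prefix, and the choice to treat a missing population as 0 are all assumptions made for illustration.

```python
from typing import Any, Dict

# Assumed default; the description above only says the cutoff is 2000 and tunable.
POPULATION_THRESHOLD = 2000


def looks_machine_transliterated(
    props: Dict[str, Any], threshold: int = POPULATION_THRESHOLD
) -> bool:
    """Return True when a record has a Volapük name and a small population."""
    # Assumption: Volapük names are stored under keys starting with "name:vol".
    has_volapuk = any(key.startswith("name:vol") for key in props)
    # Assumption: a missing or empty population counts as 0.
    population = int(props.get("population") or 0)
    return has_volapuk and population < threshold


def filter_names(
    props: Dict[str, Any], threshold: int = POPULATION_THRESHOLD
) -> Dict[str, Any]:
    """Return a copy of props, without name:* keys if the record looks low quality."""
    if not looks_machine_transliterated(props, threshold):
        return dict(props)
    return {k: v for k, v in props.items() if not k.startswith("name:")}


# Hypothetical records: a small village and a large city, both named San Francisco.
village = {
    "wof:name": "San Francisco",
    "population": 500,
    "name:vol_x_preferred": ["San Francisco"],
    "name:rus_x_preferred": ["Сан-Франциск"],
}
city = dict(village, population=800000)
```

With these sample records, `filter_names(village)` drops every `name:*` key and keeps `wof:name`, while `filter_names(city)` returns the record unchanged because its population is above the threshold.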