Closed missinglink closed 4 years ago
I'm going to try to reduce the number of extra languages we're importing in another PR, but this is a nice first step which should have no negatives.
Looking at something like https://raw.githubusercontent.com/whosonfirst-data/whosonfirst-data-admin-de/master/data/856/824/99/85682499.geojson the term "Berlin" appears many times in many languages. We don't need to index those other versions 🤷‍♂️
I found similar functionality in the WOF importer here: https://github.com/pelias/whosonfirst/blob/fee549816a8a29fc5c3daccc66129677f8d552d6/src/components/extractFields.js#L144
It doesn't hurt to have it twice, but I think it's better to have it here since it applies to all sources, including custom data (which was why the bug was originally reported).
This PR adds a new 'post processing' script which aims to delete any names stored in language fields which are duplicated in the `default` language.
From the code comments:
This has the benefit of reducing the index size and also any TF/IDF penalties (or scoring boosts!) which may result in matching terms multiple times in different languages.
These names are not used for display (that's the job of the language service) so it should have no negative effect.
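To illustrate the idea, here is a minimal sketch of the deduplication, not the actual code in this PR. The function name `dedupeLanguageNames` and the shape of the `names` object (a language key mapped to a string or an array of strings) are assumptions made for the example, and it only does an exact string comparison.

```javascript
// Hypothetical helper for illustration only; the name and the shape of
// `names` are assumptions, not the API added by this PR.
// `names` maps a language key to a string or an array of strings, e.g.
// { default: 'Berlin', de: 'Berlin', es: 'Berlín' }
function dedupeLanguageNames(names) {
  const toArray = (value) => (Array.isArray(value) ? value : [value]);

  // Collect every name stored under the 'default' language.
  const defaults = new Set(toArray(names.default ?? []));

  const result = {};
  for (const [lang, value] of Object.entries(names)) {
    // The default language itself is copied through untouched.
    if (lang === 'default') {
      result[lang] = value;
      continue;
    }

    // Keep only the names which are not already present in 'default'
    // (exact string match, no normalization).
    const kept = toArray(value).filter((name) => !defaults.has(name));

    // Drop the language field entirely when nothing is left.
    if (kept.length === 0) continue;

    result[lang] = kept.length === 1 ? kept[0] : kept;
  }
  return result;
}

const example = {
  default: 'Berlin',
  de: 'Berlin',
  en: 'Berlin',
  es: 'Berlín',
  ru: ['Берлин', 'Berlin'],
};

console.log(dedupeLanguageNames(example));
```

In this sketch the `de` and `en` fields disappear because they only repeat the default name, while `es` and `ru` keep the spellings which differ from it.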
@Joxit you're most familiar with the language fields, does this look :+1: to you?