add language_field_trimming post processing script

missinglink commented 4 years ago

This PR adds a new 'post processing' script which deletes any names stored in language fields that duplicate a name in the default language.

From the code comments:

/**
 * Language field post-processing script ensures that language tokens
 * present in the 'default' language are not duplicated in other languages.
 *
 * By default Pelias searches on the `name.default` field, and in some cases
 * it additionally searches on the language of the browser agent.
 *
 * This means that any name which exists in `name.default` need not additionally
 * exist in any of the other language fields.
 *
 * The benefits of this are that we can reduce the index size and any TF/IDF penalties.
 *
 * Example: the term 'Berlin' is indexed in *both* `name.default` and `name.de`.
 * In this case the `de` copy of the string 'Berlin' can be removed as it offers no value.
 */
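
As a rough illustration of the behaviour the comment describes (not the code in this PR), the trimming could look something like the sketch below. The function name `trimLanguageFields` and the shape of the `names` object (language key mapped to a string or an array of strings) are assumptions made for the example, and it compares whole strings exactly, which may differ from how the actual script compares names.

```javascript
// Hypothetical helper for illustration only; not the pelias/model implementation.
// `names` is assumed to map a language key to a string or an array of strings,
// e.g. { default: 'Berlin', de: 'Berlin', fr: ['Berlin', 'Berlín'] }.
function trimLanguageFields(names) {
  // normalise a string-or-array value to an array
  const toArray = (value) => (Array.isArray(value) ? value : [value]);

  // every name present in the default language
  const defaults = new Set(toArray(names.default || []));

  const trimmed = {};
  for (const [lang, value] of Object.entries(names)) {
    // the default language is always kept untouched
    if (lang === 'default') {
      trimmed[lang] = value;
      continue;
    }

    // keep only names which are not already in the default language
    const kept = toArray(value).filter((name) => !defaults.has(name));

    // a language with nothing left is dropped entirely
    if (kept.length === 0) { continue; }

    // assumption: a single remaining name is stored as a plain string
    trimmed[lang] = kept.length === 1 ? kept[0] : kept;
  }
  return trimmed;
}
```

With the 'Berlin' example from the comment, calling this on `{ default: 'Berlin', de: 'Berlin' }` would leave only the `default` entry.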

This has the benefit of reducing the index size and also any TF/IDF penalties (or scoring boosts!) which may result from matching terms multiple times in different languages.

These names are not used for display (that's the job of the language service) so it should have no negative effect.

@Joxit you're most familiar with the language fields, does this look :+1: to you?

missinglink commented 4 years ago

I'm going to try to reduce the number of extra languages we're importing in another PR, but this is a nice first step which should have no negatives.

Looking at something like https://raw.githubusercontent.com/whosonfirst-data/whosonfirst-data-admin-de/master/data/856/824/99/85682499.geojson, the term "Berlin" appears many times in many languages; we don't need to index those other versions 🤷‍♂️

missinglink commented 4 years ago

I found similar functionality in the WOF importer here: https://github.com/pelias/whosonfirst/blob/fee549816a8a29fc5c3daccc66129677f8d552d6/src/components/extractFields.js#L144

It doesn't hurt to have it twice, but I think it's better to have it here since it applies to all sources, including custom data (which was why the bug was originally reported).

pelias / model

add language_field_trimming post processing script #135