pelias / schema

elasticsearch schema files and tooling
MIT License
40 stars 76 forks source link

major synonyms upgrade #453

Closed missinglink closed 4 years ago

missinglink commented 4 years ago

This is a massive overhaul of the synonyms list, adding hundreds of new ones for English, French, Spanish and German.

It can probably replace these PRs:

After the success of https://github.com/pelias/schema/pull/446 I'm going to do a full planet build of this and see how it performs.

cc/ @Joxit I added loads of French synonyms (no multi-word synonyms for now).

I wouldn't suggest merging this until after https://github.com/pelias/api/pull/1444 because applying all these street synonyms at query-time when they aren't required is surely not going to be great from a performance perspective.

missinglink commented 4 years ago

I've fixed the integration tests by loosening the test assertion method.

Previously it would ensure that the tokens listed in the $expected array were equal to the tokens in the elasticsearch index. Now it's ensuring that the tokens listed in the $expected array are all present in the elasticsearch index.

I think this is preferable going forward as some of the tests were becoming brittle and hard to maintain.

The tests helped me catch some missing directional synonyms for english which I've fixed up. There was previously a synonym for see, s (en:Lake) which I have removed since I've not seen that one in the wild.

missinglink commented 4 years ago

I just added https://github.com/pelias/schema/pull/453/commits/b537967c84b4749e17dc0eeaaf5db1f7a4983a22 to ensure https://github.com/pelias/schema/pull/452 is being addressed correctly.

missinglink commented 4 years ago

Looks good to me!

https://www.diffchecker.com/OqkB9xw4

Some improvements from our existing acceptance test suite:

orangejulius commented 4 years ago

Okay, I've run a bunch of acceptance tests against this change and the results are about as we might have expected:

On the positive side, there are many many queries that didn't work at all due to abbreviation issues that are now passing.

However, there are also a few tests that showing regressions. Overall, there were more regressions than improvements, so total test pass rate actually went down (for autocomplete tests at least).

Regressions

A good example of the type of regressions we can see are queries like 15 Call Street Screenshot_2020-06-22_15-18-06

The c, calle, call, [etc] synonym line from the new Spanish file is probably what's causing this particular issue. We knew that single character synonyms are especially dangerous, but c is very commonly used as an abbreviation for calle in Spain, so :shrug:.

Performance

There's also almost certainly a performance impact to this change: I wasn't able to run the regular autocomplete acceptance tests, even very slowly, on our standard sized dev Elasticsearch instance. I was seeing timeouts and had to jump to a larger instance size. We'll need to do some more testing to quantify exactly what the difference is though.

Going forward

I think at the the very least we'll want to back out some of the new synonyms in this PR, particularly any single character synonyms. While we definitely want a system to support c/calle and other abbreviations, I don't believe Elasticsearch synonyms are the way to do that. We will need something that allows us to only handle that abbreviation for results in areas where we know that abbreviation is common, such as Spain.

I think a way we can do this is to move towards a system where we are generating multiple Elasticsearch documents from a single record in our data, and generating variations of the input at query time. This sort of system is already under discussion for, and would solve, our current top issue with alt names and scoring penalties. While more complex, it would allow us to handle abbreviations with more precision and flexibility.

Edit: another thing that might let us buy a little more time with the current system is to use synonym expansion (at index time only, so we'll need https://github.com/pelias/api/pull/1444) so that only calle is replaced with calle, c (something like calle => calle, c, call, etc).

missinglink commented 4 years ago

Nice testing notes! just so we're on the same page, are those perf/quality measurements with or without https://github.com/pelias/api/pull/1444?

orangejulius commented 4 years ago

@missinglink I actually ended up running everything with and without https://github.com/pelias/api/pull/1444. There's only minor differences in the results, but we should examine them more closely too.

missinglink commented 4 years ago

Okay cool, let's chat about it on a call, I just added https://github.com/pelias/api/pull/1444/commits/9d4c6db2be4b22dbfadd0d6276d593a52eb8d208, prior to that the peliasQuery change was only for search.

I also just added https://github.com/pelias/schema/pull/453/commits/e135146c75402092f0d5f53bc6865bd2103c04b4 here to remove those single-letter street synonyms.

missinglink commented 4 years ago

I discovered an undesirable behaviour...

As mentioned above in https://github.com/pelias/schema/pull/453#discussion_r443466161 I was pleased to find that synonyms dont 'explode' (ie they don't apply virally).

Turns out that this is only true within a token filter, so it's true that street, st straße, st will not create an association street == straße.

There problem is that where we have saint, st in another token filter then indexing saint will result in saint, st, street, straße, ...

And to add to the confusion this is dependent on the order which the token filters are specified in the pipeline.

This makes sense logically because if the synonyms/personal_titles filter emits a st token and passes it to the synonyms/streets filter, I would have preferred at this point that since the token is of type SYNONYM that it wouldn't generate more synonyms, but this is not the case 🤷

Good news is that I've been looking for an excuse to use the multiplexer tokenfilter and this is the perfect situation.

How the multiplexer works is that it takes all the tokens in and then runs all the filters independently of each other and pipes the resulting tokens back to the main pipeline.

So where the the regular token filter pipeline is sequential the multiplexer is synchronic.

This is all fixed up in https://github.com/pelias/schema/pull/453/commits/7b018012b9e60ee38a8cbf018d3d6ad57b1308d7 I also cleaned up some of the logic of which filter runs when, to make it more consistent.

missinglink commented 4 years ago

That was a bit wordy, here's a picture:

# before

elyzer --index pelias --analyzer peliasStreet 'saint'
...
{0:saint,st,street,str,stre,stree,strt}
# after

elyzer --index pelias --analyzer peliasStreet 'saint'
...
{0:saint,st}
missinglink commented 4 years ago

rebased against master

orangejulius commented 4 years ago

Just ran our tests against the latest build from this branch, and the results are positive! Of course queries like r gay lussac, paris no longer work, because we've disabled those single character synonyms (for now), but otherwise things are good.

I also confirmed that this PR pairs really well with https://github.com/pelias/api/pull/1444 where we use the peliasQuery analyzer at search time. With that PR, our abbreviation suite shows lots of improvements: Screenshot_2020-06-26_08-54-29

With the master branch of API as a baseline, there are actually a few regressions, since we end up being a bit more loose with synonyms than we'd currently like: Screenshot_2020-06-26_08-54-44

orangejulius commented 4 years ago

Okay so, after quite a bit of testing of both result quality and performance, we are happy with this PR.

From our tests, 99th percentile response times for the autocomplete endpoint increase by about 7-8%. This is notable but not significant enough to avoid merging this PR. Otherwise, results are much improved and it also includes a foundation on which to add more synonyms as needed. :tada: