Closed missinglink closed 4 years ago
I've fixed the integration tests by loosening the test assertion method.
Previously it would ensure that the tokens listed in the $expected array were equal to the tokens in the elasticsearch index. Now it's ensuring that the tokens listed in the $expected array are all present in the elasticsearch index.
I think this is preferable going forward as some of the tests were becoming brittle and hard to maintain.
The tests helped me catch some missing directional synonyms for english which I've fixed up.
There was previously a synonym for see, s
(en:Lake) which I have removed since I've not seen that one in the wild.
I just added https://github.com/pelias/schema/pull/453/commits/b537967c84b4749e17dc0eeaaf5db1f7a4983a22 to ensure https://github.com/pelias/schema/pull/452 is being addressed correctly.
Looks good to me!
https://www.diffchecker.com/OqkB9xw4
Some improvements from our existing acceptance test suite:
Okay, I've run a bunch of acceptance tests against this change and the results are about as we might have expected:
On the positive side, there are many many queries that didn't work at all due to abbreviation issues that are now passing.
However, there are also a few tests that showing regressions. Overall, there were more regressions than improvements, so total test pass rate actually went down (for autocomplete tests at least).
A good example of the type of regressions we can see are queries like 15 Call Street
The c, calle, call, [etc]
synonym line from the new Spanish file is probably what's causing this particular issue. We knew that single character synonyms are especially dangerous, but c
is very commonly used as an abbreviation for calle
in Spain, so :shrug:.
There's also almost certainly a performance impact to this change: I wasn't able to run the regular autocomplete acceptance tests, even very slowly, on our standard sized dev Elasticsearch instance. I was seeing timeouts and had to jump to a larger instance size. We'll need to do some more testing to quantify exactly what the difference is though.
I think at the the very least we'll want to back out some of the new synonyms in this PR, particularly any single character synonyms. While we definitely want a system to support c/calle
and other abbreviations, I don't believe Elasticsearch synonyms are the way to do that. We will need something that allows us to only handle that abbreviation for results in areas where we know that abbreviation is common, such as Spain.
I think a way we can do this is to move towards a system where we are generating multiple Elasticsearch documents from a single record in our data, and generating variations of the input at query time. This sort of system is already under discussion for, and would solve, our current top issue with alt names and scoring penalties. While more complex, it would allow us to handle abbreviations with more precision and flexibility.
Edit: another thing that might let us buy a little more time with the current system is to use synonym expansion (at index time only, so we'll need https://github.com/pelias/api/pull/1444) so that only calle
is replaced with calle, c
(something like calle => calle, c, call, etc
).
Nice testing notes! just so we're on the same page, are those perf/quality measurements with or without https://github.com/pelias/api/pull/1444?
@missinglink I actually ended up running everything with and without https://github.com/pelias/api/pull/1444. There's only minor differences in the results, but we should examine them more closely too.
Okay cool, let's chat about it on a call, I just added https://github.com/pelias/api/pull/1444/commits/9d4c6db2be4b22dbfadd0d6276d593a52eb8d208, prior to that the peliasQuery
change was only for search.
I also just added https://github.com/pelias/schema/pull/453/commits/e135146c75402092f0d5f53bc6865bd2103c04b4 here to remove those single-letter street synonyms.
I discovered an undesirable behaviour...
As mentioned above in https://github.com/pelias/schema/pull/453#discussion_r443466161 I was pleased to find that synonyms dont 'explode' (ie they don't apply virally).
Turns out that this is only true within a token filter, so it's true that street, st
straße, st
will not create an association street == straße
.
There problem is that where we have saint, st
in another token filter then indexing saint
will result in saint, st, street, straße, ...
And to add to the confusion this is dependent on the order which the token filters are specified in the pipeline.
This makes sense logically because if the synonyms/personal_titles
filter emits a st
token and passes it to the synonyms/streets
filter, I would have preferred at this point that since the token is of type SYNONYM
that it wouldn't generate more synonyms, but this is not the case 🤷
Good news is that I've been looking for an excuse to use the multiplexer tokenfilter and this is the perfect situation.
How the multiplexer
works is that it takes all the tokens in and then runs all the filters
independently of each other and pipes the resulting tokens back to the main pipeline.
So where the the regular token filter pipeline is sequential the multiplexer is synchronic.
This is all fixed up in https://github.com/pelias/schema/pull/453/commits/7b018012b9e60ee38a8cbf018d3d6ad57b1308d7 I also cleaned up some of the logic of which filter runs when, to make it more consistent.
That was a bit wordy, here's a picture:
# before
elyzer --index pelias --analyzer peliasStreet 'saint'
...
{0:saint,st,street,str,stre,stree,strt}
# after
elyzer --index pelias --analyzer peliasStreet 'saint'
...
{0:saint,st}
rebased against master
Just ran our tests against the latest build from this branch, and the results are positive! Of course queries like r gay lussac, paris no longer work, because we've disabled those single character synonyms (for now), but otherwise things are good.
I also confirmed that this PR pairs really well with https://github.com/pelias/api/pull/1444 where we use the peliasQuery
analyzer at search time. With that PR, our abbreviation suite shows lots of improvements:
With the master branch of API as a baseline, there are actually a few regressions, since we end up being a bit more loose with synonyms than we'd currently like:
Okay so, after quite a bit of testing of both result quality and performance, we are happy with this PR.
From our tests, 99th percentile response times for the autocomplete endpoint increase by about 7-8%. This is notable but not significant enough to avoid merging this PR. Otherwise, results are much improved and it also includes a foundation on which to add more synonyms as needed. :tada:
This is a massive overhaul of the synonyms list, adding hundreds of new ones for English, French, Spanish and German.
It can probably replace these PRs:
After the success of https://github.com/pelias/schema/pull/446 I'm going to do a full planet build of this and see how it performs.
cc/ @Joxit I added loads of French synonyms (no multi-word synonyms for now).
I wouldn't suggest merging this until after https://github.com/pelias/api/pull/1444 because applying all these street synonyms at query-time when they aren't required is surely not going to be great from a performance perspective.