blackmad closed this 4 years ago
This looks like exactly the right approach to me. I love that it includes code to programmatically regenerate the synonyms.
There's some code you can uncomment to overwrite the expected schema:
(I've always found these massive JSON comparisons to be a bit meaningless as a result, but they do also catch things. In some repositories we use various utilities for pretty-printing a human-readable diff of the two JSON objects, which we should standardize on.)
You don't have to start with a full planet build, but realistically we usually test most schema changes with a full planet build before merging. This one would definitely qualify, as I bet there are some far-reaching (hopefully mostly positive) impacts of this change.
Over in https://github.com/pelias/pelias/issues/718 I discuss ideas for building up a framework and process where we can quickly and easily test on smaller builds, and then move on to full planet builds only when some basic initial testing has passed.
On the one hand, I'd love to make more progress on that front. On the other hand, we have made planet builds a lot faster over the years, so it may not be as important.
1094 2000th St, West Lincoln Township, IL, USA
😡
thanks for the detailed thought + response
re the immediate points
On Wed, Jun 3, 2020 at 8:24 AM Peter Johnson notifications@github.com wrote:
@missinglink commented on this pull request.
I spent an hour thinking this through today and I think it's pretty much good to merge 🎉
Code review:
- the tokenizer pattern `[\s,/\\-]+` seems to split on hyphens, so you should be able to omit the hyphenated synonyms in `remove_alpha_ordinals.txt`.
- the new token filter was added in 4 places but only two of them are covered by tests.
- contraction of multi-token phrases to single-token ones can produce unintuitive token positions, such that phrase matching fails to work as expected (e.g. `[0:two, 1:hundred, 2:st] -> [0:200, 2:st]`; note that the token positions are no longer adjacent).
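The first point can be checked with a toy re-implementation of that tokenizer pattern (a sketch only, not the actual Elasticsearch analyzer chain):

```python
import re

# Toy model of the tokenizer pattern quoted in the review: it splits on
# whitespace, commas, forward/back slashes and hyphens, which is why the
# hyphenated synonym variants may be redundant.
TOKEN_SPLIT = re.compile(r"[\s,/\\-]+")

def tokenize(text):
    return [t for t in TOKEN_SPLIT.split(text) if t]

# "twenty-first st" tokenizes the same way as "twenty first st"
```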
Discussion:
I don't love the way we currently handle ordinals in remove_ordinals (in both that it's a one-way contraction and that it's based on a regular expression), which is totally my doing and I wish there was a better way.
My preference would be for us to produce the following terms (for all numbers < 1000) as bi-directional synonyms:
["200th", "two hundredth", "two-hundredth", "200", "two-hundred", "two hundred"]
Unfortunately this is not easy to achieve in elasticsearch short of spelling them all out or possibly using something like hunspell https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-hunspell-tokenfilter.html which would introduce deployment complexity.
So for now at least we're able to generate the following synonyms:
["200th", "200"]
And with this PR we can add another term (at least for numbers < 50):
["21st", "twenty first", "21"]
Some possible areas for improvement:
- Can we extend this to cover two == 2? What would happen then for "two hundred"?
- Can we support numbers > 50? Roughly what percentage of ordinal streets are not covered?
- Does adding this to peliasQuery have a perf penalty (presumably slight)? What about if we generated 1000 lines?
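The synonym lines discussed above could be generated programmatically. Here's a hedged sketch (not the PR's actual generator; the function names are illustrative) covering numbers below 50, emitting both hyphenated and space-separated spellings:

```python
ONES_ORD = ["", "first", "second", "third", "fourth", "fifth",
            "sixth", "seventh", "eighth", "ninth"]
TEENS_ORD = ["tenth", "eleventh", "twelfth", "thirteenth", "fourteenth",
             "fifteenth", "sixteenth", "seventeenth", "eighteenth", "nineteenth"]
TENS = {20: "twenty", 30: "thirty", 40: "forty"}
TENS_ORD = {20: "twentieth", 30: "thirtieth", 40: "fortieth"}

def numeric_suffix(n):
    # 1 -> 'st', 2 -> 'nd', 3 -> 'rd', everything else (incl. 11-13) -> 'th'
    if 10 <= n % 100 <= 13:
        return "th"
    return {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")

def ordinal_words(n):
    # spelled-out ordinal for 1 <= n < 50, e.g. 21 -> 'twenty first'
    if n < 10:
        return ONES_ORD[n]
    if n < 20:
        return TEENS_ORD[n - 10]
    tens, ones = (n // 10) * 10, n % 10
    return TENS_ORD[tens] if ones == 0 else f"{TENS[tens]} {ONES_ORD[ones]}"

def synonym_line(n):
    # comma-delimited terms are bi-directional synonyms in Elasticsearch
    words = ordinal_words(n)
    terms = [f"{n}{numeric_suffix(n)}", words, words.replace(" ", "-"), str(n)]
    return ",".join(dict.fromkeys(terms))  # drop duplicates, keep order
```

For example `synonym_line(21)` produces `21st,twenty first,twenty-first,21`, one line per number in the synonyms file.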
It might be worth adding a test to ensure that partially completed inputs such as "twen" and "twenty fi" work as expected.
Unfortunately there's no fix for the token positions issue AFAIK, unless it was fixed in Elasticsearch 7.
One other thought is that we probably don't need to do anything at query time in the peliasQuery analyzer.
The rationale is that all the prefix ngrams will have been generated already at index time so it won't have any effect until the complete synonym has been input anyway?
maybe I'm misunderstanding something about how we configure ES.
I think the synonym filters are one-way, i.e. this change tells the indexer "turn every instance of 'twenty-first' into '21' in your index",
which means that a street that's in OSM as "21st st" will only be indexed as "21st st" and "21 st" (due to the existing remove-ordinals logic),
such that a (geocode) query for "twenty first st", without a query-time modification, would not match the "21st st" we indexed?
No, that's not correct: if you use a comma to delimit the synonyms they are bi-directionally synonymous, so for example (this is how it's configured in your PR):
```yaml
analyzer: peliasIndexOneEdgeGram
input: '1 1 1'
# (the @pos keys show the token positions)
expected: |-
  { '@pos0': [ '1' ], '@pos1': [ '1' ], '@pos2': [ '1' ] }
actual: |-
  { '@pos0': [ '1', 'f', 'fi', 'fir', 'firs', 'first' ], '@pos1': [ '1', 'f', 'fi', 'fir', 'firs', 'first' ], '@pos2': [ '1', 'f', 'fi', 'fir', 'firs', 'first' ] }
```
The other option is to use the `=>` syntax, which acts as a replacement instead of a synonym.
There's more info in the top of some of the other synonyms files which is copied from the ES/SOLR docs: https://github.com/pelias/schema/blob/master/synonyms/custom_admin.txt
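The difference between the two rule syntaxes can be modelled with a toy expander (a simplified sketch, nothing like the real Lucene synonym filter, and handling single-token terms only):

```python
def expand(rule, token):
    """Expand one token under a single synonym rule.

    'a,b'    -> bi-directional: either term yields both terms
    'a => b' -> one-way replacement: 'a' is replaced by 'b'
    """
    if "=>" in rule:
        lhs, rhs = (side.strip() for side in rule.split("=>"))
        sources = [t.strip() for t in lhs.split(",")]
        targets = [t.strip() for t in rhs.split(",")]
        return targets if token in sources else [token]
    terms = [t.strip() for t in rule.split(",")]
    return terms if token in terms else [token]
```

With the comma form, `expand("1st,first", "first")` yields both terms; with the `=>` form, `expand("first => 1st", "first")` yields only the replacement.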
For tokens such as `seventy sixth` it's going to produce the tokens `seven` and `six`, which may or may not be desirable; I need to think about that a bit more.
But the alternative is not viable: if you index `seventy sixth` only as `76` at index time, then a partial query for `sev` will never match!
thanks for clarifying
Here's an example of querying ES directly to check which tokens get created: https://gist.github.com/missinglink/f6fbcb64f89099f91f7e7c5654796894
I'm starting to feel like this isn't going to be very clean using only the tools offered by elasticsearch.
I've had a similar problem in the past which I simply wasn't able to solve in ES; I ended up having to solve it with a 'post-processing' step in pelias/model:
https://github.com/pelias/model/blob/master/test/post/seperable_street_names.js
Do you think it would make sense to 'normalize' all versions of an ordinal to a single variation before sending it to ES for indexing?
what's concerning you the most about the ES approach?
I think I'm most concerned about every "1st" becoming "first" in the index such that queries for "123 f...." start to really overwhelm elastic, but maybe my intuition is off there.
re: normalizing before indexing
I'm struggling a little to buffer this whole thing in my head; there are a few different concerns for sure 😄
For the `address_parts.street` field in Elasticsearch it's probably not a huge deal, because it only contains street names, so numbers there are pretty safe to assume are synonymous with ordinals.
For the `name.default` field (which is an ngrams index), we are indexing the full address, so for an address such as `9 Tenth St Apt 3` the following tokens would be inserted into the index:
```
position: tokens
----------------
0: [ '9', 'n', 'ni', 'nin', 'nint', 'ninth' ]
1: [ '1', '10', 't', 'te', 'ten', 'tent', 'tenth' ]
2: [ 's', 'st', 'str', 'stre', 'stree', 'street' ]
3: [ 'a', 'ap', 'apt' ]
4: [ '3', 't', 'th', 'thi', 'thir', 'third' ]
```
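That expansion can be approximated with a small sketch of an edge-ngram analyzer (a toy model; the real peliasIndexOneEdgeGram chain does more, e.g. ICU folding and punctuation handling):

```python
def edge_ngrams(token, min_gram=1):
    # 'apt' -> ['a', 'ap', 'apt']
    return [token[:i] for i in range(min_gram, len(token) + 1)]

def analyze(text, synonyms=None):
    """Per-position tokens: expand synonyms, then edge-ngram every variant."""
    synonyms = synonyms or {}
    positions = []
    for token in text.lower().split():
        variants = [token] + synonyms.get(token, [])
        grams = sorted({g for v in variants for g in edge_ngrams(v)})
        positions.append(grams)
    return positions
```

For instance, with a hypothetical synonym map `{"3": ["third"]}`, position 4 above comes out as `['3', 't', 'th', 'thi', 'thir', 'third']`.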
Some queries which would match this document:
It's possible to use a `MUST` query with `phrase_match` to ensure that the target tokens are within a certain 'slop' (position difference) of each other, but as discussed in our call, we would prefer to move away from `phrase_match` because it doesn't allow omission of directional tokens.
Without `phrase_match` the input tokens MUST appear in the index, but their positions are ignored.
I think this is a problem, because it'll add a lot of false positives to the results, such as `3 Ninth St`, which are unintuitive to the user.
Regarding normalising, I'm still not sure it's a great idea, just thought I'd mention it as an option.
Presumably all streets such as `1st St` would be converted to `First St` before indexing, and yeah, the labels would all reflect that.
I would hope to do it in a way that wouldn't impede finding the street, so you should be able to enter any one of `1 st`, `1st st` and `first st` and exact-match a street indexed in any of those 3 forms too.
I'd just like to find a way of doing that which doesn't also say that every `1` is equal to `['1', 'f', 'fi', 'fir', 'firs', 'first']` 🤷
confirmed that this fixes some queries in the way we expect
coming next:
I'm struggling to do a perfect comparison of this between identical-ish indexes, but here are some queries it fixes from an organic test set:
```
{ "text": "2620 West Second Avenue Suite 2 Denver, CO 80219", "size": "1", "sources": "osm,oa" }
{ "text": "1037 West First Street Santa Ana CA 92703" }
{ "text": "772-9948 District Office 18 North Second Street Saint Clair, PA 17970" }
{ "text": "971-0133 1302 Second Street PO Box 328 Richlands, VA 24641" }
{ "text": "556 Third Avenue New York, NY 10016" }
{ "text": "620 First Avenue South St. Petersburg, FL 33701" }
```
a synthetic example "3rd street bangor me" now works
it seems like the concern we had about blowing up the number of tokens in the index isn't happening: these are one-way synonyms, not bidirectional. Here's a document where the original street was "3rd avenue":
for streets that are in OSM as the alpha-ordinal, like this one https://www.openstreetmap.org/way/798163130, "third" gets the "3" synonym we wanted
this also modified peliasQuery, so the query "third street" now yields "3" at query time, which it previously did not
to summarize
indexing time:
- `3rd street` -> indexed as `3` + `street`
- `3 street` -> indexed as `3` + `street`
- `third street` -> indexed as `3`/`third` (new!) + `street`

query time:
- `3rd street` -> queried as `3` + `street`
- `3 street` -> queried as `3` + `street`
- `third street` -> queried as `3`/`third` (new!) + `street`
which means that all the combinations of query and index time ordinals, numbers and alpha ordinals should work without bloating the index much
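A toy check of that summary (using a hypothetical normalization map, not the real analyzer chain):

```python
# One-way normalization applied identically at index and query time,
# so every spelling of an ordinal lines up without bloating the index.
NORMALIZE = {"3rd": "3", "third": "3"}  # illustrative subset

def normalize_tokens(text):
    return [NORMALIZE.get(t, t) for t in text.lower().split()]

def matches(query, indexed):
    # any combination of '3rd' / '3' / 'third' should match token-for-token
    return normalize_tokens(query) == normalize_tokens(indexed)
```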
I can't repro this locally?
I ran a local ES starting with this Dockerfile:

```
FROM docker.elastic.co/elasticsearch/elasticsearch:7.6.1
RUN /usr/share/elasticsearch/bin/elasticsearch-plugin install analysis-icu
```

I ran it like so: `docker run -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" es`
and then create-index seemed perfectly happy?
```
(base) ➜ schema git:(text_ordinals_in_pelias_schema) ✗ PELIAS_CONFIG=pelias_local.json ./bin/create_index
--------------
 create index
--------------
[put mapping] pelias { acknowledged: true, shards_acknowledged: true, index: 'pelias' }
```
I noticed that streets that are named with alpha-ordinals in OSM, like "fifth street", cannot be searched with numeric ordinals like "5th street" (and vice versa): https://github.com/pelias/schema/issues/441
This change attempts to fix it. Before I try to test it further ...