pelias / schema

elasticsearch schema files and tooling
MIT License
40 stars 76 forks source link

translate alpha-ordinals to numbers #442

Closed blackmad closed 4 years ago

blackmad commented 4 years ago

I noticed that streets that are named with alpha-ordinals in OSM, like "fifth street" , can be search with numeric-ordinals like "5th street" and vice-versa in https://github.com/pelias/schema/issues/441

This change attempts to fix it. Before I try to test it further ...

orangejulius commented 4 years ago

This looks like exactly the right approach to me. I love that it includes code to programmatically regenerate the synonyms.

Expected.json

There's some code you can uncomment to overwrite the expected schema:

https://github.com/pelias/schema/blob/c242e84c866289ff03204152935b5136c5c129e4/test/compile.js#L113-L119

(I've always found these massive JSON comparisons to be a bit meaningless as a result, but they also do catch things. In some repositories, we use various utilities for pretty-printing a human readable diff of the two JSON objects, which we should standardize on).

Planet build

You don't have to start with a full planet build, but realistically we usually test most schema changes with a full planet build before merging. This one would definitely qualify, as I bet there are some far-reaching (hopefully mostly positive) impacts of this change.

Over in https://github.com/pelias/pelias/issues/718 I discuss ideas for building up a framework and process where we can quickly and easily test on smaller builds, and then move on to full planet builds only when some basic initial testing has passed.

On the one hand, I'd love to make more progress on that front. On the other hand, we have made planet builds a lot faster over the years, so it may not be as important.

missinglink commented 4 years ago

1094 2000th St, West Lincoln Township, IL, USA

😡

blackmad commented 4 years ago

thanks for the detailed thought + response

re the immediate points

On Wed, Jun 3, 2020 at 8:24 AM Peter Johnson notifications@github.com wrote:

@missinglink commented on this pull request.

I spent an hour thinking this through today and I think it's pretty much good to merge 🎉

Code review:

  • the tokenizer pattern [\s,/\\-]+ seems to split on hyphen, so you should be able to omit the hyphenated synonyms in remove_alpha_ordinals.txt.
  • the new token filter was added in 4 places but only two were covered by tests.
  • contraction of multi-token phrases to single-token ones can produce unintuative token positions which such that phrase matching fails to work as expected. (eg. [0:two, 1:hundred, 2:st]->[0:200, 2:st], note: the token positions are no longer adjacent)

Discussion:

I don't love the way we currently handle ordinals in remove_ordinals (in both that it's a one-way contraction and that it's based on a regular expression), which is totally my doing and I wish there was a better way.

My preference would be to for us to produce following terms (for all numbers <1000) as bi-directionally synonyms:

["200th", "two hundredth", "two-hundredth", "200", "two-hundred", "two hundred"]

Unfortunately this is not easy to achieve in elasticsearch short of spelling them all out or possibly using something like hunspell https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-hunspell-tokenfilter.html which would introduce deployment complexity.

So for now at least we're able to generate the following synonyms:

["200th", "200"]

And with this PR we can add the another term (at least for numbers <50)

["21st", "twenty first", "21"]

Some possible areas for improvement:

  • Can we extend this to cover two == 2? what would happend then for "two hundred"?
  • Can we support numbers > 50, roughly what percentage of ordinal streets are not covered by this?
  • Does adding this to peliasQuery have a perf penatly (presumably slight), what about if we generated 1000 lines?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/pelias/schema/pull/442#pullrequestreview-423497523, or unsubscribe https://github.com/notifications/unsubscribe-auth/AADMZMCLNYWDXTYNDDL3YFDRUY6GBANCNFSM4NNNBR2A .

-- David Blackman creative technologist & wandering help me find my purpose http://purpose.blackmad.com

missinglink commented 4 years ago

It might be worth adding a test to ensure that partially completed inputs such as "twen" and "twenty fi" work as expected.

missinglink commented 4 years ago

Unfortunately no fix for the token positions issue AFAIK unless it was fixed in elasticsearch 7

missinglink commented 4 years ago

One other thought is that we probably don't need to do anything at query-time in the peliasQuery analyzer.

The rationale is that all the prefix ngrams will have been generated already at index time so it won't have any effect until the complete synonym has been input anyway?

blackmad commented 4 years ago

maybe I'm misunderstanding something about how we configure ES.

I think that the synonyms filters are one way, that this change says to the indexer "make all instances of 'twenty-first' into 21 in your index"

which means that a street that's in OSM called "21st st" will only be indexed as "21st st" and "21 st" (due to existing remove ordinals logic)

such that a (geocode) query for "twenty first st", without a query time modification, would not match the 21st st we indexed?

On Wed, Jun 3, 2020 at 9:24 AM Peter Johnson notifications@github.com wrote:

One other thought is that we probably don't need to do anything at query-time in the peliasQuery analyzer.

The rationale is that all the prefix ngrams will have been generated already and so it won't have any effect until the complete synonym has been input anyway?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/pelias/schema/pull/442#issuecomment-638194153, or unsubscribe https://github.com/notifications/unsubscribe-auth/AADMZMBSRGYSM2D4X5O4MNLRUZFHRANCNFSM4NNNBR2A .

-- David Blackman creative technologist & wandering help me find my purpose http://purpose.blackmad.com

missinglink commented 4 years ago

No that's not correct, if you use a comma to delimit the synonyms they are bi-directionally synonymous, so for example (how it's configured in your PR):

analyzer: peliasIndexOneEdgeGram
input: '1 1 1'

(the @pos thing is showing the token positions)

      expected: |-
        { '@pos0': [ '1' ], '@pos1': [ '1' ], '@pos2': [ '1' ] }
      actual: |-
        { '@pos0': [ '1', 'f', 'fi', 'fir', 'firs', 'first' ], '@pos1': [ '1', 'f', 'fi', 'fir', 'firs', 'first' ], '@pos2': [ '1', 'f', 'fi', 'fir', 'firs', 'first' ] }

The other option is to use the => syntax which acts as a replacement instead of a synonym.

There's more info in the top of some of the other synonyms files which is copied from the ES/SOLR docs: https://github.com/pelias/schema/blob/master/synonyms/custom_admin.txt

missinglink commented 4 years ago

For tokens such as seventy sixth it's going to produce the tokens seven and six which may or may not be desirable, I need to think about that a bit more.

But the alternative is not viable, if you index seventy sixth only as 76 at index time then a partial query for sev will never match!

blackmad commented 4 years ago

thanks for clarifying

On Wed, Jun 3, 2020 at 10:15 AM Peter Johnson notifications@github.com wrote:

For tokens such as seventy sixth it's going to produce the tokens seven and six which may or may not be desirable, I need to think about that a bit more.

But the alternative is not viable, if you index seventy six only as 76 at index time the a partial query for sev will never match!

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/pelias/schema/pull/442#issuecomment-638226614, or unsubscribe https://github.com/notifications/unsubscribe-auth/AADMZMF3QBYZDS4DC6AABZDRUZLHVANCNFSM4NNNBR2A .

-- David Blackman creative technologist & wandering help me find my purpose http://purpose.blackmad.com

missinglink commented 4 years ago

Here's an example of querying ES directly to check which tokens get created: https://gist.github.com/missinglink/f6fbcb64f89099f91f7e7c5654796894

missinglink commented 4 years ago

I'm starting to feel like this isn't going to be very clean using only the tools offered by elasticsearch.

I've had a similar problem in the past which I simply wasn't able to solve in ES, I ended up having to solve it a 'post processing step' in pelias/model https://github.com/pelias/model/blob/master/test/post/seperable_street_names.js

Do you think it would make sense to 'normalize' all versions of an ordinal to a single variation before sending it to ES for indexing?

blackmad commented 4 years ago

what's concerning you the most about the ES approach?

I think I'm most concerned about every "1st" becoming "first" in the index such that queries for "123 f...." start to really overwhelm elastic, but maybe my intuition is off there.

re: normalizing before indexing

On Wed, Jun 3, 2020 at 10:35 AM Peter Johnson notifications@github.com wrote:

I'm starting to feel like this isn't going to be very clean using only the tools offered by elasticsearch.

I've had a similar problem in the past which I simply wasn't able to solve in ES, I ended up having to solve it a 'post processing step' in pelias/model

https://github.com/pelias/model/blob/master/test/post/seperable_street_names.js

Do you think it would make sense to 'normalize' all versions of an ordinal to a single variation before sending it to ES for indexing?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/pelias/schema/pull/442#issuecomment-638239261, or unsubscribe https://github.com/notifications/unsubscribe-auth/AADMZMFQCQJIJZPQPZVZW2DRUZNRHANCNFSM4NNNBR2A .

-- David Blackman creative technologist & wandering help me find my purpose http://purpose.blackmad.com

missinglink commented 4 years ago

I'm struggling a little to buffer this whole thing in my head, there's a few different concerns for sure 😄

For the address_parts.street field in elasticsearch it's probably not a huge deal, because it only contains street names, so numbers there are pretty safe to assume are synonymous with ordinals.

For the name.default street (which is an ngrams index), we are indexing the full address, so for an address such as 9 Tenth St Apt 3 the following tokens would be inserted into the index:

position: tokens
----------------
0: [ '9', 'n', 'ni', 'nin', 'nint', 'ninth' ]
1: [ '1', '10', 't', 'te', 'ten', 'tent', 'tenth' ]
2: [ 's', 'st', 'str', 'stre', 'stree', 'street' ]
3: [ 'a', 'ap', 'apt' ]
4: [ '3', 't', 'th', 'thi', 'thir', 'third' ]

Some queries which would match this document:

It's possible to use a MUST query with phrase_match to ensure that the target tokens are within a certain 'slop' (position difference) of each other, but as discussed in our call, we would prefer to move away from phrase_match because it doesn't allow omission of directional tokens.

Without phrase_match the input tokens MUST appear in the index, however their position is ignored. I think this is a problem, because it'll add a lot of false positives to the results such as 3 Ninth St which are unintuitive to the user.


Regarding normalising, I'm still not sure it's a great idea, just thought I'd mention it as an option.

Presumably all streets such as 1st St would be converted to First St before indexing, and yeah the labels would all reflect that.

I would hope to do it in a way that it wouldn't impede finding the street, so you should be able to enter any one of 1 st, 1st st and first st and exact match a street indexed in any of those 3 forms too.

I'd just like to find a way of doing that that doesn't also say that every 1 is equal to ['1', 'f', 'fi', 'fir', 'firs', 'first'] 🤷

blackmad commented 4 years ago

confirmed that this fixes some queries in the way we expect

image

coming next:

blackmad commented 4 years ago

I'm struggling to do a perfect comparison of this between identical-ish indexes, but here are some queries it fixes from an organic test set:

{ "text": "2620 West Second Avenue Suite 2 Denver, CO 80219", "size": "1", "sources": "osm,oa" } "text": "1037 West First Street Santa Ana CA 92703", "text": "772-9948 District Office 18 North Second Street Saint Clair, PA 17970", "text": "971-0133 1302 Second Street PO Box 328 Richlands, VA 24641", "text": "556 Third Avenue New York, NY 10016", "text": "620 First Avenue South St. Petersburg, FL 33701",

a synthetic example "3rd street bangor me" now works


it seems like the concern we had with blowing up the number of tokens in the index isn't happening - these are one way synonyms, not bidirectional. Here's a document where the original street was "3rd avenue"

image

for streets that are in OSM as the alpha-ordinal, like this one https://www.openstreetmap.org/way/798163130 third gets the "3" synonym we wanted

image


this also modified peliasQuery, so that the query "third street" yields a query-time index of "3", which it previously did not image


to summarize

indexing time 3rd street -> indexed as 3 + street 3 street -> indexed as 3 + street third street -> indexed as 3/third (new!) + street

query time 3rd street -> queried as 3 + street 3 street -> queried as 3 + street third street -> queried as 3/third (new!) + street

which means that all the combinations of query and index time ordinals, numbers and alpha ordinals should work without bloating the index much

blackmad commented 4 years ago

I can't repro this locally?

I ran a local ES starting with this Dockerfile

FROM docker.elastic.co/elasticsearch/elasticsearch:7.6.1
RUN /usr/share/elasticsearch/bin/elasticsearch-plugin install analysis-icu

I ran it like so docker run -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" es

and then create-index seemed perfectly happy?

(base) ➜  schema git:(text_ordinals_in_pelias_schema) ✗ PELIAS_CONFIG=pelias_local.json ./bin/create_index

--------------
 create index 
--------------

[put mapping]    pelias { acknowledged: true, shards_acknowledged: true, index: 'pelias' } 

(base) ➜  schema git:(text_ordinals_in_pelias_schema) ✗