Wordending - GitHub issues

karussell commented 10 years ago

Using wordending is a kind of workaround for edge n-gram searches like

berlin erlange

which would match 'berlinerstraße erlangen' but should ideally only match things like 'berlin erlange*'.
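To make the difference concrete, here is a minimal in-memory sketch of the two matching behaviours. It is not photon's code: the function names are made up for illustration, and whitespace splitting stands in for the real Elasticsearch analyzers.

```python
def edge_ngram_match(query: str, doc: str) -> bool:
    """Edge n-gram style: every query token only has to be a prefix
    of some document token."""
    doc_tokens = doc.lower().split()
    query_tokens = query.lower().split()
    if not query_tokens:
        return False
    return all(
        any(d.startswith(q) for d in doc_tokens)
        for q in query_tokens
    )


def last_term_prefix_match(query: str, doc: str) -> bool:
    """Desired behaviour: all but the last query token must equal a whole
    document token; only the last token is treated as a prefix."""
    doc_tokens = doc.lower().split()
    query_tokens = query.lower().split()
    if not query_tokens:
        return False
    *complete, last = query_tokens
    return (
        all(q in doc_tokens for q in complete)
        and any(d.startswith(last) for d in doc_tokens)
    )
```

With these, `edge_ngram_match("berlin erlange", "berlinerstraße erlangen")` is true because "berlin" is a prefix of "berlinerstraße", while `last_term_prefix_match` rejects that document and still accepts "berlin erlangen".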

If this workaround is used anyway, why not avoid edge n-grams altogether: tokenize the query and do a prefix query for the last term? This would save space and memory at the same quality. The only problem could be performance, but my simple tests on small data show no problems there.
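A rough sketch of what such a query body could look like, built as a plain Python dict in the Elasticsearch query DSL. The field name `collector.default` and the helper name are assumptions for illustration only, and splitting on whitespace is a simplification of real query analysis.

```python
import json


def build_last_term_prefix_query(text: str, field: str = "collector.default") -> dict:
    """Build a bool query: full-token match clauses for all terms but the
    last one, and a prefix clause for the last term.

    The field name is a hypothetical placeholder, not photon's actual mapping.
    """
    # Prefix queries are not analyzed, so lowercase the input ourselves.
    tokens = text.lower().split()
    if not tokens:
        return {"query": {"match_all": {}}}
    *complete, last = tokens
    must = [{"match": {field: token}} for token in complete]
    must.append({"prefix": {field: last}})
    return {"query": {"bool": {"must": must}}}


if __name__ == "__main__":
    print(json.dumps(build_last_term_prefix_query("berlin erlange"), indent=2))
```

The target field would then only need a plain (non n-gram) analyzer, which is where the space saving would come from.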

christophlingg commented 10 years ago

Yohan told me this is something we should try out to see which option has the best tradeoff between performance and storage size.

karussell commented 10 years ago

Yes, sure. Maybe there is even a better, less hacky way of doing this, e.g. something like the cross_fields approach while still using edge n-grams, where it would somehow just boost berlin erlanger* more than berlin* erlangen.

karussell commented 10 years ago

These docs seem to be more current and have a better example

christophlingg commented 10 years ago

Btw, the cross_fields approach might make the collector field obsolete. We introduced it to have equal IDF for all fields, but I wasn't aware of this feature until now...

@yohanboniface, we could even give different scores to each field, not only distinguish between name and collector. And much more important, we wouldn't be forced to copy the default fields into the language-specific collectors each time. This opens the door to multilingual support for all languages in OSM, as we would save a lot of storage...
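For illustration, a `multi_match` query of type `cross_fields` with per-field boosts could be assembled roughly like this. The field names and boost values are invented for the example and are not taken from photon's mapping; whether this actually replaces the collector field would still need testing.

```python
def build_cross_fields_query(text: str, field_boosts: dict) -> dict:
    """Build a cross_fields multi_match query with one boost per field.

    field_boosts maps a (hypothetical) field name to its boost factor.
    """
    # "name^3" is the query DSL shorthand for boosting a field by 3.
    fields = [f"{name}^{boost}" for name, boost in field_boosts.items()]
    return {
        "query": {
            "multi_match": {
                "query": text,
                "type": "cross_fields",
                "fields": fields,
                # Require every query term to appear in at least one field.
                "operator": "and",
            }
        }
    }


# Hypothetical fields: boost the name over city and street.
example = build_cross_fields_query(
    "berlin erlangen",
    {"name.default": 3, "city.default": 2, "street.default": 1},
)
```

Because the terms are looked up across the listed fields directly, no copy into a per-language collector field is needed in this variant.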

karussell commented 10 years ago

Yes, it's a fairly recent feature, but we'll have to test whether it solves our problem.

yohanboniface commented 10 years ago

cross_fields can't work with fuzzy at the moment.

yohanboniface commented 10 years ago

Btw, wordending is not the hottest topic if you have time to spend on search logic. Two things we are working on:

unboost fuzzy matches
cut off tf/idf (certainly using custom similarity, see https://github.com/yohanboniface/elasticsearch-photon-similarity for tests, not yet working)
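As a sketch of the first item only: one common way to rank fuzzy matches below exact ones is a `bool` query with two `should` clauses, an exact match with a higher boost and a fuzzy match with a lower one. The field name and boost values below are assumptions, and this is not necessarily how the positivescoring branch does it.

```python
def build_unboosted_fuzzy_query(
    text: str,
    field: str = "collector.default",  # hypothetical field name
    exact_boost: float = 2.0,          # illustrative values only
    fuzzy_boost: float = 0.5,
) -> dict:
    """Combine an exact and a fuzzy match clause so that documents which
    only match fuzzily score lower than documents which match exactly."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {field: {"query": text, "boost": exact_boost}}},
                    {
                        "match": {
                            field: {
                                "query": text,
                                "fuzziness": "AUTO",
                                "boost": fuzzy_boost,
                            }
                        }
                    },
                ],
                # At least one of the two clauses has to match.
                "minimum_should_match": 1,
            }
        }
    }
```

An exact hit satisfies both clauses and collects both scores, while a misspelled hit only gets the down-weighted fuzzy score.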

On the search logic side, the most up-to-date branch is https://github.com/komoot/photon/tree/positivescoring

Also: add tests! :)

karussell commented 10 years ago

Probably we also need a mailing list. Should I create a google group or one at openstreetmap?

karussell commented 10 years ago

Re tests: do you mean creating a Java test suite (master) or adding others? I could create the Java tests.

yohanboniface commented 10 years ago

> Probably we also need a mailing list. Should I create a google group or one at openstreetmap?

I'd go for geocoding@openstreetmap.org, to keep the discussion open instead of having one mailing list dedicated to photon, then one for pelias, etc.

> Re tests: do you mean creating a Java test suite (master) or adding others? I could create the Java tests.

I was referring to search tests, like those, but all tests are good ;) BTW, Christoph already started on the Java side I think.

karussell commented 10 years ago

Hmm, 'geocoding' mainly sends issues. I would prefer a list dedicated to discussion. A shared one for nominatim and photon would be okay, but similar projects such as GraphHopper and OSRM have separate lists ;)

> I was referring to search tests, like those, but all tests are good ;) BTW, Christoph already started on the Java side I think.

OK, we still need some more lightweight test cases in Java, I think. I've created a PR for that. See e.g. this

christophlingg commented 10 years ago

I like the idea of a mailing list and would go for a photon specific mailing list as geocoding is super generic. Do you know who we can approach for setting up a new osm mailing list?

Great commit, Peter!

karussell commented 10 years ago

@christophlingg I'll give you the mail via mail ;)

komoot / photon

Wordending #56