kpanic / pollicino

Street search, spiced up with multiple storage and geocoders

Investigate how to handle typos/misspellings #7

Closed kpanic closed 10 months ago

kpanic commented 9 years ago

Description: Fuzzy matching "cannot" be combined with edge ngrams in Elasticsearch. See: https://www.found.no/foundation/fuzzy-search/

Scope: Find a way to handle typos.

Some resources that might help (or not)

Elasticsearch guide: http://www.elastic.co/guide/en/elasticsearch/guide/current/phonetic-matching.html

Elasticsearch plugin: https://github.com/elastic/elasticsearch-analysis-phonetic

It might also be worth experimenting with multiple fields indexed in different ways.
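To make the "multiple fields" idea concrete, here is a minimal sketch of one street field indexed three ways (standard, edge ngrams, phonetic) and queried together. It is only an illustration: the field, filter and analyzer names ("street", "autocomplete", "street_edge_ngram", etc.) are made up and not taken from pollicino, the syntax follows recent Elasticsearch versions (older ones use the "string" type and "edgeNGram"), and the phonetic filter assumes the analysis-phonetic plugin linked above is installed.

```python
# Hypothetical index definition: all names below are illustrative assumptions.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                # Prefix tokens for search-as-you-type.
                "street_edge_ngram": {
                    "type": "edge_ngram",
                    "min_gram": 2,
                    "max_gram": 15,
                },
                # Requires the analysis-phonetic plugin.
                "street_phonetic": {
                    "type": "phonetic",
                    "encoder": "double_metaphone",
                    "replace": False,
                },
            },
            "analyzer": {
                "street_autocomplete": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "street_edge_ngram"],
                },
                "street_sounds_like": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "street_phonetic"],
                },
            },
        }
    },
    "mappings": {
        "properties": {
            "street": {
                "type": "text",
                "analyzer": "standard",
                # Multi-fields: the same value indexed in different ways.
                "fields": {
                    "autocomplete": {
                        "type": "text",
                        "analyzer": "street_autocomplete",
                        "search_analyzer": "standard",
                    },
                    "phonetic": {
                        "type": "text",
                        "analyzer": "street_sounds_like",
                    },
                },
            }
        }
    },
}


def build_query(text):
    """Fuzzy match only on the plain field; prefixes and phonetics on sub-fields."""
    return {
        "bool": {
            "should": [
                {"match": {"street": {"query": text, "fuzziness": "AUTO", "boost": 3}}},
                {"match": {"street.autocomplete": {"query": text}}},
                {"match": {"street.phonetic": {"query": text}}},
            ]
        }
    }
```

The point of the split is that fuzziness is applied only to the plain field, so it never has to be combined with the edge ngram tokens; the boost value is a guess and would need tuning.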

missinglink commented 9 years ago

I think you'll find that edgeNGrams handle a large number of spelling mistakes for you, or at least provide a lot of 'near positive' results.

If you want even more flexibility in 'fat fingered typing' you could also index using traditional ngrams and figure out a boosting algorithm to combine the edgeNGrams with the nGrams, and even shingles and standard tokens.
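A rough way to see why this helps, in plain Python rather than Elasticsearch: edge ngrams only survive a typo up to the first wrong character, while ordinary ngrams still overlap after it. The function names and boost values below are invented for illustration, and the scoring is a toy overlap count, not how Elasticsearch ranks documents.

```python
def edge_ngrams(token, min_gram=2, max_gram=15):
    """Prefixes of the token, e.g. 'alex' -> {'al', 'ale', 'alex'}."""
    longest = min(len(token), max_gram)
    return {token[:n] for n in range(min_gram, longest + 1)}


def ngrams(token, n=2):
    """All substrings of length n, e.g. 'alex' -> {'al', 'le', 'ex'}."""
    return {token[i:i + n] for i in range(len(token) - n + 1)}


def score(query, candidate, edge_boost=2.0, ngram_boost=1.0):
    """Toy boosting: weighted count of shared edge ngrams and shared ngrams."""
    q, c = query.lower(), candidate.lower()
    edge_overlap = len(edge_ngrams(q) & edge_ngrams(c))
    ngram_overlap = len(ngrams(q) & ngrams(c))
    return edge_boost * edge_overlap + ngram_boost * ngram_overlap


# "alexnader" is a typo for "alexander": the prefixes stop matching after
# "alex", but the 2-grams "de" and "er" after the typo still match.
typo = "alexnader"
ranked = sorted(["alte", "alexanderplatz"], key=lambda c: score(typo, c), reverse=True)
```

Here the misspelled query still ranks "alexanderplatz" above "alte", because the ngram overlap adds to the short prefix overlap.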

I am investigating combining the results with shingles to reduce the noise from ngrams: https://github.com/pelias/playground/blob/master/ngram/ngam_street_proximity.js

A word of warning: you might want to consider increasing your minGram to 2 or 3. Setting it at 1 simply doesn't scale well; when dealing with tens of millions of documents, the inverted index for tokens such as 'a' is huge!
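The effect is easy to reproduce at toy scale. The sketch below builds a tiny in-memory inverted index of edge ngrams (the street list and the build_index helper are made up for the example) and compares the posting list for 'a' with minGram 1 against minGram 2.

```python
from collections import defaultdict


def build_index(docs, min_gram, max_gram=10):
    """Map each edge ngram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():
            for n in range(min_gram, min(len(word), max_gram) + 1):
                index[word[:n]].add(doc_id)
    return index


# Made-up sample data.
streets = [
    "Alexanderplatz",
    "Ackerstrasse",
    "Adalbertstrasse",
    "Am Kupfergraben",
    "Bergmannstrasse",
]

index_min1 = build_index(streets, min_gram=1)
index_min2 = build_index(streets, min_gram=2)

# With minGram 1, the token 'a' points at every street starting with 'a'.
docs_for_a = len(index_min1["a"])
# With minGram 2, that token is never indexed at all.
a_indexed_with_min2 = "a" in index_min2
```

With five streets the 'a' posting list already covers four of them; with tens of millions of documents the same token would point at a large share of the whole index.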

Have a look at the link I posted above. I use a somewhat hacky/tricky technique to index using 2grams while prefixing single-digit house numbers with a '0' so that the tokens don't get removed.

It's hard to explain, but basically this: https://github.com/pelias/playground/blob/master/ngram/ngram_analysis.js#L151

This technique also returns addresses after a single keypress (so long as it's a number), which is probably what you were after.
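As a rough illustration of the zero-prefix idea (this is not the linked playground code, just a guess at the principle in Python, with invented helper names): a lone digit is shorter than minGram 2 and would produce no tokens, so it is padded to two characters both at index time and at query time.

```python
import re


def pad_single_digits(text):
    """Prefix standalone single digits with '0': '5 main' -> '05 main'."""
    return re.sub(r"\b(\d)\b", r"0\1", text)


def analyze(text, min_gram=2, max_gram=10):
    """Lowercase, pad single digits, then emit edge ngrams per word."""
    tokens = []
    for word in pad_single_digits(text.lower()).split():
        for n in range(min_gram, min(len(word), max_gram) + 1):
            tokens.append(word[:n])
    return tokens


indexed = analyze("5 Main Street")   # contains the token '05'
typed = analyze("5")                 # one keypress still yields a token

# Without padding, the same keypress produces nothing to search with.
unpadded = [w[:n] for w in "5".split() for n in range(2, len(w) + 1)]

matches = bool(set(typed) & set(indexed))
```

The same padding has to be applied on both sides; if only the indexed text were padded, a typed "5" would still be dropped by the 2gram analysis.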