Partial word matching - Githubissues

dannydan412 commented 10 years ago

Hi Oliver,

Do you know if it's possible to add partial word matching support to lunr? Currently it seems that it adds a wildcard to the end of each token but not to the beginning. So for example, if I search for "create" and the document has the word "recreate" it wouldn't find it. I know stemming may be a solution to this, but I prefer not to use it. Also, on a similar subject, how difficult would it be to add fuzzy matching (e.g. if I made a typo and searched for "recreete" instead of "recreate"). Can you give me some pointers as to where in the code I should be looking to address these?

Thank you!!
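To make the two requests concrete, here is a minimal sketch in plain JavaScript. It does not use lunr at all; `prefixMatch`, `infixMatch` and `editDistance` are hypothetical helper names chosen for illustration. It shows why a trailing-wildcard (prefix) match misses "recreate" for the query "create", and how an edit distance could quantify the "recreete" typo.

```javascript
// Hypothetical helpers for illustration only; none of these are lunr APIs.

// Prefix matching: what a trailing wildcard ("create*") amounts to.
function prefixMatch(query, word) {
  return word.startsWith(query);
}

// Partial (infix) matching: the query may appear anywhere in the word.
function infixMatch(query, word) {
  return word.includes(query);
}

// Levenshtein edit distance, computed row by row.
function editDistance(a, b) {
  // prev[j] = distance between the first i-1 chars of a and the first j chars of b
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}
```

With these helpers, `prefixMatch("create", "recreate")` is false while `infixMatch("create", "recreate")` is true, and `editDistance("recreete", "recreate")` is 1, so a fuzzy search tolerating one edit would accept the typo.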

dc-artemis commented 10 years ago

Have you found a way to get partial matching working?

If not: The data structure used is a trie, which lends itself most naturally to return success on full matches or variable tails. I'm looking at changing the trie implementation of the TokenStore modeled off this paper, namely by adding the index table and allowing searches to start from any index of the first letter.

Any thoughts on this approach?

Note: I'm not sure how this affects tf-idf calculation, but I'd guess it breaks it. This may not be a general solution, but usable if you sort your output without tf-idf.
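As a rough illustration of the idea (not the paper's index-table design and not lunr's actual TokenStore), one simple way to let a trie match from any position is to insert every suffix of each word. The `SuffixTrie` class below is a hypothetical, self-contained sketch in plain JavaScript.

```javascript
// Hypothetical sketch: a trie that indexes every suffix of each word,
// so a lookup can start from any character position in the word.
class SuffixTrie {
  constructor() {
    this.root = { children: new Map(), words: new Set() };
  }

  // Insert all suffixes of `word`; each visited node remembers the full word.
  add(word) {
    for (let start = 0; start < word.length; start++) {
      let node = this.root;
      for (const ch of word.slice(start)) {
        if (!node.children.has(ch)) {
          node.children.set(ch, { children: new Map(), words: new Set() });
        }
        node = node.children.get(ch);
        node.words.add(word);
      }
    }
  }

  // Return every indexed word that contains `fragment`, sorted alphabetically.
  search(fragment) {
    let node = this.root;
    for (const ch of fragment) {
      node = node.children.get(ch);
      if (!node) return [];
    }
    return [...node.words].sort();
  }
}

const trie = new SuffixTrie();
["create", "recreate", "company"].forEach((w) => trie.add(w));
const hits = trie.search("create"); // words containing "create"
```

Storing every suffix grows roughly quadratically with word length, so this naive variant trades a lot of memory for infix lookups. It also says nothing about scoring, which is where the tf-idf concern above comes in.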

dc-artemis commented 10 years ago

This approach did work for smaller data sets. However, a quick analysis of my solution shows major inefficiencies, and combined with the broken tf-idf I don't believe it would be supportable in the long term.

guirip commented 7 years ago

Hello all

Is there any new information about how to perform partial word matching?

e.g. My need is the following :

an entry named: 'THAT XYZ COMPANY'
must be retrieved when the user types: 'MPAN'
must work with french words too

If there is no simple way to do this with lunr.js, is there another library you could recommend?

Cheers

olivernn commented 7 years ago

This is available in the upcoming 2.x release of lunr.

You can try this out now on the demo I put up, for example try a search with "*tronau*" and you should see documents matching "astronauts".

There is a fairly stable alpha release that you can try out available on npm, add this to your dependencies:

{
  "dependencies": {
    "lunr": "alpha"
  }
}

As for language support, this will happen, but the existing lunr-languages project has not yet been updated to support a change in interface required for the 2.x branch. Support will be available by the time 2.x is live.

There are small changes in the interface to use 2.x, hopefully the example app makes this clear.

There will be a 2.x release soon, but any feedback on the alpha will be really useful, so if you have a chance to try this out I'd be super grateful.

guirip commented 7 years ago

Great news ! :-)

I've read your example and had a few tries, but I can't get any single result, I must be missing something. Here is a very basic jsfiddle where an alert displays the results count : https://jsfiddle.net/of54k0uk/3/

olivernn commented 7 years ago

I've updated your example. Unlike the current version of lunr, 2.x adds no automatic wildcards, so you have to search for "fast*".

That said, there is something odd going on because it doesn't seem to be handling the all caps query very well, I'll take a look.

olivernn commented 7 years ago

Oops, forgot the link to the fiddle

olivernn commented 7 years ago

I've just pushed alpha.4, which has a fix for the issue around upper case queries. I've also updated the fiddle again to make use of the latest version.

Thanks for taking the time to test the alpha, this kind of testing is really useful. If you find any other problems while testing feel free to open an issue.

guirip commented 7 years ago

Thanks !! It's good to see the results coming out !

As I trigger search from a generic field filled by the user, I automatically add starting and ending wildcards (e.g. If user types abc, then I trigger a lunr search for *abc*.). This is what I notice:

Let's say I have an entry named 'ABC PETSCHE COMPANY'.

[x] If user types 'OMPAN', the entry is returned as expected.
[x] If user types 'COMPAN', the entry is returned as expected.
[ ] If user types 'OMPANY', the entry is not returned... jsfiddle/8. The last wildcard seems to mean 'any', contrary to the starting wildcard which works as '0 or any' (which is good to me!), e.g. search for *abc* and you get the result even if there is nothing before 'abc'.
[ ] If user types ' COMP' (notice the starting space), an exception occurs. Here is jsfiddle/9, simply add a starting space and you'll get uncaught exception: duplicate index.

(edited the last one because it was actually two different issues. I am still trying to reproduce the second on a jsfiddle)

This is what I can tell you for now ;-) I'll let you know if I find anything else. Cheers
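For readers following along, the wrapping step described above could look something like the sketch below. `toWildcardQuery` is a hypothetical helper name, not a lunr function. It assumes whitespace separates terms and trims the input first, so a stray leading space never becomes part of a term. Whether that alone avoids the exception reported above is not confirmed here.

```javascript
// Hypothetical helper: wrap each whitespace-separated term in leading and
// trailing wildcards, e.g. "abc" -> "*abc*". Not part of lunr itself.
function toWildcardQuery(input) {
  return input
    .trim()                              // drop leading/trailing whitespace
    .split(/\s+/)                        // split on runs of whitespace
    .filter((term) => term.length > 0)   // ignore empty terms (blank input)
    .map((term) => "*" + term + "*")     // add wildcards on both sides
    .join(" ");
}
```

The resulting string would then be passed to the index's `search` method. For example, `toWildcardQuery(" COMP")` yields `*COMP*` and blank input yields an empty string, which the caller can choose to skip.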

guirip commented 7 years ago

Ok, I reproduced the last issue I noticed. It just needed more data. Here is the jsfiddle.

[x] Search for 'CAM' : success

[ ] Search for 'COM': failure, you get an exception at https://github.com/olivernn/lunr.js/blob/v2.0.0-alpha.4/lunr.js#L1741:

TypeError: Cannot read property '_index' of undefined
at lunr.Index.query (webpack:///./~/lunr/lunr.js?:1741:32)
at lunr.Index.search (webpack:///./~/lunr/lunr.js?:1647:15)

olivernn commented 7 years ago

Okay, that exception you are getting is a bug in the library. It seems that somehow the term 'com' is expanding to a word that does not exist in the index, which shouldn't happen. Terms with wildcards are expanded using words that do exist in the index (using the TokenStore object). How this object is built is quite complicated, and I'm going to have to try and create a reduced test case to be able to step through the code to understand why this is happening.

There is a workaround that can be added to the library, but I'd rather fix the underlying bug.
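To illustrate the invariant described above (a wildcard term should only ever expand to words that exist in the index), here is a simplified sketch. It is not how lunr's TokenStore works internally: it just filters a plain array of known terms with a regular expression, and `expandWildcard` and `escapeRegExp` are hypothetical names.

```javascript
// Hypothetical sketch: expand a wildcard pattern against a known vocabulary.
// Because the result is filtered from `vocabulary`, every expanded term is
// guaranteed to exist there, so a later lookup cannot come back undefined.

// Escape characters that have special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// In this sketch "*" matches zero or more characters.
function expandWildcard(pattern, vocabulary) {
  const source = pattern.split("*").map(escapeRegExp).join(".*");
  const matcher = new RegExp("^" + source + "$");
  return vocabulary.filter((term) => matcher.test(term));
}

const vocabulary = ["camera", "company", "compare"];
const expanded = expandWildcard("com*", vocabulary); // terms starting with "com"
```

A linear scan like this is only reasonable for small vocabularies; the point is the guarantee that expansion never invents a term.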

olivernn commented 7 years ago

Ok, I just pushed a change which fixes this issue, I've also updated the fiddle. Thanks for reporting this. If you find anything else while testing then please open an issue. It's probably best to open a new issue, just so things are easier to track.

guirip commented 7 years ago

facepalm, you're totally right, I don't know why I have not created new issues.

Thanks for the fix ! will let you know if I find anything suspicious.

olivernn commented 7 years ago

Ha, no problem, I'm just thankful for getting some great feedback and some alpha testing!

olivernn commented 7 years ago

Resolving this now as I think the original, and follow up issues, have been resolved.

Beramos commented 4 years ago

@guirip

Love the '*' + query + '*' hack :+1:

olivernn / lunr.js

Partial word matching #68