olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License

Wildcard search with multiword query #421

Open StarfallProjects opened 4 years ago

StarfallProjects commented 4 years ago

Hi! I am trying to solve some problems we're having with lunr.js in a DocFX site. I am using the latest DocFX (2.46).

I am trying to force all searches to be wildcard searches by appending a * to every query (this is to get round issues of it not finding things like someFunctionName(param) when people search someFunctionName)

This works, but is causing multi-word searches to break. I cannot work out if it is due to stemming, but I don't think it is. For example:

event sourcing -> returns results for "event sourcing"
event sourcing* -> returns results for "event"
sourcing -> returns results for "sourcing"
sourcing* -> no results

"sourcing" is in the index.json in full, so it hasn't been stemmed during build. And as I understand it from reading other issues, wildcard searches should not be stemmed? I get the same result with other multi-word terms. You can try it out here if you wish: https://eventstore.org/docs/ The repo is here: https://github.com/EventStore/documentation

I also looked at the possibility it was searching each term separately (by default, I believe lunr.js splits terms on whitespace?) However, replacing the whitespace with a wildcard, or modifying lunr.tokenizer.separator to remove the search for \s, did not help. Additionally, treatment of whitespace does not explain why sourcing* doesn't work.

Unsure if it could be related to this? https://github.com/olivernn/lunr.js/issues/370 But in the index.json, "sourcing" is often followed by other characters (for example, "Event Sourcing Basics"), so again am unsure why sourcing* would fail. It's almost as though it is being fed through the pipeline and stemmed, despite being a wildcard search? But then why would "sourcing" return results?

Note that we need to use search() rather than query() as by default DocFX uses service workers for the search. I hit trouble trying to use query due to functions not being cloneable (and query requiring a function)

Any suggestions would be much appreciated!

StarfallProjects commented 4 years ago

Having slept on it: could it be that the * wildcard isn't matching to whitespace?

hoelzro commented 4 years ago

Hi @StarfallProjects! Just to make sure I understand your issue - does this fiddle demonstrate the behavior you're seeing? If not, could you provide a small fiddle that does?

StarfallProjects commented 4 years ago

Basically yes. I've added a couple more examples and changed the contents of the index.json to better reflect ours: https://jsfiddle.net/htb32nv8/

hoelzro commented 4 years ago

Ok, thanks! I guess I'm kind of confused about the need to pad every search with wildcards - if it's a matter of not finding someFunctionName(param), I recommend changing the tokenizer separator to split on ( and ) as well as spaces and -, plus whatever other characters would be wise to ignore given the document set. It might be that I lack some context around your usage and needs, though!
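
As a rough illustration of the suggestion above, the sketch below shows what a separator that also splits on parentheses would do to someFunctionName(param). It uses plain String.prototype.split rather than lunr itself, so it only approximates the tokenizer. The helper name roughTokens and the exact character class are assumptions and should be adapted to the document set.

```javascript
// Default-style separator: whitespace and hyphens only.
const defaultSeparator = /[\s\-]+/

// Assumed extension: also split on "(" and ")".
const extendedSeparator = /[\s\-()]+/

// Rough stand-in for tokenization (hypothetical helper, not lunr's tokenizer):
// lowercase the text, split on the separator, drop empty tokens.
function roughTokens(text, separator) {
  return text.toLowerCase().split(separator).filter(t => t.length > 0)
}

const defaultTokens = roughTokens('someFunctionName(param)', defaultSeparator)
const extendedTokens = roughTokens('someFunctionName(param)', extendedSeparator)

console.log(defaultTokens)  // one token that still contains the parentheses
console.log(extendedTokens) // function name and parameter as separate tokens

// With lunr loaded, the setting itself would be applied before building the index:
// lunr.tokenizer.separator = extendedSeparator
```

Because the separator is used when documents are tokenized, the index would need to be rebuilt after changing it.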

StarfallProjects commented 4 years ago

No, I think you're understanding - that is a wayyyy more elegant way of doing it (I am new to lunr.js, my apologies)

I'm still curious about the wildcard behaviour - it looks inconsistent but I may well be missing something?

hoelzro commented 4 years ago

Regarding the wildcard behavior you saw, I think that stemming is indeed the culprit here - if you inspect the index object in the fiddle, there are three terms in the inverted index: "basic", "even", and "sourc". That's why looking for "sourcing" works - it also gets stemmed down to "sourc", and a match is found. It's also why looking for "sourc*" works - it's a wildcard search so it doesn't get stemmed, but since there is a term with zero or more characters after the "sourc" prefix, you get a match. I'm not sure why "sourcing" would be showing up in your index.json if you're using the default pipeline, though.
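
The matching rules described above can be sketched without lunr at all. The snippet below is a toy model, not lunr's implementation: indexedTerms holds the stemmed terms reported for the fiddle, and toyStem is a hypothetical stand-in that only strips a trailing "ing" (the real stemmer does much more). Terms ending in * skip stemming and are prefix-matched; plain terms are stemmed and matched exactly.

```javascript
// Stemmed terms as reported for the fiddle's inverted index.
const indexedTerms = ['basic', 'even', 'sourc']

// Hypothetical stand-in for the stemmer: strips a trailing "ing" only.
function toyStem(term) {
  return term.replace(/ing$/, '')
}

// Plain terms are stemmed, then matched exactly.
// Trailing-wildcard terms are NOT stemmed; the part before "*" is a prefix.
function toyMatch(term) {
  if (term.endsWith('*')) {
    const prefix = term.slice(0, -1)
    return indexedTerms.filter(t => t.startsWith(prefix))
  }
  const stemmed = toyStem(term)
  return indexedTerms.filter(t => t === stemmed)
}

console.log(toyMatch('sourcing'))  // stemmed to "sourc", so it matches
console.log(toyMatch('sourc*'))    // prefix "sourc" matches the stored term
console.log(toyMatch('sourcing*')) // no stored term starts with "sourcing"
```

In this model "sourcing*" fails because the unstemmed prefix is longer than the stemmed term that was stored at index time.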

BornToDoStuff commented 3 years ago

I am getting a similar issue: I have the word "modification" in my tags, and the stemmer is shortening it to "modif" (I assume for things like "modify"). The problem is that searching "modification" will not turn up anything; only searching "modif" does. This seems to apply only to multi-word tags.

akvadrako commented 2 years ago

Is it possible to have prefix-matching for all terms work with stemming?