Searching for content containing punctuation

eskwayrd commented 6 years ago

Hi,

I'm searching programming documentation and, as you can imagine, much of the indexed content contains punctuation. Here's an example:

No resource found that matches the given name: attr 'android:keyboardNavigationCluster'

In this example, there are several issues:

Searching for Navigation returns no results.
Searching for *Navigation* returns results.
Searching for android: results in a QueryParseError exception "unrecognized field 'android'..."
Searching for 'android* returns the correct result.

I understand why the first query fails, and why the second would be necessary.

For the third query, I can catch the exception in my calling code, but how would I form a replacement query that performs the search without checking field names? I can confirm that the colon makes it into the index.

Would I need to write my own plugin so that my readers don't have to know about built-in punctuation (4th issue above)?

eskwayrd commented 6 years ago

I got it figured out, with the help of other issue responses. When I use query() instead of search(), field-specific querying appears to be disabled, and I can include wildcard results automatically.

It does look like I'll need to disable stemming though. My documentation talks about deployments quite a bit. If I search for deploy, I get 34 results. If I search for deployme, I get 7 results, which seems quite counter-intuitive.

olivernn commented 6 years ago

Searching for Navigation returns no results.

This is because the token that was indexed is 'android:keyboardNavigationCluster' and Lunr no longer adds any automatic wildcards to the search string. That said, you could try and make Lunr understand camel cased words and get better results. You could do this by creating a pipeline function to split 'android:keyboardNavigationCluster' into the tokens 'android', 'keyboard', 'navigation', 'cluster' and 'android:keyboardNavigationCluster'. I think this will lead to much better results for the kind of documents you have.
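A minimal sketch of the splitting step described above, for illustration only. The helper names `splitIdentifier` and `expandForIndexing` are hypothetical and not part of Lunr. One assumption to check against your Lunr version: as far as I know, the default tokenizer lower-cases text before any pipeline function sees it, which would erase the camel-case boundaries, so this sketch expands the field text before it is added to the index rather than inside the pipeline.

```javascript
// Split an identifier such as "android:keyboardNavigationCluster" into
// lower-cased parts, keeping the original string as the last entry.
function splitIdentifier(str) {
  const parts = str
    .replace(/([a-z0-9])([A-Z])/g, '$1 $2') // mark camelCase boundaries
    .split(/[^A-Za-z0-9]+/)                 // break at punctuation and spaces
    .filter(Boolean)
    .map((part) => part.toLowerCase());
  // Words that do not split into several pieces are returned unchanged.
  return parts.length > 1 ? [...parts, str] : [str];
}

// Expand every whitespace-separated word of a field before indexing it.
function expandForIndexing(text) {
  return text.split(/\s+/).filter(Boolean).flatMap(splitIdentifier).join(' ');
}
```

The expanded text could then be passed as the field value when adding each document, so that both the parts and the full identifier end up in the index.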

Searching for *Navigation* returns results.

Good!

Searching for android: results in a QueryParseError exception "unrecognized field 'android'..."

Yeah, when using the search method, the string you pass is actually a query string, which gives special meaning to some characters; in your case you found what Lunr expects to be a field-based search. You can always escape these special characters with a backslash, so that search would be android\:keyboardNavigationCluster (in a JavaScript string literal the backslash itself must be doubled: "android\\:keyboardNavigationCluster").
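If readers type free text into a search box, the escaping could be done for them before calling search. The following is a sketch under assumptions: `escapeLunrQuery` is a hypothetical helper, and the set of characters it escapes (`:`, `*`, `^`, `~`, `+`, `-`) is my reading of which characters the query string syntax treats specially, so it should be checked against the Lunr version in use.

```javascript
// Prefix each character that the query string syntax may treat specially
// with a backslash so it is matched literally.
// Assumption: the special characters are : * ^ ~ + and -
function escapeLunrQuery(input) {
  return input.replace(/[:*^~+-]/g, '\\$&');
}

// Example: the value below contains a single backslash before the colon.
const escaped = escapeLunrQuery('android:keyboardNavigationCluster');
```

Note that escaping `*` also turns off user-typed wildcards, which may or may not be what you want.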

Searching for 'android* returns the correct result

Good!

It does look like I'll need to disable stemming though. My documentation talks about deployments quite a bit. If I search for deploy, I get 34 results. If I search for deployme, I get 7 results, which seems quite counter-intuitive.

Yes, stemming can sometimes get in the way of searching, especially when searching while typing. Since you are already using query, might I suggest the following:

idx.query(function (q) {
  q.term(term, { boost: 100 }) // exact match
  q.term(term, { usePipeline: false, wildcard: lunr.query.wildcard.TRAILING, boost: 10 }) // prefix match, no stemmer
  q.term(term, { usePipeline: false, editDistance: 1, boost: 1 }) // fuzzy matching
})

This tries an exact match (using the stemmer in the pipeline), a prefix match without the stemmer, and a fuzzy match, also without the stemmer. A different boost is applied to each clause so that exact matches score higher than partial matches.
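For input with several words, the same three clauses could be added once per word. This is only a sketch: `searchAsYouType` is a hypothetical wrapper, `lunr` and `idx` are passed in rather than assumed to be globals, and the lower-casing step assumes the index was built with the default tokenizer, which to my knowledge lower-cases indexed text while terms given to `q.term` are used as written.

```javascript
function searchAsYouType(lunr, idx, input) {
  // Lower-case and split the raw input ourselves (see assumption above).
  const terms = input.toLowerCase().split(/\s+/).filter(Boolean);
  if (terms.length === 0) return [];

  return idx.query((q) => {
    for (const term of terms) {
      q.term(term, { boost: 100 }); // exact match, through the search pipeline
      q.term(term, {
        usePipeline: false,
        wildcard: lunr.query.wildcard.TRAILING,
        boost: 10,
      }); // prefix match, no stemmer
      q.term(term, { usePipeline: false, editDistance: 1, boost: 1 }); // fuzzy match
    }
  });
}
```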

eskwayrd commented 6 years ago

I went with your last suggestion (it is almost identical to the examples in other discussions). However, disabling stemming during the query is not so effective, since the index contains stemmed words; the indirect effects of stemming still exist.

I've disabled stemming during indexing, plus the stop word filter (because many code samples in my content include "and" and "or"). Now my search results are much closer to what I and my readers expect. I'm experimenting with various editDistance settings and boosts, but my searches are in a much happier place now.
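For anyone landing here later, a setup along these lines might look roughly like the sketch below. It is not the exact code from my project: `buildUnstemmedIndex`, the `id` and `body` field names, and the shape of `docs` are placeholders, and `lunr` is passed in so the sketch makes no assumption about how the library is loaded.

```javascript
// docs: hypothetical array of objects such as { id: 'page-1', body: '...' }
function buildUnstemmedIndex(lunr, docs) {
  return lunr(function () {
    this.ref('id');
    this.field('body');

    // Index words as written: no stemming, and keep "and" / "or".
    this.pipeline.remove(lunr.stemmer);
    this.pipeline.remove(lunr.stopWordFilter);

    // Keep search terms consistent with the unstemmed index.
    this.searchPipeline.remove(lunr.stemmer);

    docs.forEach((doc) => this.add(doc));
  });
}
```

Removing the stemmer from the search pipeline as well means a plain search() call no longer stems the query, which keeps it aligned with what is stored in the index.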

Thanks!

olivernn / lunr.js

Searching for content containing punctuation #301