olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License

lunr.js search algorithm logic issues and questions #350

Closed ganeshkbhat closed 5 years ago

ganeshkbhat commented 6 years ago

Where can I find the scoring algorithm of the lunr.js search in the documentation? I am using lunr search for multiple keywords, say (dummy keywords with real results) test myproject angular, and the results are below. What could be causing such scoring (actual scoring for the keywords below)? Is there somewhere I can see how the weights work, if any?

ref /aaa
Search Match - %: 0.16793244647109937, Keys (metadataWhitelist position Array count): [ test: 2, myproject: 3, angular: 1 ]
ref /bbb
Search Match - %: 0.11552603800057194, Keys (metadataWhitelist position Array count): [ build: 10, angular: 8 ]
ref /ccc
Search Match - %: 0.10789816439429221, Keys (metadataWhitelist position Array count): [ test: 11, myproject: 2, angular: 1 ]
ref /ddd
Search Match - %: 0.0548861249498114, Keys (metadataWhitelist position Array count): [ test: 6 ]
ref /eee
Search Match - %: 0.039406345252132866, Keys (metadataWhitelist position Array count): [ test: 1 ]
ref /fff
Search Match - %: 0.03666967069118411, Keys (metadataWhitelist position Array count): [ angular: 4 ]
ref /ggg
Search Match - %: 0.027083186877048138, Keys (metadataWhitelist position Array count): [ test: 1 ]

hoelzro commented 6 years ago

Hi @ganeshkbhat! I don't know if the algorithm is documented as such, but it's a TF-IDF function normalized by field length. Here's what the scoring function looks like in the source code: https://github.com/olivernn/lunr.js/blob/fb70cb4e193722aae2f7e68958428682cef29df6/lib/builder.js#L248
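
For readers following along, here is a rough paraphrase (not the actual lunr source) of the per-term score that line computes: an IDF weight times a saturating, field-length-normalised term frequency, scaled by any boosts. The function name and parameters are illustrative only; as the thread clarifies further down, this is essentially the BM25 shape.

```javascript
// Illustrative sketch only, not lunr's code: one term's contribution to a
// document's score. idf reflects how rare the term is across the index,
// tf how often it appears in this field, and k1/b are the tuning knobs
// discussed below.
function termScore (tf, idf, fieldLength, avgFieldLength, k1, b, boost) {
  var lengthNorm = 1 - b + b * (fieldLength / avgFieldLength)
  return idf * ((tf * (k1 + 1)) / (tf + k1 * lengthNorm)) * boost
}
```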

ganeshkbhat commented 6 years ago

Ok, awesome. Thank you very much for taking the time to respond.

I changed the saturation value to a higher score (say _b = 10), and the results seem ordered correctly as per the calculation priorities (in brackets) here: test (1) myproject (2) angular (3), or test (1) myproject (2). This seems okay for sequentially important words.

It seems to me that the search is not ordering based on the _b and _k1 scores so as to rank on totalMatchesIsScored as per priorities like test (1) myproject (1) angular (1), or test (1) myproject (1). Am I missing something?

First, the issue is: what if my search string test myproject angular is important in reverse order, i.e. test (weight 3) myproject (2) angular (1)? Or test (weight 1) myproject (1) angular (1), or test (weight 1) myproject (2) angular (1)? What are the recommended defaults for _k1 (normalization) and _b (saturation) for these two use cases?

Referencing issues https://github.com/olivernn/lunr.js/issues/33 and https://github.com/olivernn/lunr.js/issues/642: wildcard matches, full-word matches, and ordering based on priorities will always clash.

I have a feature request here. Can you (@hoelzro) or @olivernn consider adding the flexibility for the developer to assign a weight to each keyword or wildcard? That would be better than disabling wildcard scoring to work around those issues, or searching for each keyword separately in different search calls; the latter forces multiple iterations of the search function per keyword for anyone who wants weighted checks per keyword, which is a crude way to search and is susceptible to scoring errors, especially when scoring combinations of keywords across documents.

hoelzro commented 6 years ago

@ganeshkbhat Let's take a quick step back - what is it you're trying to do on a higher level? From how I understand it (and please correct me if I'm wrong), you're trying to assign weights to individual terms in the search string, potentially based on each term's position in the search string.

One very important thing that factors into the calculation above is TF-IDF - essentially, this means that matching terms that occur more frequently in a document will raise that document's score for a search, while the frequency of that term across the entire set of documents will lower its contribution. A great example of this (if it weren't a stop word) is 'the' - it's likely very frequent in a matching document, but it's also very frequent in the whole document set, which lowers its contribution to the score as a whole. I don't know much about your particular set of documents, but it could be that, say, 'test' occurs a lot in your document set, which would lower its contribution to the score.

If you want to increase the contribution of an individual term to the scoring calculation, you can use boosts (https://lunrjs.com/guides/searching.html#boosts). In my experience, however, boosts are really tricky to get right!
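
As a concrete illustration (the weights below are arbitrary examples, not recommendations), query-time boosts from that guide look like this:

```javascript
// Boost individual terms with the ^N syntax from the searching guide;
// the weights here are just examples.
idx.search('test^3 myproject^2 angular')

// Equivalent form using the query builder interface:
idx.query(function (q) {
  q.term('test', { boost: 3 })
  q.term('myproject', { boost: 2 })
  q.term('angular')
})
```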

ganeshkbhat commented 6 years ago

@hoelzro Ah... How did I miss boosts? That might solve part of the problem if not all.

Is lunr search driven by lowering scores when a term is used consistently across the document set? Is that how it is supposed to work for keywords (other than stop words)? The pitfall of calculating this way without defining the stop words is that frequently occurring keywords are effectively treated like stop words, reducing relevant rankings. Could you consider accepting a stop-word array as an argument to the search function, and provide a sample stop-word array? That would reduce the chance of wrongly lowering the scores of non-stop-word keywords. I believe this bias will impact boosts as well. Or is there a way to disable the score reduction when a keyword occurs very frequently in a document, using a search function argument? Did I miss something? If not, can this be taken up?

I did see the idf and tf calculations, but I was unable to go through every line of the source to understand how idf is assigned. Maybe I will look at that tomorrow.

hoelzro commented 6 years ago

@ganeshkbhat Yes, if a term occurs consistently throughout a document set, that will lower its impact on search relevance scores. If a term is used consistently throughout a single document, though, that will actually increase its impact. For example, let's say you have a set of 100 documents about software testing. The token "test" will probably be in most documents, which means "test" will have a minimal impact on the score; "unit", however, may only occur in a few documents, and if it's particularly prevalent in one document, that will increase that document's relevance for queries containing the token "unit". Keep in mind that a frequent term (like "test" from our example) won't decrease a score - it will just contribute a small positive component to the final score.
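
To make the document-frequency effect concrete, here is an illustration using a BM25-style inverse document frequency, which is similar in spirit to what lunr computes (the exact formula in lunr may differ slightly):

```javascript
// Illustration only: BM25-style idf. N = total documents in the index,
// n = documents containing the term.
function idf (N, n) {
  return Math.log(1 + (N - n + 0.5) / (n + 0.5))
}

idf(100, 90) // 'test' appears in 90 of 100 docs -> ~0.11 (small but positive)
idf(100, 5)  // 'unit' appears in 5 of 100 docs  -> ~2.9  (much larger weight)
```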

You can absolutely try using a custom stop word list - you'll need to create a filter using lunr.generateStopWordFilter, and replace the existing stopword filter in the pipeline with your custom one. If you need help with this, please let me know! As far as whether or not this is the "right" thing to do, my advice is to try with and without; search can be somewhat of an art!
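
Roughly, that might be wired up like this (an untested sketch; the word list and index fields are placeholders, while lunr.generateStopWordFilter, lunr.Pipeline.registerFunction and the pipeline methods are part of the lunr 2.x API):

```javascript
// Sketch only: build a custom stop word filter and swap it in for the
// default one in the indexing pipeline.
var myStopWordFilter = lunr.generateStopWordFilter([
  'the', 'a', 'an', 'and' // placeholder list; use whatever suits your corpus
])

// Registering the function keeps index serialisation happy.
lunr.Pipeline.registerFunction(myStopWordFilter, 'myStopWordFilter')

var documents = [ /* your documents, each with id and body fields */ ]

var idx = lunr(function () {
  this.ref('id')
  this.field('body')

  // Remove the built-in filter and put ours in the same position,
  // i.e. before the stemmer runs.
  this.pipeline.remove(lunr.stopWordFilter)
  this.pipeline.before(lunr.stemmer, myStopWordFilter)

  documents.forEach(function (doc) { this.add(doc) }, this)
})
```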

There's another scoring equation that's apparently better than tf-idf for factoring in high-frequency words - it's called Okapi BM25. It's more complicated than tf-idf, but I would be interested to see how it influences lunr!

If you're curious about how tf-idf works, the Wikipedia page is pretty informative: https://en.wikipedia.org/wiki/Tf–idf . That page might explain the whole "relevance" thing better than I can!

hoelzro commented 6 years ago

@olivernn BTW, is there a place where the scoring logic is documented that I missed? If not, where do you think it would best fit in the docs? This might be useful to other lunr users - maybe I could summarize this discussion and add it!

olivernn commented 6 years ago

@hoelzro I don't think the scoring logic is currently documented anywhere meaningful; adding a guide would be an excellent addition. If you get time, please do open a PR on the site; I'm happy to help fill in any missing details, etc.

Lunr 2.x uses an implementation of BM25F; the snippet of source you highlighted previously should match the equation on that page. The difference between BM25 and TF-IDF is fairly subtle: BM25 adds the tuning parameters b and k1.

@ganeshkbhat A common term, i.e. one that appears in every document in the index, is likely not a great search term. It is not useful for distinguishing between relevant and non-relevant documents. In both TF-IDF and BM25 penalising very common terms is achieved by the IDF (inverse document frequency) part of the scoring process. In both TF-IDF and BM25 the IDF component of the score is usually good enough that stop words (e.g. 'the') are not required to be removed. Lunr has to work in a more constrained environment than most search engines and so removes stop words to reduce the size of the index.
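
For reference, the classic BM25 scoring equation has the following shape; lunr's BM25F variant additionally weights per-field contributions, but the structure is the same (the notation here is the textbook one, not lunr's variable names):

$$\mathrm{score}(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t) \cdot \frac{f(t, D)\,(k_1 + 1)}{f(t, D) + k_1\left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}$$

where $f(t, D)$ is the frequency of term $t$ in document $D$, $|D|$ is the document length, and $\mathrm{avgdl}$ is the average document length across the index.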

The pitfall of calculating this way without defining the stop words is that frequently occurring keywords are effectively treated like stop words, reducing relevant rankings

A very common word will still be used for scoring as long as it is not filtered out by the stop word filter. So, if your query only contains that common word, you will still get ranked documents returned. If you combine a common word with a non-common word, then the score contributed by the non-common word is likely to dominate the resulting scores. I think this is the right behaviour.

You could try to use query-time boosts to overcome this, but at that point you are trying to override a fairly fundamental part of the search algorithm.

Or is there a way to disable the score reduction when a keyword occurs very frequently in a document, using a search function argument?

It might not do exactly what you want, but you can adjust the value of k1. Increasing its value will lessen term saturation, allowing common terms in a document to influence the document's score more. There is some detail in the guides and a good explanation from Elastic.
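
For anyone following along, the tuning happens on the index builder; a minimal sketch (untested, with arbitrary example values, placeholder fields, and documents as the array being indexed) looks like this:

```javascript
// Sketch only: adjust the BM25 tuning parameters when building the index.
// k1 controls term-frequency saturation; b controls field-length
// normalisation. lunr's defaults are 1.2 and 0.75 respectively.
var idx = lunr(function () {
  this.ref('id')
  this.field('body')

  this.k1(1.6) // higher: repeated terms keep contributing for longer
  this.b(0.5)  // lower: penalise long documents less aggressively

  documents.forEach(function (doc) { this.add(doc) }, this)
})
```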

hoelzro commented 6 years ago

@olivernn Ah, shows what I know! If I find the time, which section should I add it to? searching.html.md, I'm thinking?

ganeshkbhat commented 6 years ago

@hoelzro I definitely did not notice the lunr.generateStopWordFilter function and the set of stop words in the source. I will check it today and run a few tests. I was assuming a term would be treated as a stop word if it occurred consistently across the set of documents, irrespective of whether it was pre-defined or not.

@olivernn

A common term, i.e. one that appears in every document in the index, is likely not a great search term. It is not useful for distinguishing between relevant and non-relevant documents.

I understand. It makes more sense that a term occurring rarely across the set of documents but frequently within a single document should score highly.

  1. However, I see a skew in the results when a keyword is found more within a single document than across the set of documents. This is irrespective of adding or removing stop words, given that the stop words are predefined. Maybe my dataset is small. The data set is here and a sample set of keywords is angular build offline, if you wish to take a quick look whenever you have time. Meanwhile, let me try a bigger set of documents and see if the skew persists. If the issue remains, I will come back to you with the bigger data set and the keywords, in case you want to check alternatives or provide suggestions.

  2. What are the recommended normalisation and saturation values to have equal weights across the keywords in a phrase? Higher and lower normalisation values skew the scoring towards opposite ends of the phrase. I am unable to get the following pattern right (I have tried normalisation values from -10 to +10 and saturation values from 0 to 1):

Keywords (keywords changed, values real): test (weight 1) myproject (weight 1) angular (weight 1).

The results I am getting with the library's default saturation/normalization values for the above keywords are:

ref /aaa
Search Match - %: 0.16793244647109937, Keys (metadataWhitelist position Array count): [ test: 2, myproject: 3, angular: 1 ]
ref /bbb
Search Match - %: 0.11552603800057194, Keys (metadataWhitelist position Array count): [ myproject: 10, angular: 8 ]
ref /ccc
Search Match - %: 0.10789816439429221, Keys (metadataWhitelist position Array count): [ test: 11, myproject: 2, angular: 1 ]
ref /ddd
Search Match - %: 0.0548861249498114, Keys (metadataWhitelist position Array count): [ test: 6 ]
ref /eee
Search Match - %: 0.039406345252132866, Keys (metadataWhitelist position Array count): [ test: 1 ]
ref /fff
Search Match - %: 0.03666967069118411, Keys (metadataWhitelist position Array count): [ angular: 4 ]
ref /ggg
Search Match - %: 0.027083186877048138, Keys (metadataWhitelist position Array count): [ test: 1 ]

Some sections of the results make sense given the way scoring happens, but some results are skewed. As per frequency-based scoring, this is what should have happened:

olivernn commented 6 years ago

However, I see a skew in the results when a keyword is found more within a single document than across the set of documents

What do you mean? Are you saying that a document that has fewer occurrences of a keyword is being ranked higher than a document with more of those keywords? Remember that the size of the document is taken into account in scoring. A document with two words, one of them being your keyword, is more relevant to that keyword than a document with a thousand words, two of them being the keyword.
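
To put rough numbers on that (illustration only, using the classic BM25 term-frequency component rather than lunr's exact code; the average length, k1 and b values are assumed):

```javascript
// Illustration only: length normalisation means a short document with one
// occurrence can out-score a long document with two.
function tfComponent (tf, docLen, avgLen, k1, b) {
  return (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * (docLen / avgLen)))
}

tfComponent(1, 2, 500, 1.2, 0.75)    // two-word doc, one match        -> ~1.69
tfComponent(2, 1000, 500, 1.2, 0.75) // thousand-word doc, two matches -> ~1.07
```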

What are the recommended normalisation and saturation values to have equal weights across the keywords in a phrase

Do you mean the tuning parameters b and k1? If so, the default values should work for most indexes. I would say that tweaking these values is an advanced step that requires a good understanding of both how lunr scores search results and the contents of the documents being indexed.

If you could provide an example on JSFiddle (or similar) I can take a look; unfortunately, I don't have the time to do a deep dive on specific data sets, especially if I have to set up the index, etc.

ganeshkbhat commented 6 years ago

What do you mean? Are you saying that a document that has fewer occurrences of a keyword is being ranked higher than a document with more of those keywords? Yes.

Do you mean the tuning parameters b and k1? Yes.

My question: what are the recommended normalisation and saturation values to have equal weights across the keywords in a phrase, e.g. test (weight 1) myproject (weight 1) angular (weight 1)? I have not been able to get it right.