olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License

Feature request: support for case-sensitive and case-insensitive search #331

Open giuliac89 opened 6 years ago

giuliac89 commented 6 years ago

Hi Oliver, do you plan to add this feature?

hoelzro commented 6 years ago

@giuliac89 FWIW, you could add this feature in current lunr.js by tweaking the pipeline. I believe the forced lowercasing currently happens in the tokenizer.

olivernn commented 6 years ago

@hoelzro is right: the current downcasing happens inside lunr.tokenizer. Unfortunately, this means you would need to re-implement the tokenizer just to change that one part.

Do you have a specific use case in mind? How does the current behaviour fall short?
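
For readers who want to try the re-implementation route, here is a minimal sketch of a tokenizer that splits on whitespace and hyphens but leaves the case untouched. The names `makeCasePreservingTokenizer` and `PlainToken` are hypothetical and not part of lunr. The token constructor is passed in so the sketch runs on its own. With lunr 2.x loaded you would, as far as I know, pass `lunr.Token` and assign the result to `this.tokenizer` inside the `lunr(function () { ... })` configuration, but treat that wiring as an assumption to check against the lunr docs.

```javascript
// Sketch only: a tokenizer that keeps each token's original case.
// `Token` is injected; the assumption is that lunr.Token(str, metadata)
// can be passed here when lunr is available.
function makeCasePreservingTokenizer(Token, separator = /[\s\-]+/) {
  return function tokenize(input, metadata = {}) {
    if (input == null) return [];
    const str = String(input); // note: no toLowerCase() here
    const tokens = [];
    let start = 0;
    for (let end = 0; end <= str.length; end++) {
      const atEnd = end === str.length;
      if (atEnd || separator.test(str.charAt(end))) {
        if (end > start) {
          tokens.push(new Token(str.slice(start, end), {
            ...metadata,
            position: [start, end - start], // [offset, length]
            index: tokens.length,
          }));
        }
        start = end + 1;
      }
    }
    return tokens;
  };
}

// Hypothetical stand-in for lunr.Token so the sketch runs without lunr.
class PlainToken {
  constructor(str, metadata) {
    this.str = str;
    this.metadata = metadata;
  }
  toString() {
    return this.str;
  }
}

const tokenize = makeCasePreservingTokenizer(PlainToken);
const tokens = tokenize("FOO bar-Baz");
console.log(tokens.map((t) => t.str)); // → [ 'FOO', 'bar', 'Baz' ]
```

This is an all-or-nothing change: every indexed term keeps its case, so "FOO" and "foo" become different terms, and any later pipeline step that expects lowercase input may behave differently.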

giuliac89 commented 6 years ago

I'm implementing a search engine for a research project related to philological editions. http://evt.labcd.unipi.it/

This functionality is important because it would allow more detailed philological studies to be carried out on these editions.

olivernn commented 6 years ago

So, in your case, a term, say "FOO", has a different meaning than the downcased term "foo"?

As well as lunr.tokenizer, the query parser also downcases terms. This only affects lunr.Index#search, not lunr.Index#query:

// won't work, gets converted to "foo"
idx.search("FOO")

// will work, no further processing of the terms done
idx.query(function (q) {
  q.term("FOO")
})

giuliac89 commented 6 years ago

Yes, the difference between the term "FOO" and the term "foo" could be fundamental for some research studies, and this is why I would like to include this feature in my search engine. So the only thing I can do is re-implement the tokenizer.

Do you think that this feature could be interesting for lunr.js?

indolering commented 5 years ago

> Do you think that this feature could be interesting for lunr.js?

To be honest, it seems pretty niche. It wouldn't be hard to implement as an all-or-nothing feature of the index (just add it as a config option), but how would you support query-time case-sensitivity without blowing up the index size? I think it's important to remember that Lunr is primarily for static websites, and size is a big deal.
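
To make the size concern concrete, here is a rough sketch that compares the number of distinct terms with and without lowercasing for a piece of text. The function name `vocabularySizes` is made up for illustration, and the vocabulary count is only a proxy: the real index size also depends on postings and metadata.

```javascript
// Sketch only: count distinct terms with and without lowercasing.
// Splits on whitespace and hyphens; this is an assumption about how
// the text is tokenized, not a measurement of a real lunr index.
function vocabularySizes(text) {
  const words = text.split(/[\s\-]+/).filter(Boolean);
  return {
    caseSensitive: new Set(words).size,
    caseInsensitive: new Set(words.map((w) => w.toLowerCase())).size,
  };
}

const sizes = vocabularySizes("In the index the term In and the term in differ");
console.log(sizes); // → { caseSensitive: 7, caseInsensitive: 6 }
```

An index that answers both kinds of query by storing both forms would carry the lowercased vocabulary plus every term whose original casing differs from it.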

giuliac89 commented 5 years ago

Well, I tried developing the feature in my web app, and the index size is not a big problem in this case! For a document of about 1460 words, the index size (including two types of metadata) is about 121kb without the case-sensitive feature and about 158kb with it. Indexing is still done only in "lowercase mode". To handle case-sensitivity I simply wrote a custom tokenizer, in which I create a lunr token like this:

// token is the lowercased string; originalToken keeps the source casing
new lunr.Token(token, {
  position: [startIndex, tokenLength],
  index: tokens.length,
  originalToken: originalToken
});

So I register the "original token" as metadata:

0: lunr.Token {
   str: "in",
   metadata: {
      index: 0
      originalToken: "In"
      position: (2) [0, 2]
   }
}

This way it is simple to check case-sensitivity without considerably increasing the index size.
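
The approach described above could be sketched roughly as follows: index lowercased terms, record the original casing as token metadata, and filter results after the search when a case-sensitive match is wanted. All names here (`makeCaseRecordingTokenizer`, `filterCaseSensitive`, `PlainToken`, `sampleResults`) are hypothetical. The result shape `matchData.metadata[term][field][key]` is my understanding of lunr 2.x and should be checked, and for lunr to keep `originalToken` at all it presumably has to be added to the builder's `metadataWhitelist`.

```javascript
// Sketch only: lowercase the indexed term, keep the original in metadata.
// `Token` is injected; the assumption is that lunr.Token fits here.
function makeCaseRecordingTokenizer(Token) {
  return function tokenize(input, metadata = {}) {
    if (input == null) return [];
    const tokens = [];
    // Each match is a run of characters that are not whitespace or hyphens.
    for (const match of String(input).matchAll(/[^\s\-]+/g)) {
      const original = match[0];
      tokens.push(new Token(original.toLowerCase(), {
        ...metadata,
        position: [match.index, original.length],
        index: tokens.length,
        originalToken: original,
      }));
    }
    return tokens;
  };
}

// Keep only results where some matched term was originally written
// exactly as `exactTerm`. Assumes the lunr-like result shape noted above.
function filterCaseSensitive(results, exactTerm) {
  return results.filter((result) =>
    Object.values(result.matchData.metadata).some((fields) =>
      Object.values(fields).some((meta) =>
        (meta.originalToken || []).includes(exactTerm))));
}

// Hypothetical stand-in for lunr.Token so the sketch runs without lunr.
class PlainToken {
  constructor(str, metadata) {
    this.str = str;
    this.metadata = metadata;
  }
}

const recorded = makeCaseRecordingTokenizer(PlainToken)("In principio in");

// Hand-built results mimicking what a search for "in" might return.
const sampleResults = [
  { ref: "a", matchData: { metadata: { in: { body: { originalToken: ["In", "in"] } } } } },
  { ref: "b", matchData: { metadata: { in: { body: { originalToken: ["in"] } } } } },
];
const exact = filterCaseSensitive(sampleResults, "In");
console.log(exact.map((r) => r.ref)); // → [ 'a' ]
```

The filter is deliberately coarse: it accepts a result if any matched term has the requested casing, which may be too permissive for multi-term queries.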

indolering commented 5 years ago

Submit a patch!

olivernn commented 5 years ago

@giuliac89 @indolering this seems like a good candidate for a plugin. If so, we could add it to the new list of plugins on the wiki and the website. If someone does the work to package this up, I'm more than happy to feature it.