Hi @Alxandr, good point. The default tokenizer indeed removes single-character terms, and I am considering not doing this by default in the next major release. There is some documentation about the default tokenizer and how to change it here, but your issue made me realize that this information is not easy to find.
In order to configure a tokenizer identical to the default one, but allowing single character terms, you can do this:
const miniSearch = new MiniSearch({
tokenize: (string) => string.split(/[^a-zA-Z0-9\u00C0-\u017F]+/)
})
If you are wondering what the \u00C0-\u017F part of the RegExp is doing, it is basically there to treat diacritics like umlauts and accents as word characters.
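For example, a quick check (the sample input is my own illustration, not from the original comment) shows accented characters staying inside their terms:

const tokenize = (string) => string.split(/[^a-zA-Z0-9\u00C0-\u017F]+/)
tokenize('Über schön café') // => ['Über', 'schön', 'café']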
Thanks for considering MiniSearch! I will work on improving the documentation, especially on discoverability.
Also, at the moment the same tokenizer is used for both document indexing and search. I take your suggestion that it could be useful to differentiate between them. I need to think about a proper way to do that, because in general using different tokenizers for indexing and search could easily cause unexpected behavior, but your case is an important one.
@lucaong I think you misunderstood me a bit. I do want to remove single chars (I already use a custom tokenizer) for the indexing phase, but I do not want to do so when searching. The tokenizer is called both by add and by search, and as far as I can tell from the source, there is no way of distinguishing between the calls.
I suggest making the tokenizer call something like this:
tokenize(string, { phase: 'index' })
tokenize(string, { phase: 'search' })
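A rough sketch of how a tokenizer could use such a flag (the phase option is just the proposal above, not an existing MiniSearch API, and the splitting logic is illustrative):

const tokenize = (string, { phase } = {}) => {
  const terms = string.split(/\W+/).filter(Boolean)
  // Drop single-character terms only while indexing; keep them for search
  return phase === 'index' ? terms.filter((term) => term.length > 1) : terms
}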
Another reason I need to differentiate is that we have a field containing ids that look like this: AB0024. However, it's important that the leading 0s can be ignored, so when tokenizing I have a regex that checks for this exact pattern and, if it matches, emits the following tokens: ['AB24', 'AB024', 'AB0024']. I have no need for this logic when searching, however, because all the tokens are already indexed.
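Roughly like this (a reconstruction of the behavior described above; the pattern and helper name are made up for illustration):

const expandId = (term) => {
  const match = term.match(/^([A-Z]+)(0*)([1-9]\d*)$/)
  if (!match) return [term]
  const [, prefix, zeros, digits] = match
  const variants = []
  // Emit the id once for every possible number of leading zeros
  for (let i = 0; i <= zeros.length; i++) {
    variants.push(prefix + '0'.repeat(i) + digits)
  }
  return variants
}
expandId('AB0024') // => ['AB24', 'AB024', 'AB0024']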
Yes, I understand better now. You are right, sometimes the tokenization should differentiate between search and indexing. I am a bit concerned that people would use different tokenizers, and thus get inconsistent results, but then again I assume that people who would use this configuration would know what they are doing.
A few alternative ideas that I have:
const miniSearch = new MiniSearch({
tokenize: (string) => string.split(/[^a-zA-Z0-9\u00C0-\u017F]+/),
tokenizeSearch: ... // by default uses `tokenize`
})
const miniSearch = new MiniSearch({
tokenize: someIndexingTokenizer,
searchOptions: {
tokenize: someSearchTokenizer // this would default to the "global" tokenizer
}
})
// Or specify it upon search
const results = miniSearch.search('some query', {
tokenize: someSearchTokenizer
})
I am leaning toward the second option, because it is more consistent with the current options. What do you think?
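For instance, with the second option the use case in this thread could look something like this (an illustrative configuration, not a released API at this point):

const miniSearch = new MiniSearch({
  // Index-time tokenizer: drop single-character terms
  tokenize: (string) => string.split(/\W+/).filter((term) => term.length > 1),
  searchOptions: {
    // Search-time tokenizer: keep single-character terms, so prefix
    // search can return results from the first keystroke
    tokenize: (string) => string.split(/\W+/).filter(Boolean)
  }
})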
Also, there is a processTerm option that could also differentiate between indexing and search. A filterTerm option could be added too, or processTerm could be allowed to return null or undefined to discard a term. I am preparing a new version including this:
const miniSearch = new MiniSearch({
tokenize: ..., // Function used to tokenize terms
processTerm: ..., // Function used to process terms for indexing. Return falsy to discard term
searchOptions: {
tokenize: ..., // search-time tokenizer, defaults to the index-time tokenizer
processTerm: ... // search-time term processing, defaults to the same as index-time
}
})
// Alternatively, specify search options upon search
const results = miniSearch.search('some query', {
tokenize: ...,
processTerm: ...
})
Would this solve your use case? What do you think about it?
That would definitely solve my use case. I would suggest going with a second parameter instead though (for both functions), which would force all tokenizing/processing to go through the same code path initially (just as you said, making it less likely to shoot yourself in the foot). Then in the tokenizer I can do if (isSearch) { do something else } as part of a bigger tokenizer.
My current tokenizer looks something like this:
function tokenize(string) {
  const terms = [];
  if (someTest(string)) {
    addSomeKind(string, terms);
  }
  if (otherTest(string)) {
    addOtherKind(string, terms);
  }
  return terms;
}
I would simply change it to if (isIndexingPhase && someTest(string)), for instance. That being said, it's easy to route two functions into one, and easy to split one function into two (based on a condition), so both ways work.
Hi @Alxandr,
I just released a new minor version, v1.1.0, that adds a few features related to your use case:
1. tokenize and processTerm can now behave differently for indexing and for search (see point 3).
2. The processTerm function can now discard a term by returning a falsy value. This is the recommended way to discard terms (e.g. single-character words or stop words), so in many cases the tokenizer can remain the same between indexing and search.
3. tokenize and processTerm receive the field name as the second argument. This makes it possible to tokenize or process each field differently. At search time, when tokenizing or processing the search query, the second argument is undefined.

I hope this will help you. It surely will enable more use cases thanks to your input!
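Putting these together, a rough sketch of a configuration covering the use case in this thread could look like this (document shape, field names, and option values are illustrative, not from the release notes):

const miniSearch = new MiniSearch({
  fields: ['title'],
  // Keep single-character terms in the tokenizer (the default tokenizer drops them)
  tokenize: (string) => string.split(/[^a-zA-Z0-9\u00C0-\u017F]+/).filter(Boolean),
  // fieldName is the field being indexed, or undefined when processing a search
  // query, so single-character terms are only discarded at indexing time
  processTerm: (term, fieldName) =>
    fieldName !== undefined && term.length < 2 ? null : term.toLowerCase()
})

miniSearch.addAll([{ id: 1, title: 'A quiet place' }])
miniSearch.search('q', { prefix: true }) // prefix search can match 'quiet'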
I'm evaluating MiniSearch for a project where I need to search through a fairly small set of documents (about 10-100 documents, only titles). Prefix searching is quite important in this use case. One issue I don't know how to deal with, though, is that if I have a tokenizer that removes single-character terms (which is fine for tokenizing the documents), then I don't get a result on the first keystroke (which, in this case, I would like to have). Could the search and indexing be allowed to use different tokenizers? Or maybe a second argument to the tokenizer, indicating what kind of tokenization it is currently doing, would be better?