Hi @Alxandr, good point. The default tokenizer indeed removes single-character terms, and I am considering not doing this by default in the next major release. There is some documentation about the default tokenizer and how to change it here, but your issue made me realize that this information is not easy to find.
In order to configure a tokenizer identical to the default one, but allowing single character terms, you can do this:
const miniSearch = new MiniSearch({
tokenize: (string) => string.split(/[^a-zA-Z0-9\u00C0-\u017F]+/)
})
If you are wondering what the \u00C0-\u017F part of the RegExp is doing, it is basically there to treat diacritics like umlauts and accents as word characters.
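For example, a quick check (the sample input is my own illustration, not from the original comment) shows accented characters staying inside their terms:

const tokenize = (string) => string.split(/[^a-zA-Z0-9\u00C0-\u017F]+/)
tokenize('Über schön café') // => ['Über', 'schön', 'café']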
Thanks for considering MiniSearch! I will work on improving the documentation, especially on discoverability.
Also, at the moment the same tokenizer is used for both document indexing and search. I take your suggestion that it could be useful to differentiate between them. I need to think about a proper way to do that, because in general using different tokenizers for indexing and search could easily cause unexpected behavior, but your case is an important one.
@lucaong I think you misunderstood me a bit. I do want to remove single chars (I already use a custom tokenizer) for the indexing phase, but I do not want to do so when searching. The tokenizer is called both by add and by search, and as far as I can tell from the source, there is no way of distinguishing between the calls.
I suggest making the tokenizer call something like this:
tokenize(string, { phase: 'index' })
tokenize(string, { phase: 'search' })
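A rough sketch of how a tokenizer could use such a flag (the phase option is just the proposal above, not an existing MiniSearch API, and the splitting logic is illustrative):

const tokenize = (string, { phase } = {}) => {
  const terms = string.split(/\W+/).filter(Boolean)
  // Drop single-character terms only while indexing; keep them for search
  return phase === 'index' ? terms.filter((term) => term.length > 1) : terms
}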
Another reason I need to differentiate is that we have a field containing ids that look like this: AB0024. However, it's important that the leading 0s can be ignored, so when tokenizing I have a regex that checks for this exact pattern and, if it matches, emits the following tokens: ['AB24', 'AB024', 'AB0024']. I have no need for this logic when searching, however, because all the tokens are already indexed.
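Roughly like this (a reconstruction of the behavior described above; the pattern and helper name are made up for illustration):

const expandId = (term) => {
  const match = term.match(/^([A-Z]+)(0*)([1-9]\d*)$/)
  if (!match) return [term]
  const [, prefix, zeros, digits] = match
  const variants = []
  // Emit the id once for every possible number of leading zeros
  for (let i = 0; i <= zeros.length; i++) {
    variants.push(prefix + '0'.repeat(i) + digits)
  }
  return variants
}
expandId('AB0024') // => ['AB24', 'AB024', 'AB0024']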
Yes, I understand better now. You are right, sometimes the tokenization should differentiate between search and indexing. I am a bit concerned that people would use different tokenizers, and thus get inconsistent results, but then again I assume that people who would use this configuration would know what they are doing.
A few alternative ideas that I have:
const miniSearch = new MiniSearch({
tokenize: (string) => string.split(/[^a-zA-Z0-9\u00C0-\u017F]+/),
tokenizeSearch: ... // by default uses `tokenize`
})
const miniSearch = new MiniSearch({
tokenize: someIndexingTokenizer,
searchOptions: {
tokenize: someSearchTokenizer // this would default to the "global" tokenizer
}
})
// Or specify it upon search
const results = miniSearch.search('some query', {
tokenize: someSearchTokenizer
})
I am leaning toward the second option, because it is more consistent with the current options. What do you think?
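For instance, with the second option the use case in this thread could look something like this (an illustrative configuration, not a released API at this point):

const miniSearch = new MiniSearch({
  // Index-time tokenizer: drop single-character terms
  tokenize: (string) => string.split(/\W+/).filter((term) => term.length > 1),
  searchOptions: {
    // Search-time tokenizer: keep single-character terms, so prefix
    // search can return results from the first keystroke
    tokenize: (string) => string.split(/\W+/).filter(Boolean)
  }
})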
Also, there is a processTerm option that could also differentiate between indexing and search. A filterTerm option could be added too, or processTerm could be allowed to return null or undefined to discard a term. I am preparing a new version including this:
const miniSearch = new MiniSearch({
tokenize: ..., // Function used to tokenize terms
processTerm: ..., // Function used to process terms for indexing. Return falsy to discard term
searchOptions: {
tokenize: ..., // search-time tokenizer, defaults to the index-time tokenizer
processTerm: ... // search-time term processing, defaults to the same as index-time
}
})
// Alternatively, specify search options upon search
const results = miniSearch.search('some query', {
tokenize: ...,
processTerm: ...
})
Would this solve your use case? What do you think about it?
That would definitely solve my use case. I would suggest going with a second parameter instead though (for both functions), which would force all tokenizing/processing to go through the same code path initially (just as you said, making it less likely to shoot yourself in the foot). Then in the tokenizer I can do if (isSearch) { do something else } as part of a bigger tokenizer.
My current tokenizer looks something like this:
function tokenize(string) {
  const terms = [];
  if (someTest(string)) {
    addSomeKind(string, terms);
  }
  if (otherTest(string)) {
    addOtherKind(string, terms);
  }
  return terms;
}
I would simply change it to if (isIndexingPhase && someTest(string)), for instance. That being said, it's easy to route two functions into one, and easy to split one function into two (based on a condition), so both ways work.
Hi @Alxandr,
I just released a new minor version, v1.1.0, that adds a few features related to your use case:
1. tokenize and processTerm can now behave differently for indexing and for search (see point 3).
2. The processTerm function can now discard a term by returning a falsy value. This is the recommended way to discard terms (e.g. single-character words or stop words), so in many cases the tokenizer can remain the same between indexing and search.
3. tokenize and processTerm receive the field name as the second argument. This makes it possible to tokenize or process each field differently. At search time, when tokenizing or processing the search query, the second argument is undefined.

I hope this will help you. It surely will enable more use cases thanks to your input!
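Putting these together, a rough sketch of a configuration covering the use case in this thread could look like this (document shape, field names, and option values are illustrative, not from the release notes):

const miniSearch = new MiniSearch({
  fields: ['title'],
  // Keep single-character terms in the tokenizer (the default tokenizer drops them)
  tokenize: (string) => string.split(/[^a-zA-Z0-9\u00C0-\u017F]+/).filter(Boolean),
  // fieldName is the field being indexed, or undefined when processing a search
  // query, so single-character terms are only discarded at indexing time
  processTerm: (term, fieldName) =>
    fieldName !== undefined && term.length < 2 ? null : term.toLowerCase()
})

miniSearch.addAll([{ id: 1, title: 'A quiet place' }])
miniSearch.search('q', { prefix: true }) // prefix search can match 'quiet'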
I'm evaluating MiniSearch for a project where I need to search through a fairly small set of documents (about 10-100 documents, only titles). Prefix searching is quite important in this use case. One issue I don't know how to deal with, though, is that if I have a tokenizer that removes single-character terms (which is fine for tokenizing the documents), then I don't get a result on the first keystroke (which, in this case, I would like to have). Could the search and indexing be allowed to use different tokenizers? Or maybe a second argument to the tokenizer, indicating what kind of tokenization it is currently doing, would be better?