Closed. saminzadeh closed this issue 3 years ago.
@saminzadeh great idea! Unfortunately ItemsJS is using Lunr v1.0.0, while Lunr is now at v2.3.3. Hopefully Lunr exposes a public function or constructor for changing lunr.tokenizer.separator. Ideally a simple change is possible with v1.0.0; otherwise ItemsJS should be upgraded to the latest Lunr sooner or later. I could hopefully look into it when I find more free time.
Sounds good, might take a stab when I get a chance as well.
For now, I did this and it seems to work since it changes the lunr global instance.
import lunr from 'lunr';
import itemsjs from 'itemsjs';
lunr.tokenizer.separator = /[\s\-[\]:]+/g;
// itemsjs init here
const index = itemsjs(data, configuration);
Nice hack! Integration should be easier than I thought
I am wondering how to easily test separators. I've tested it out with:
var paragraph = 'The quick brown fox jumped over the lazy dog. It barked.';
var regex = /[\s\-[\]:]+/g;
var found = paragraph.split(regex);
console.log(found);
// ["The", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog.", "It", "barked."]
Testing with Lunr's lunr.tokenizer function seems to be more complicated for verifying the results.
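For a dependency-free check, the separator can be exercised through a minimal tokenizer sketch. An assumption here: this mirrors only the lowercase-and-split behaviour of lunr.tokenizer, without the lunr.Token metadata the real function attaches, which is exactly what makes the real one awkward to assert against:

```javascript
// Minimal sketch of a lunr-style tokenize step: lowercase, split on the
// separator, drop empty strings. The real lunr.tokenizer wraps each piece
// in a lunr.Token with position metadata.
const separator = /[\s\-[\]:]+/g;

function tokenize(str) {
  return str
    .toLowerCase()
    .split(separator)
    .filter((token) => token.length > 0);
}

console.log(tokenize('The quick [brown] fox:jumped over the lazy-dog'));
// → ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
```

Splitting like this makes the separator regex itself unit-testable in isolation, before wiring it into Lunr.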
I am wondering, if we implement this feature for devs, whether it could be a separator / regex option or one of a list of predefined analyzers (i.e. like in https://www.elastic.co/guide/en/elasticsearch/reference/2.3/analyzer.html). The second seems nice, but the first (your suggestion) is the simplest to start with and very flexible.
> I am wondering, if we implement this feature for devs, whether it could be a separator / regex option or one of a list of predefined analyzers (i.e. like in https://www.elastic.co/guide/en/elasticsearch/reference/2.3/analyzer.html).
Hmm, yes that could be nice. But I agree, would probably want to iron that out a bit more before it is introduced into the API interface.
> The second seems nice, but the first (your suggestion) is the simplest to start with and very flexible.
Yes, I think this could be the best starting point. Just having access to the lunr object via itemsjs would be the most flexible option for advanced users and would ensure the correct dependency.
Something like this:
import lunr from 'itemsjs/lunr';
lunr.tokenizer.separator = /[\s\-[\]:]+/g;
or
import ItemsJS from 'itemsjs';
const index = ItemsJS(data, config);
index.lunr.tokenizer.separator = /[\s\-[\]:]+/g;
index.search({...})
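A minimal sketch of what that second shape could look like. This is purely hypothetical: itemsjs does not expose an index.lunr property today, and the createIndex name, the engine parameter, and the fake engine object are illustration-only:

```javascript
// Hypothetical factory illustrating the second proposal: the returned index
// keeps a reference to the underlying engine so callers can tweak its
// tokenizer. None of these names exist in the real itemsjs API.
function createIndex(data, config, engine) {
  return {
    lunr: engine, // expose the search engine for advanced configuration
    search: (query) => engine.search(query),
  };
}

// Stand-in engine object, only for demonstration
const fakeLunr = {
  tokenizer: { separator: /[\s\-]+/ },
  search: (query) => [{ query }],
};

const index = createIndex([], {}, fakeLunr);
index.lunr.tokenizer.separator = /[\s\-[\]:]+/g; // advanced-user override
console.log(index.search('test')); // → [ { query: 'test' } ]
```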
Makes sense!
@saminzadeh I've introduced a simple full text integration with external search engines in the latest version. You can see it here -> https://github.com/itemsapi/itemsjs/blob/master/docs/lunr2-integration.md or in the Readme.
In order to make search more powerful, adding the ability to change lunr.tokenizer.separator would be nice: https://github.com/olivernn/lunr.js/blob/master/lib/tokenizer.js#L69-L76
Out of the box, if you have the string `this.test`, the query `test` will return empty, but `this.t` will return `this.test`. By changing the regex used for tokenization, you could solve problems like this.
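The effect can be sketched with a plain split. Two assumptions here: Lunr's default separator is /[\s\-]+/ (per lib/tokenizer.js), and real Lunr matching is approximated by a simple split rather than the full tokenizer:

```javascript
// With the default separator, "this.test" stays a single token, so a
// search for "test" finds nothing; adding '.' to the separator splits it.
const defaultSeparator = /[\s\-]+/; // Lunr's out-of-the-box value
const customSeparator = /[\s\-.]+/; // also split on '.'

const tokenize = (str, separator) =>
  str.toLowerCase().split(separator).filter((t) => t.length > 0);

console.log(tokenize('this.test', defaultSeparator)); // → ['this.test']
console.log(tokenize('this.test', customSeparator)); // → ['this', 'test']
```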