itemsapi / itemsjs

Extremely fast faceted search engine in JavaScript - lightweight, flexible, and simple to use
Apache License 2.0

Add ability to change lunr tokenizer #28

Closed saminzadeh closed 3 years ago

saminzadeh commented 6 years ago

In order to make search more powerful, it would be nice to be able to change `lunr.tokenizer.separator`.

https://github.com/olivernn/lunr.js/blob/master/lib/tokenizer.js#L69-L76

Out of the box, if a document contains the string `this.test`, the query `test` returns nothing, while `this.t` returns `this.test`.

Changing the regex used for tokenization would solve problems like this.
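To see why the separator matters, here is a plain-JS sketch of the effect (no lunr needed, just `String.split`; the "default" separator below assumes lunr's stock whitespace/hyphen regex, and the custom one is illustrative):

```javascript
// Sketch: why a whole-word query "test" misses "this.test" under the
// assumed default separator (whitespace and hyphens only).
const defaultSeparator = /[\s\-]+/;      // assumed lunr default
const customSeparator = /[\s\-.\[\]:]+/; // hypothetical: also split on . [ ] :

const tokenize = (text, sep) =>
  text.toLowerCase().split(sep).filter(Boolean);

const doc = 'call this.test now';

console.log(tokenize(doc, defaultSeparator)); // ['call', 'this.test', 'now']
console.log(tokenize(doc, customSeparator));  // ['call', 'this', 'test', 'now']
```

With the default separator, `this.test` stays a single token, so only prefix queries like `this.t` can reach it; splitting on dots produces a standalone `test` token that an exact query can match.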

cigolpl commented 6 years ago

@saminzadeh great idea! Unfortunately ItemsJS is using Lunr v1.0.0, while Lunr is now at v2.3.3. Hopefully Lunr exposes a public function or constructor for changing `lunr.tokenizer.separator`. Ideally a simple change is possible with v1.0.0; otherwise ItemsJS will need to be upgraded to the latest Lunr sooner or later. I will hopefully look into it when I find some free time.

saminzadeh commented 6 years ago

Sounds good, might take a stab when I get a chance as well.

For now, I did this and it seems to work, since it changes the global lunr instance:

```js
import lunr from 'lunr';
import itemsjs from 'itemsjs';

// Split tokens on whitespace, hyphens, brackets, and colons.
lunr.tokenizer.separator = /[\s\-[\]:]+/g;

// itemsjs init here
const index = itemsjs(data, configuration);
```
cigolpl commented 6 years ago

Nice hack! Integration should be easier than I thought.

I am wondering how to easily test separators. I tried it out with:

```js
var paragraph = 'The quick brown fox jumped over the lazy dog. It barked.';
var regex = /[\s\-[\]:]+/g;
var found = paragraph.split(regex);

console.log(found);
// ["The", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog.", "It", "barked."]
```

Testing the results of the Lunr function `lunr.tokenizer` itself seems more complicated.

If we implement this feature for devs, it could be either a separator/regex option or a list of predefined analyzers (i.e. like in https://www.elastic.co/guide/en/elasticsearch/reference/2.3/analyzer.html). The second seems nice, but the first (your suggestion) is the simplest to start with and very flexible.
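Purely as an illustration of the two shapes being discussed, the configuration might look like this (both `tokenizer` keys are hypothetical, not real ItemsJS options):

```javascript
// Option 1 (hypothetical): expose the raw separator regex.
const configWithRegex = {
  searchableFields: ['name', 'description'],
  tokenizer: { separator: /[\s\-[\]:]+/g },
};

// Option 2 (hypothetical): choose from predefined analyzers, Elasticsearch-style.
const configWithAnalyzer = {
  searchableFields: ['name', 'description'],
  tokenizer: { analyzer: 'standard' }, // e.g. 'standard', 'whitespace', 'keyword'
};
```

The regex option is the more flexible of the two, since named analyzers can always be implemented later as presets that expand to a regex internally.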

saminzadeh commented 6 years ago

> If we implement this feature for devs, it could be either a separator/regex option or a list of predefined analyzers (i.e. like in https://www.elastic.co/guide/en/elasticsearch/reference/2.3/analyzer.html).

Hmm, yes, that could be nice. But I agree, we would probably want to iron that out a bit more before introducing it into the API.

> The second seems nice, but the first (your suggestion) is the simplest to start with and very flexible.

Yes, I think this would be the best starting point. Just having access to the lunr object via itemsjs would be the most flexible option for advanced users and would ensure the correct dependency is used.

Something like this:

```js
import lunr from 'itemsjs/lunr';

lunr.tokenizer.separator = /[\s\-[\]:]+/g;
```

or

```js
import ItemsJS from 'itemsjs';

const index = ItemsJS(data, config);
index.lunr.tokenizer.separator = /[\s\-[\]:]+/g;

index.search({...});
```
cigolpl commented 6 years ago

Makes sense!

cigolpl commented 3 years ago

@saminzadeh I've introduced a simple full-text integration with external search engines in the latest version. You can see it here: https://github.com/itemsapi/itemsjs/blob/master/docs/lunr2-integration.md or in the Readme.
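The division of labour that integration implies is: an external engine (e.g. Lunr 2 with a custom separator) handles full text and produces matching ids, and ItemsJS facets over only those ids. A self-contained sketch of that hand-off, with plain JS standing in for both libraries (the helper names here are made up, not the itemsjs API):

```javascript
// Stand-in data set with an id field for the hand-off.
const data = [
  { id: 1, name: 'this.test', category: 'code' },
  { id: 2, name: 'other doc', category: 'text' },
];

const separator = /[\s\-.\[\]:]+/; // custom separator, dots included

// Stand-in for the external full-text engine: return ids of documents
// containing the query as a whole token.
function fullTextIds(query, docs) {
  return docs
    .filter(d => d.name.toLowerCase().split(separator).includes(query.toLowerCase()))
    .map(d => d.id);
}

// Stand-in for faceting restricted to the ids from the external engine.
function facetByCategory(ids, docs) {
  const counts = {};
  for (const d of docs) {
    if (!ids.includes(d.id)) continue;
    counts[d.category] = (counts[d.category] || 0) + 1;
  }
  return counts;
}

const ids = fullTextIds('test', data);
console.log(ids);                        // [1] — "test" now matches "this.test"
console.log(facetByCategory(ids, data)); // { code: 1 }
```

Because tokenization lives entirely in the external step, any separator (or any search engine) can be swapped in without touching the faceting side.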