Closed: hungerburg closed this issue 6 years ago
Sorry for the late reply.
You could definitely wrap that normalise function up into a lunr plugin. There is a similar project, lunr-unicode-normalizer, but I don't think it has been updated for lunr 2.
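For reference, a minimal sketch of what such a plugin could look like, assuming a diacritic-stripping normaliser; the function names here are illustrative, not part of lunr, and the lunr wiring is shown in comments:

```javascript
// Strip combining diacritics from a string (e.g. "café" -> "cafe")
// using Unicode NFD decomposition.
function stripDiacritics(str) {
  return str.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

// Wiring it into lunr 2 as a plugin would look roughly like this
// (sketch; assumes lunr is loaded):
//
//   var normalise = function (token) {
//     return token.update(stripDiacritics);
//   };
//   lunr.Pipeline.registerFunction(normalise, 'normalise');
//   var normalisePlugin = function (builder) {
//     builder.pipeline.before(lunr.stemmer, normalise);
//     builder.searchPipeline.before(lunr.stemmer, normalise);
//   };

console.log(stripDiacritics('café über naïve')); // -> 'cafe uber naive'
```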
As for autocomplete, I need to get round to actually putting a demo of this together, but this is what I've been suggesting to people.
idx.query(function (q) {
  // exact matches should have the highest boost
  q.term(searchTerm, { boost: 100 })
  // prefix matches should be boosted slightly
  q.term(searchTerm, { boost: 10, usePipeline: false, wildcard: lunr.Query.wildcard.TRAILING })
  // finally, try a fuzzy search, without any boost
  q.term(searchTerm, { boost: 1, usePipeline: false, editDistance: 1 })
})
I disable the pipeline to prevent stemming from getting in the way; you would have to experiment to see whether that makes sense for your use case, especially if you want to add the unicode normalising plugin. Additionally, when using the query method lunr won't do any tokenisation for you; you can either handle this yourself, borrow lunr.tokenizer directly, or reuse its regex to split the input into individual tokens.
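As a sketch of that last option: lunr 2's default separator splits on whitespace and hyphens, so splitting a query string yourself can look like this (the helper name is made up):

```javascript
// lunr.tokenizer.separator defaults to /[\s\-]+/ in lunr 2;
// reusing the same regex to split a query string into lowercase tokens.
var separator = /[\s\-]+/;

function toTokens(query) {
  return query.toLowerCase().split(separator).filter(function (s) {
    return s.length > 0;
  });
}

console.log(toTokens('Foo-bar  baz')); // -> ['foo', 'bar', 'baz']
```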
Dear Oliver, no need to be sorry! With the help of your snippet I cobbled together a little more context. In fact, I can now skip creating an extra iindex upfront and instead feed the autocomplete from lunr itself, and I can also drop my own normalizer by using the pipeline:
// LUNR query for autocomplete terms
function acsearch(searchTerm) {
  var results = index.query(function (q) {
    // exact matches should have the highest boost
    q.term(searchTerm, { boost: 100 })
    // wildcard matches should be boosted slightly
    q.term(searchTerm, { boost: 10, usePipeline: true, wildcard: lunr.Query.wildcard.LEADING | lunr.Query.wildcard.TRAILING })
    // finally, try a fuzzy search, without any boost
    q.term(searchTerm, { boost: 1, usePipeline: false, editDistance: 1 })
  });
  if (!results.length) { return []; } // empty array, as the autocomplete source expects
  return results.map(function (v, i, a) { // extract
    return Object.keys(v.matchData.metadata);
  }).reduce(function (a, b) { // flatten
    return a.concat(b);
  }).filter(function (v, i, a) { // uniq
    return a.indexOf(v) === i;
  });
}
// install jquery autocomplete widget
$('#query').autocomplete({
  appendTo: '#dlg',
  minLength: 3,
  source: function (inp, out) {
    out(acsearch(inp.term.toLowerCase()));
  }
});
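The extract/flatten/uniq chain at the end of acsearch can be seen in isolation on stand-in data shaped like lunr results; the terms below are invented for illustration:

```javascript
// Each result's matchData.metadata is keyed by the matched (stemmed) terms.
var results = [
  { matchData: { metadata: { plant: {}, tree: {} } } },
  { matchData: { metadata: { tree: {}, shrub: {} } } }
];

var terms = results.map(function (v) { // extract
  return Object.keys(v.matchData.metadata);
}).reduce(function (a, b) {            // flatten
  return a.concat(b);
}).filter(function (v, i, a) {         // uniq
  return a.indexOf(v) === i;
});

console.log(terms); // -> ['plant', 'tree', 'shrub']
```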
The resulting list reads much the same -- though now it is sorted by score, and some new fuzzy terms appear. Performance on my desktop is also good. I think this shows all that can be pulled from an inverted index; anything more would need a mapping back to the unstemmed original text, and that is beyond the scope of this experiment, at least for now. Thank you very much indeed!
@hungerburg glad you managed to make some progress. Indeed the stemming does make implementing really good autocomplete difficult.
Just thinking aloud, but you can get Lunr to create and store the mapping to the unstemmed words, though it will likely increase the index size considerably.
During indexing you can add a pipeline function that stores the unstemmed words as metadata for a token. The concept is discussed a bit in the guides.
You would need to add a pipeline function before the stemmer:
var storeUnstemmed = function (token) {
  token.metadata['unstemmed'] = token.toString()
  return token
}
The idea would be that you would then get the unstemmed words from the search results, the guide has a more detailed example.
First prize would be an algorithmic unstemmer, but I don't think it's possible.
Closing this as it has gone stale, feel free to comment or re-open if you still have any unanswered questions.
First, a HUGE thanks to both @hungerburg and @olivernn for this. I combined both suggestions and it's working great. For anyone wanting to do the same, this is what worked for me...
I'm indexing like this:
// Store unstemmed term in the metadata. See:
// https://github.com/olivernn/lunr.js/issues/287#issuecomment-322573117
// https://lunrjs.com/guides/customising.html#token-meta-data
const storeUnstemmed = function(builder) {
  // Define a pipeline function that keeps the unstemmed word
  const pipelineFunction = function(token) {
    token.metadata['unstemmed'] = token.toString();
    return token;
  };
  // Register the pipeline function so the index can be serialised
  lunr.Pipeline.registerFunction(pipelineFunction, 'storeUnstemmed');
  // Add the pipeline function to the indexing pipeline, before the stemmer
  builder.pipeline.before(lunr.stemmer, pipelineFunction);
  // Whitelist the unstemmed metadata key
  builder.metadataWhitelist.push('unstemmed');
};

const index = lunr(function() {
  this.use(storeUnstemmed);
  ...
});
And modified the autocomplete function suggested by @hungerburg to use the unstemmed words like this:
autoComplete(searchTerm) {
  const results = this._index.query(function(q) {
    // exact matches should have the highest boost
    q.term(searchTerm, { boost : 100 })
    // wildcard matches should be boosted slightly
    q.term(searchTerm, {
      boost : 10,
      usePipeline : true,
      wildcard : lunr.Query.wildcard.LEADING | lunr.Query.wildcard.TRAILING
    })
    // finally, try a fuzzy search, without any boost
    q.term(searchTerm, { boost : 1, usePipeline : false, editDistance : 1 })
  });
  if (!results.length) {
    return []; // empty array, as autocomplete sources expect
  }
  return results.map(function(v, i, a) { // extract unstemmed terms
    const unstemmedTerms = {};
    Object.keys(v.matchData.metadata).forEach(function(term) {
      Object.keys(v.matchData.metadata[term]).forEach(function(field) {
        v.matchData.metadata[term][field].unstemmed.forEach(function(word) {
          unstemmedTerms[word] = true;
        });
      });
    });
    return Object.keys(unstemmedTerms);
  }).reduce(function(a, b) { // flatten
    return a.concat(b);
  }).filter(function(v, i, a) { // uniq
    return a.indexOf(v) === i;
  });
}
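For anyone puzzling over the nested loops: once 'unstemmed' is whitelisted, matchData.metadata is keyed by stemmed term, then by field, with the original words under unstemmed. The data below is invented to show the traversal:

```javascript
// Shape: { <stemmed term>: { <field>: { unstemmed: [<original words>] } } }
var metadata = {
  run: { body: { unstemmed: ['running', 'runs'] } },
  walk: { title: { unstemmed: ['walking'] } }
};

// Collect every original word once, using an object as a set.
var unstemmedTerms = {};
Object.keys(metadata).forEach(function (term) {
  Object.keys(metadata[term]).forEach(function (field) {
    metadata[term][field].unstemmed.forEach(function (word) {
      unstemmedTerms[word] = true;
    });
  });
});

console.log(Object.keys(unstemmedTerms)); // -> ['running', 'runs', 'walking']
```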
Thanks!
Not so much an issue with lunr, which is great! More a quick try to get ideas going…
In a shell with the jq utility I pull my terms from the lunr index in advance:
jq '[.index.invertedIndex[][0]|scan("^\\w{3,}")]|unique' index.json > iindex.json
I can feed that to the http://api.jqueryui.com/autocomplete/ widget like below. Not fully nice, but it works acceptably so far. Now truly nice would be to create the autocomplete index on the client and have the term to match processed by the indexer instead of that crude normalizer above.
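What the jq filter does can be replicated in plain JavaScript against a stand-in serialised index (the terms below are invented): take the first element of every invertedIndex entry, keep those starting with three or more word characters, then uniq and sort like jq's unique.

```javascript
// A serialised lunr 2 index stores invertedIndex as [term, posting] pairs.
var serialised = {
  index: {
    invertedIndex: [
      ['an', {}],
      ['appl', {}],
      ['banana', {}],
      ['appl', {}]
    ]
  }
};

var terms = serialised.index.invertedIndex
  .map(function (entry) { return entry[0]; })           // [][0] in jq
  .filter(function (t) { return /^\w{3}/.test(t); })    // scan("^\\w{3,}")
  .filter(function (v, i, a) { return a.indexOf(v) === i; }) // unique...
  .sort();                                              // ...which also sorts

console.log(terms); // -> ['appl', 'banana']
```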