olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License

Get callback with found words? #200

Closed julkue closed 8 years ago

julkue commented 8 years ago

When using

idx.search("search")

I would like to get a return value with an array containing all the words that were found in the entry, like:

[{
    "ref": 1,
    "score": 0.87533,
    "words": ["searching", "search", "searched"]
}]

This would give me the opportunity to, e.g., highlight exactly those words which were found. Is there a way?

My problem: combining lunr with a component that post-processes search results, such as a highlighter, produces inconsistent behavior. For example, a search for "searched" finds all entries containing "search" or "searched", but only "searched" gets highlighted, because such components lack features built into lunr, like stemming. So I need a way to determine exactly which words lunr matched in an entry. Reimplementing the algorithm seems excessive and unnecessary to me.

olivernn commented 8 years ago

lunr does not keep a document's unstemmed words around, so any highlighting tool would need to highlight based on the stem of the word. In your example, "searched" will never actually appear in the index; it will be stemmed to just "search" (or whatever the stemmer reduces it to).

For this reason, I think the only way to solve this is to return the positions of the tokens in the document. This should also make highlighting far faster, since there is no need to search through the entire text of a document to find the terms to highlight.

julkue commented 8 years ago

I think rebuilding the algorithm of lunr would cause unwanted issues:

Therefore I would like to exclude this option.

Isn't there an option to hook into lunr to save the unstemmed words? What exactly do you mean by position? The character index inside an element?

julkue commented 8 years ago

:balloon: I just saw that we've got an anniversary, the 200th issue :bowtie:

olivernn commented 8 years ago

Hehe, yes 2x :100:

I think, as it stands, the current implementation of lunr has taken us as far as it's possible to go. As you say, this and other features are somewhat incompatible with the current data structures, and I'm hesitant to try and shoehorn in features like this.

A better approach is to rethink some of the internals to provide better support, which is what I'm currently working on. As I'm sure you understand, though, this is not exactly quick, and time is always limited.

Storing the positions of terms in the index allows for highlighting matched terms as well as querying by locality. The changes needed to implement this also give third-party code the ability to store other metadata about tokens, which might open up some other interesting possibilities.

At a high level, I'd like to have positions stored something like this:

    'document-id': {
      'tf': 0.1234,
      'positions': [{
        'start': 3,
        'length': 6
      }]
    }

I know this doesn't really solve your problem right now, but I think, long term, it's the right decision.

julkue commented 8 years ago

Oliver, I didn't quite follow the position idea. What exactly does this position refer to? Is it the position in the whole text of the document (stripped HTML)?

olivernn commented 8 years ago

Yes, so you would know the character in the text at which a token starts, as well as how long it is. This should be enough to wrap those characters in some markup that shows the highlighting.

For example, in the following document:

index : 012345678
doc   : hurp durp

hurp would have a start index of 0 and a length of 4
durp would have a start index of 5 and a length of 4
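
The start/length idea above can be sketched in a few lines of JavaScript. This is only an illustration of wrapping character ranges in markup, not lunr code: the function name `highlight`, the `{start, length}` object shape, and the choice of `<mark>` tags are all assumptions, and the text is assumed to be plain text that needs no HTML escaping.

```javascript
// Sketch only: wrap each {start, length} range of `text` in <mark> tags.
// `highlight` is a hypothetical helper and not part of lunr.
function highlight(text, positions) {
  // Walk the ranges left to right so the untouched text is copied in order.
  const sorted = [...positions].sort((a, b) => a.start - b.start);
  let out = "";
  let cursor = 0;
  for (const { start, length } of sorted) {
    if (start < cursor) continue; // ignore ranges overlapping an earlier one
    out += text.slice(cursor, start);
    out += "<mark>" + text.slice(start, start + length) + "</mark>";
    cursor = start + length;
  }
  return out + text.slice(cursor);
}

console.log(highlight("hurp durp", [{ start: 5, length: 4 }]));
// prints "hurp <mark>durp</mark>"
```

Because the ranges are character offsets, no second search through the document text is needed to find what to highlight.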

julkue commented 8 years ago

@olivernn I am willing to implement an adapter, but I need your help in order to do this!

kanatzidis commented 8 years ago

+1 to returning match positions. It would be easy to roll our own highlighting with that. How far is the current implementation from doing that?

kanatzidis commented 8 years ago

I'm not sure how you would do it when operating on DOM elements. In my use case I have plain text, which I can modify with spans that add a highlight style. Why do you need the index relative to the span? If you can figure out that the word you're looking for is inside a child of the element, you can find it in element.children, and then that innerText would be relative to the span.

julkue commented 8 years ago

@kanatzidis You're absolutely right, I was thinking about this the wrong way. Sometimes it helps to talk with others. So if lunr returned the index of matched terms, an adapter could extract them from the text (substring()). Then a highlighting component could be called with that term.

So the only question left is: @olivernn, do you need any help to make this happen soon?
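
A minimal sketch of that adapter idea, assuming lunr hands back `{start, length}` positions for each match (that shape is taken from the proposal earlier in this thread). The helper name `matchedWords` and the sample text are made up for illustration.

```javascript
// Sketch only: recover the unstemmed words from character positions.
// `matchedWords` is a hypothetical adapter function, not part of lunr.
function matchedWords(text, positions) {
  const words = positions.map(({ start, length }) =>
    text.substring(start, start + length)
  );
  // Deduplicate so each surface form reaches the highlighter only once.
  return [...new Set(words)];
}

const text = "searching is fun, I searched and will search again";
const positions = [
  { start: 0, length: 9 },
  { start: 20, length: 8 },
  { start: 38, length: 6 },
];

console.log(matchedWords(text, positions));
// prints [ 'searching', 'searched', 'search' ]
```

The resulting array could then be passed to a word-based highlighting component.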

olivernn commented 8 years ago

Sorry for the lack of response, I've been spending some much needed time away from a computer!

The wrapping idea came from a library a friend created a while ago: https://github.com/benpickles/wapper. It is slightly old now, but should give some ideas for implementing the wrapping of characters in some DOM element.

I'm actively working on the changes required to provide the positional information of tokens; the change is part of a larger rework of the internals of the library. I can't commit to a date, but it is in progress and I hope to share what I have as soon as possible.

julkue commented 8 years ago

@olivernn Any update on this?

olivernn commented 8 years ago

I've been working on some changes that should enable what you need. I'm pretty close to getting at least a preview build out with some documentation, and I'll be sure to update this issue when that happens.

At a high level, the tokenizer will keep track of the position of each token within a document. This can then be returned with the search results, along with some other details about the matched term. My hope is that this provides all the information needed to enable users to highlight matched words.
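
As a rough illustration of a tokenizer that tracks positions, here is a whitespace-splitting sketch. It is not lunr's tokenizer: the name `tokenizeWithPositions`, the lowercasing step, and the `{term, start, length}` token shape are assumptions chosen to match the start/length proposal earlier in this thread.

```javascript
// Sketch only: split on whitespace and record where each token sits in the
// original text. Hypothetical helper; lunr's real tokenizer may differ.
function tokenizeWithPositions(text) {
  const tokens = [];
  const wordPattern = /\S+/g; // one match per run of non-whitespace
  let match;
  while ((match = wordPattern.exec(text)) !== null) {
    tokens.push({
      term: match[0].toLowerCase(), // what would go on to be stemmed/indexed
      start: match.index,           // offset of the token in the source text
      length: match[0].length,      // length of the unstemmed token
    });
  }
  return tokens;
}

console.log(tokenizeWithPositions("hurp durp"));
```

Since the offsets refer to the original text, later pipeline steps such as stemming can change `term` without losing track of where the word appeared.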

MykolaGolubyev commented 7 years ago

Good read here. Any updates on it? Thanks!

olivernn commented 7 years ago

Wow, how time flies.

As discussed earlier in this thread, I've been working on a new version of lunr that will support highlighting matched terms.

I've finally got round to putting together an alpha release, as well as an example showing how the highlighting will work.

This is an early alpha release, but should be fairly stable. I wouldn't recommend using it in production yet, but it should be possible to start testing with it. Install it with $ npm install lunr@alpha.

The example repository shows how to use this new version. The interface is largely the same as the current version of lunr, with a few small differences.

Specifically the code for highlighting search terms, which is not part of lunr, is here.

The plan is to get some feedback on this (and other) alpha releases before getting a final release together. The more people who test it out the better.

julkue commented 7 years ago

I'm wondering why you've reinvented the highlighting functionality in the wrapper file? There is already e.g. mark.js out there which is specifically built for this task.

You're using createTreeWalker with an acceptNode property. I don't think this will work in IE9. mark.js, on the other hand, is heavily unit-tested across browsers and accounts for this and other issues.

olivernn commented 7 years ago

The wrapper.js script is an example of how to use the data that is now returned by lunr to highlight words. It is by no means intended to be a production-ready library that can be used out of the box.

I wasn't aware of mark.js. Having taken a look at it, it seems that it will not be compatible with the results from lunr. Unless I'm mistaken, it assumes it will be passed words to highlight but, as mentioned earlier in this thread, that is something that lunr cannot provide efficiently.

julkue commented 7 years ago

Well, there would be a way, as mark.js also converts matches to start and end positions internally. But I'm still waiting for a stable public release before investigating whether a lunr.js plugin would be helpful here.

olivernn commented 7 years ago

The intention of this alpha release is to get feedback on the APIs and interfaces, so your input on how this would integrate with mark.js would be very welcome. I'm happy to work with other library developers to make this easier; let me know how I can help.

olivernn commented 7 years ago

One last thing: let's keep the discussion going in #25 so it's all in one place.