Provide context - Githubissues

gjtorikian commented 10 years ago

This is a start to resolving #25.

Right now it seems to be pretty easy to find the original position of a matching token, as I've done here. What I'm having a hard time doing is getting the original, untokenized document back. It seems to be impossible currently. I'm very new to the code, though, and I maybe be overlooking something.

What I'd like to do is this:

Given a text of: Professor Plumb has a green plant in his study

And a search of: green plant

I want to get the index of the matched token (in this case, 24).

From there, I'd like to devise some sort of extraction, so that the result returned to the user is actually something like a summary: ...has a green plant in his.... This way, the reader can kind of get a "preview" of what the search yielded in the doc. (This is probably best handled outside the core code.)

olivernn commented 10 years ago

Thanks for taking a look at this.

The way I see it there are a couple of things that need to change before this feature can be added. Trying to get the positions of tokens is a little harder than it first seems. You have to assume that by the time the incoming document has passed through the pipeline it bears little resemblance to the original text of the document. This is due to stemming, stop word removal or any other form of text processing the user has added to the pipeline.

This means that, as part of the pipeline, probably the very first step, the start index of the token in a string, must be tracked. Currently tokens have no meta data attached to them as they passthrough the pipeline, this would need to change, and this is potentially incompatible with any pipeline function anyone else has created. There are two ways (that I can think of now) to attach this meta data to a token.

Instead of just passing around strings, create a lunr.Token type that wraps this string. A lunr.Token could then have any amount of metadata, including its position, attached to it. Alternatively we could just take advantage of JavaScripts openness and add some properties to the existing token strings.

The first way, creating a lunr.Token type, seems like the Right Way™, tokens are a fairly important aspect of the library, it sort of makes sense to give them their own type. The downsides are that it will require more change, both in lunr and any user code in pipelines.

Just adding extra properties onto strings would allow us to attach the position meta data, however it feels a bit sloppy to me, on the upside it would remain backwards compatible with any end user code in the pipeline.

Currently I'm leaning more towards creating a lunr.Token type…

With the correct position of the matching token returned from the search, some additional code, perhaps a plugin for lunr, could then handle storing the original document, and then using the position to generate a useful preview of the search result.

I'd be very interested in your thoughts on this, or any other suggestions you might have.

gjtorikian commented 10 years ago

I think I follow what you're saying. You're basically against polluting the String.prototype namespace and in favor of generating another data structure to handle all the metadata, right?

While I think backwards compatibility is important, I think progress is more important (:grinning:), and not mucking around with a language's internals is best. A bump of a major revision should basically guarantee not to brake < 0.4.x releases of lunr.js.

I'll see if I can sit down and figure out some way to achieve this. Thanks for your feedback!

olivernn / lunr.js

Provide context #58