Preserve tags during tokenization

olivernn / lunr.js

A bit like Solr, but much smaller and not as bright

http://lunrjs.com

MIT License


Preserve tags during tokenization #69

Closed robrigo closed 9 years ago

robrigo commented 10 years ago

The regex substitutions applied to each token strip the opening or closing bracket of a tag, because angle brackets are matched by \W. This caused problems when I was trying to get my custom htmlStripper() pipeline function to work. It strips out HTML tags to give more accurate results, but it cannot match partial HTML tags. I propose that these changes be added to core so that anyone indexing tokens that include tags won't have to dig into the tokenizer and patch it themselves.
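To illustrate the truncation described above, here is a minimal sketch. It assumes a tokenizer that splits on whitespace and trims leading and trailing \W characters from each token; naiveTokenize is a hypothetical stand-in written for this example, not lunr's actual source.

```javascript
// Hypothetical stand-in for a tokenizer that splits on whitespace and
// trims non-word characters from both ends of every token.
function naiveTokenize(text) {
  return text
    .split(/\s+/)
    .filter(Boolean) // drop empty strings from leading/trailing whitespace
    .map(function (token) {
      // \W matches '<', '>', '/' and '"', so the outer brackets are lost
      return token.replace(/^\W+/, '').replace(/\W+$/, '').toLowerCase();
    });
}

var tokens = naiveTokenize('<a href="/docs">Read the docs</a>');
console.log(tokens);
// → [ 'a', 'href="/docs">read', 'the', 'docs</a' ]
```

The tokens 'a' and 'docs</a' no longer contain a complete tag, so a later pipeline function that looks for a full <...> pattern has nothing to match.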

chadananda commented 9 years ago

Is this pull request due to be integrated anytime soon? HTML parsing creates a huge mess of tokens with nasty partial tags. HTML attributes are being added to the index and cannot be removed accurately in the pipeline.

olivernn commented 9 years ago

I'm not sure that the built-in tokeniser should deal with HTML tags at all. Ideally this could be provided by a plugin, much like pipeline functions. The problem is that the tokeniser runs before the pipeline and has no easy way of being extended.

I think the simplest way to solve this particular problem is to strip all HTML tags from documents before trying to index them with lunr. There does seem to be a need for the tokeniser to be a little easier to extend, though, and I will have a think about how best to achieve that.
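A rough sketch of stripping tags before indexing might look like the following. The stripTags helper and the document shape (id, title, body) are assumptions made for illustration. A regex like this is not a full HTML parser and will mishandle edge cases such as a '>' inside an attribute value.

```javascript
// Hypothetical helper: replaces anything that looks like an HTML tag
// with a space, then collapses runs of whitespace.
function stripTags(html) {
  return html
    .replace(/<[^>]*>/g, ' ') // remove <...> sequences, attributes included
    .replace(/\s+/g, ' ')     // collapse the gaps left behind
    .trim();
}

// Assumed document shape, for illustration only.
var rawDoc = {
  id: 1,
  title: 'Guide',
  body: '<p>Read the <a href="/docs">docs</a> first.</p>'
};

// Clean the field before the document is handed to the index.
var cleanDoc = {
  id: rawDoc.id,
  title: rawDoc.title,
  body: stripTags(rawDoc.body)
};

console.log(cleanDoc.body);
// → Read the docs first.
```

The cleaned document, rather than the raw one, would then be added to the index, so the tokeniser never sees any markup.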