olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License

tokenizer separator regex #424

Open · StarfallProjects opened this issue 4 years ago

StarfallProjects commented 4 years ago

Hi! I am adding some custom regex to our tokenizer.separator:

lunr.tokenizer.separator = /[\s\-\.\(\)\[\]+\A-Z]/;

According to regex101.com, [\s\-\.\(\)\[\]+\A-Z] will match all the capital letters in DeleteStreamAsync. I would like it to split on those so that, for example, searching "DeleteStream" returns results.

Everything else is matching fine. For example, the separators I added so that DeleteStreamAsync(someParams) would be returned when searching DeleteStreamAsync all work - so it's splitting on ( at least. It just doesn't seem to like the A-Z.

Any suggestions/info much appreciated.

hoelzro commented 4 years ago

Hi @StarfallProjects - just to make sure I understand you correctly: you essentially want to split DeleteStreamAsync(fooBar) into a stream of tokens Delete Stream Async foo Bar, right?

Tweaking lunr.tokenizer.separator this way will only sort of work - a separator character is consumed by the split, so you'll end up with a token stream consisting of elete tream sync foo ar. Which means searching for "delete bar" won't turn up any search results, and looking for "Delete Bar" will have the same results as "Delete Car".
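The effect described above can be reproduced with plain String.prototype.split, which is essentially what lunr does with the separator regex (a sketch; the character class below is the one from the original comment with the stray \A escape dropped):

```javascript
// Why putting A-Z in the separator class eats the capitals:
// a separator character is consumed by the split, so each
// capital letter vanishes from the resulting tokens.
const separator = /[\s\-.()\[\]+A-Z]/;

const tokens = 'DeleteStreamAsync(fooBar)'.split(separator).filter(Boolean);
console.log(tokens);
// -> [ 'elete', 'tream', 'sync', 'foo', 'ar' ]
```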

If my understanding is correct and you want to use lunr for this, I'd recommend providing a custom, camel-case-aware tokenizer function to the lunr builder - I think you'd have better luck with that.

However, I've noticed a few of your issues seem to revolve around searching source code, and I'm not sure lunr is particularly well-suited to that. Is there a specific reason you chose lunr?

StarfallProjects commented 4 years ago

Thanks for the advice. I only joined the project recently - but it uses DocFX, which has lunr.js built in.

hoelzro commented 4 years ago

Ah, I see - so it might be unreasonable to just yank lunr.js out and replace it with something entirely different!

With that in mind, I'd recommend trying to write the custom, camel-case-aware tokenizer I suggested above, and see how that works for you!
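One way to sketch that idea, assuming you can pre-process field text before handing documents to the builder (the splitCamelCase name and the builder wiring in the comment are hypothetical, not part of lunr's API):

```javascript
// A minimal camel-case-aware splitting sketch. Instead of making
// capital letters separators (which deletes them), insert a space
// *before* each capital so it survives in the following token.
function splitCamelCase(text) {
  return text
    // "DeleteStream" -> "Delete Stream", "fooBar" -> "foo Bar"
    .replace(/([a-z0-9])([A-Z])/g, '$1 $2')
    // acronym runs: "HTTPServer" -> "HTTP Server"
    .replace(/([A-Z]+)([A-Z][a-z])/g, '$1 $2');
}

console.log(splitCamelCase('DeleteStreamAsync(fooBar)'));
// -> "Delete Stream Async(foo Bar)"

// Hypothetical wiring: pre-split each field before this.add(...),
// and let the default tokenizer break on the inserted whitespace.
// const idx = lunr(function () {
//   this.ref('id');
//   this.field('title');
//   docs.forEach(d => this.add({ id: d.id, title: splitCamelCase(d.title) }));
// });
```

The remaining ( and ) are then handled by the separator additions already in place, and because the capitals survive into the tokens, "DeleteStream" and "delete stream" both match after the tokenizer's usual lower-casing.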

StarfallProjects commented 4 years ago

Thanks!