silverbulletmd / silverbullet

The knowledge tinkerer's notebook
https://silverbullet.md
MIT License

Search Space does not find numbers #1097

Open lutzky opened 2 months ago

lutzky commented 2 months ago

Found accidentally when some of my pages were getting replaced by Cloudflare's 502 error page as part of testing #1091. To find these, I was using the Search Space command; apparently it omits any numeric characters from the query?

To reproduce, create a page with the following contents:

12345
barfoobaz
foobar5

Search results show:

I think this is caused here:

https://github.com/silverbulletmd/silverbullet/blob/4eae0c975b7ea75ecbfc967f17d0d6e9bd7679ed/plugs/search/engine.ts#L12-L14

I'm guessing this has to do with using the stemmer; e.g. stemmer('watches') returns watch, but stemmer('watches123') returns watcher123. If I'm right about that, could we perhaps index both the stemmed and non-stemmed variants of the word?
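To make the "index both variants" idea concrete, here is a minimal sketch. It is not the code in plugs/search/engine.ts: `toyStem` is a hypothetical stand-in for the real stemmer, the tokenizer is ASCII-only for brevity, and `buildIndex` and `searchBoth` are illustrative names. Each token is stored under both its raw and its stemmed form, and a query token is looked up both ways.

```typescript
// Toy stand-in for a real stemmer (hypothetical): strips a trailing "es" or "s".
function toyStem(word: string): string {
  return word.length > 3 ? word.replace(/(es|s)$/, "") : word;
}

// Lowercase tokens made of ASCII letters and digits, so numbers survive.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Map each term (raw AND stemmed) to the names of the pages containing it.
function buildIndex(pages: Record<string, string>): Map<string, string[]> {
  const index = new Map<string, string[]>();
  Object.keys(pages).forEach((name) => {
    tokenize(pages[name]).forEach((token) => {
      [token, toyStem(token)].forEach((term) => {
        const names = index.get(term) ?? [];
        if (names.indexOf(name) === -1) names.push(name);
        index.set(term, names);
      });
    });
  });
  return index;
}

// Pages matching one query token, via either its raw or its stemmed form.
function lookup(index: Map<string, string[]>, token: string): string[] {
  const raw = index.get(token) ?? [];
  const stemmed = index.get(toyStem(token)) ?? [];
  return raw.concat(stemmed).filter((p, i, all) => all.indexOf(p) === i);
}

// A page is a hit only if every query token matches it.
function searchBoth(index: Map<string, string[]>, query: string): string[] {
  const tokens = tokenize(query);
  if (tokens.length === 0) return [];
  return tokens
    .map((t) => lookup(index, t))
    .reduce((acc, pages) => acc.filter((p) => pages.indexOf(p) !== -1))
    .sort();
}
```

With the reproduction page above indexed this way, a query for 12345 or foobar5 hits the raw entry, while a query for watch still finds a page containing watches through the stemmed entry. The cost is an index that can hold up to two entries per distinct token.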

Maarrk commented 1 month ago

This seems to me a limitation of the default search feature.

To make it fast, it indexes the words and then queries only that index. To keep the index reasonably sized, and so that "word" and "words" point to the same item, a stemmer must be used. The alternative, looking through the full content of each file when searching, is noticeably slower even for a moderately sized space (even when using an external multithreaded program).

> If I'm right about that, could we perhaps index both the stemmed and non-stemmed variants of the word?

Maybe we don't do stemming at all but always check if it's a substring of what was indexed? The stemming only works for English anyway, and this issue is another occasion where it behaves unexpectedly.
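A rough sketch of that no-stemming alternative, under the assumption that the index keeps raw lowercase tokens and that a query matches a page when it is a substring of any indexed token. The names `rawTokens`, `buildVocabulary`, and `substringSearch` are hypothetical and do not come from the SilverBullet code base; the tokenizer is ASCII-only for brevity.

```typescript
// Raw (unstemmed) vocabulary: term -> names of the pages containing it.
type Vocabulary = Map<string, string[]>;

// Lowercase tokens made of ASCII letters and digits; no stemming applied.
function rawTokens(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

function buildVocabulary(pages: Record<string, string>): Vocabulary {
  const vocab: Vocabulary = new Map();
  Object.keys(pages).forEach((name) => {
    rawTokens(pages[name]).forEach((term) => {
      const names = vocab.get(term) ?? [];
      if (names.indexOf(name) === -1) names.push(name);
      vocab.set(term, names);
    });
  });
  return vocab;
}

// Scan every indexed term and collect pages whose terms contain the query.
function substringSearch(vocab: Vocabulary, query: string): string[] {
  const needle = query.toLowerCase();
  const hits: string[] = [];
  vocab.forEach((names, term) => {
    if (term.indexOf(needle) !== -1) {
      names.forEach((n) => {
        if (hits.indexOf(n) === -1) hits.push(n);
      });
    }
  });
  return hits.sort();
}
```

On the reproduction page, a query for 234 matches the token 12345 and a query for foo matches both barfoobaz and foobar5. The trade-off is that each query scans every distinct term instead of doing a single keyed lookup, so it sits between the current index lookup and a full-content scan in cost.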

Some related discussion: