Open lutzky opened 2 months ago
This seems to me a limitation of the default search feature.
To make it fast, it indexes the words, and then only queries that index. To keep the index reasonably sized, and to make `word` and `words` point to the same item, the stemmer must be used. The alternative of looking through the full content of each file when searching is noticeably slower, even for a moderately sized space (and even when using an external multithreaded program).
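To illustrate the trade-off described above, here is a minimal sketch of a stemmed inverted index. It is not the plug's actual code: `toyStem`, `buildIndex`, and `search` are hypothetical names, and the toy stemmer only strips a trailing "s", which is far cruder than a real English stemmer.

```typescript
// Hypothetical toy stemmer for illustration only: strips a trailing "s".
// A real stemmer handles many more suffixes.
function toyStem(word: string): string {
  return word.length > 3 && word.endsWith("s") ? word.slice(0, -1) : word;
}

// Inverted index: stemmed term -> set of page names containing that term.
function buildIndex(pages: Record<string, string>): Map<string, Set<string>> {
  const index = new Map<string, Set<string>>();
  for (const name of Object.keys(pages)) {
    const words = pages[name].toLowerCase().split(/\s+/).filter(Boolean);
    for (const word of words) {
      const term = toyStem(word);
      if (!index.has(term)) index.set(term, new Set<string>());
      index.get(term)!.add(name);
    }
  }
  return index;
}

// A query touches only the index, never the page contents.
function search(index: Map<string, Set<string>>, query: string): string[] {
  const hits = index.get(toyStem(query.toLowerCase()));
  return hits ? Array.from(hits).sort() : [];
}
```

Because both `word` and `words` reduce to the same key, a single lookup finds pages containing either form.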
> If I'm right about that, could we perhaps index both the stemmed and non-stemmed variants of the word?
Maybe we don't do stemming at all but always check if it's a substring of what was indexed? The stemming only works for English anyway, and this issue is another occasion where it behaves unexpectedly.
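A rough sketch of the substring idea, assuming the index maps each raw (unstemmed) term to the pages containing it. The `TermIndex` type and `substringSearch` function are hypothetical and not part of the existing plug.

```typescript
// Hypothetical index shape: raw (unstemmed) term -> names of pages containing it.
type TermIndex = Map<string, string[]>;

// Return every page that has at least one indexed term containing the query.
function substringSearch(index: TermIndex, query: string): string[] {
  const needle = query.toLowerCase();
  const hits = new Set<string>();
  index.forEach((pages, term) => {
    if (term.includes(needle)) {
      pages.forEach((page) => hits.add(page));
    }
  });
  return Array.from(hits).sort();
}
```

This scans every distinct term in the index instead of doing one keyed lookup, so it would likely be slower than the stemmed lookup, though it still avoids reading the full page contents.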
Some related discussion:
Found accidentally when some of my pages were getting replaced by Cloudflare's 502 error as part of testing #1091. To find these, I was using the `Search Space` command; apparently it omits any numeric characters from the query?

To reproduce, create a page with the following contents:
Search results show:
I think this is caused here:
https://github.com/silverbulletmd/silverbullet/blob/4eae0c975b7ea75ecbfc967f17d0d6e9bd7679ed/plugs/search/engine.ts#L12-L14
I'm guessing this has to do with using the stemmer; e.g. `stemmer('watches')` returns `watch`, but `stemmer('watches123')` returns `watcher123`. If I'm right about that, could we perhaps index both the stemmed and non-stemmed variants of the word?
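A minimal sketch of what indexing both variants could look like. `indexTerms` and `placeholderStem` are hypothetical names; the placeholder stemmer just strips a trailing "es" or "s" and is not the stemmer the search plug uses.

```typescript
// Placeholder stemmer for illustration only (an assumption, not the real one):
// strips a trailing "es" or "s".
function placeholderStem(word: string): string {
  return word.replace(/(es|s)$/, "");
}

// Return the keys to store (or look up) for one word: the raw lowercase form,
// plus the stemmed form when it differs from the raw one.
function indexTerms(word: string, stem: (w: string) => string): string[] {
  const raw = word.toLowerCase();
  const stemmed = stem(raw);
  return stemmed === raw ? [raw] : [raw, stemmed];
}
```

The same function could be applied to the query, so an exact term such as one with trailing digits would still match its raw key even if the stemmer mangles it. The cost is a larger index, since many words get two entries.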