Sunshine40 opened 6 months ago
- Should we consider search solutions other than elasticlunr.js & lunr-languages?
Another solution is to use fzf, as in #2052. I think it is simpler than adding a bunch of JS for different languages, but I haven't compared the search results to see whether it is a good solution.
I've fixed the major problems mentioned before. The teaser now works well with CJK text, which has no spaces as word separators. It can also return the correct result for "multiple words joined together as a single keyword", like this:
I added keyword highlighting in the breadcrumbs, in case the keywords don't occur in the document body at all:
(Opinionated) Line breaks are preserved in teasers (but indentation is removed, because it's hard to get right).
All of the above is the behavior of the "fallback" strategy, which is currently enabled only if the book's `book.language` is set and is not English.
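For example, a book with the following `book.toml` would opt into the fallback strategy (Chinese is just one choice of non-English language):

```toml
[book]
# Any explicitly set non-English language enables the fallback strategy.
language = "zh"
```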
The main reason Chinese search didn't work before is a combination of 2 facts:
So I put my effort into implementing "phrase search", so that no "Chinese word splitting" algorithm is needed.
Text is basically divided into 4 categories (sketched in code below):

- **Ideograph-like**
  - When indexed, each character is treated as a separate word.
  - When counted during teaser generation, 2 characters are counted as 1 word.
  - Ideographs like Chinese characters come in such a wide variety that indexing single characters already narrows the results down enough for further processing by the offline JS.
  - Hangul syllables are technically not ideographs, but they share similar characteristics.
- **Emoji**
  - Emoji Modifier Sequences and Zero Width Joiners are handled so that one emoji icon is treated as 1 word. (This is currently not working due to upstream issues.)
- **Non-word**
  - These are not indexed, and are not counted as words during teaser generation.
- **Default**
  - These are separated by anything that is not Default text, so 中文English混合 will be tokenized into 中, 文, English, 混, 合 even though there are no spaces between them.
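As a rough sketch of this categorization in Rust, ignoring the Emoji category and using deliberately narrow Unicode ranges (the names and ranges below are illustrative, not the PR's actual code):

```rust
/// Rough approximation of "Ideograph-like": CJK Unified Ideographs
/// plus Hangul syllables. The real ranges would need to be broader.
fn is_ideograph_like(ch: char) -> bool {
    matches!(ch as u32, 0x4E00..=0x9FFF | 0xAC00..=0xD7A3)
}

/// Tokenize: each Ideograph-like character becomes its own token,
/// Default (alphanumeric) runs stay whole, and Non-word characters
/// only act as separators and are never emitted.
fn tokenize(text: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut word = String::new();
    for ch in text.chars() {
        if is_ideograph_like(ch) {
            if !word.is_empty() {
                tokens.push(std::mem::take(&mut word));
            }
            tokens.push(ch.to_string());
        } else if ch.is_alphanumeric() {
            word.push(ch);
        } else if !word.is_empty() {
            tokens.push(std::mem::take(&mut word));
        }
    }
    if !word.is_empty() {
        tokens.push(word);
    }
    tokens
}
```

With this, `tokenize("中文English混合")` yields `["中", "文", "English", "混", "合"]`, matching the example above.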
This is a huge pain, so I'll make it brief.
For "phrase searching", one thing worth noting is that the result returned by elasticlunr might not be valid, for example with keyword 安全抽象
(auto-splitted into 安, 全, 抽, 象
) would match:
形象……全部……抽取……平安
But it's obviously not what we want.
So the new search strategy only uses elasticlunr as the first pass of filtering, and then uses a regex to apply extra filtering on the returned results.
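The real second pass lives in the client-side searcher JS, but the idea can be sketched in Rust with the `regex` crate, assuming the filter simply requires the query to appear as one contiguous phrase (the function name is hypothetical):

```rust
use regex::Regex;

/// Keep only first-pass candidates whose body contains the full query
/// as a contiguous phrase, discarding scattered matches such as
/// 形象……全部……抽取……平安 for the query 安全抽象.
fn phrase_filter<'a>(query: &str, candidates: &[&'a str]) -> Vec<&'a str> {
    // Escape the query so regex metacharacters match literally.
    let phrase = Regex::new(&regex::escape(query)).expect("escaped query is valid");
    candidates
        .iter()
        .copied()
        .filter(|body| phrase.is_match(body))
        .collect()
}
```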
Only the above-mentioned Default text is highlighted as whole words; Ideograph-like and Non-word text is highlighted as-is. One key difference from the existing implementation is that the highlighted range might not be contiguous: there might be …… in the middle of it.
Another difference, an opinionated one, is that I chose to force the teaser to include each matched keyword at least once. So with the keyword hello world, the following document body:

> Hello hello and hello (10000 words omitted) hello (10000 words omitted) world. (10000 words omitted) world better.

would get a teaser like:

> Hello hello and hello (14 words omitted) …… (11 words omitted) world.……
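Here is a minimal sketch of how such a teaser could be assembled, assuming the matched positions are already known at the word level and sorted (all names are hypothetical, not mdBook's actual searcher code):

```rust
/// Build a teaser containing every hit at least once: take a window
/// of `radius` words around each hit, merge windows that touch or
/// overlap, and join the remaining gaps with "……".
fn teaser(words: &[&str], hits: &[usize], radius: usize) -> String {
    let mut ranges: Vec<(usize, usize)> = Vec::new();
    for &hit in hits {
        let start = hit.saturating_sub(radius);
        let end = (hit + radius + 1).min(words.len());
        match ranges.last_mut() {
            // Extend the previous window instead of starting a new one.
            Some((_, prev_end)) if start <= *prev_end => {
                *prev_end = (*prev_end).max(end);
            }
            _ => ranges.push((start, end)),
        }
    }
    ranges
        .iter()
        .map(|&(start, end)| words[start..end].join(" "))
        .collect::<Vec<_>>()
        .join("……")
}
```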
By the way, I decided that it's quite pointless to show half a clause (a sentence is composed of clauses) containing a highlighted keyword. As a result, if the book's `output.html.search.teaser-word-count` is too small and there are too many matched keywords, the limit might be exceeded.
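For reference, that limit is the existing `book.toml` option (30 is mdBook's documented default):

```toml
[output.html.search]
# With the new strategy, a very small value here may be exceeded
# instead of cutting a clause in half.
teaser-word-count = 30
```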
mdBook uses `elasticlunr-rs` to generate the search index, which is then consumed by `elasticlunr.js`. But there's a flaw in `elasticlunr-rs`'s implementation which makes it inconsistent with the original JS library:
`elasticlunr-rs/src/inverted_index.rs`, lines 40 to 42 in 29d97e4:

```rust
let mut iter = token.chars();
if let Some(character) = iter.next() {
    let mut item = self
```
During index building, `elasticlunr-rs` iterates over the token `&str`'s content in Unicode scalar values, while the JS library does it this way:
```js
elasticlunr.InvertedIndex.prototype.addToken = function (token, tokenInfo, root) {
  var root = root || this.root,
      idx = 0;

  while (idx <= token.length - 1) {
    var key = token[idx];
```
The JS string is actually iterated in UTF-16 code units, which are whole characters for English, most alphabetic text, and common Chinese characters, but not for emoji and rare Chinese characters. Currently mdBook cannot handle those, with or without my patch.
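The discrepancy is easy to demonstrate from the Rust side. Here is a minimal, self-contained snippet (not code from either library) contrasting the two iteration schemes:

```rust
fn main() {
    // U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic
    // Multilingual Plane, so JS stores it as a surrogate pair.
    let token = "𝄞a";
    // elasticlunr-rs style: iterate Unicode scalar values -> 2 items.
    println!("{}", token.chars().count());
    // elasticlunr.js style: iterate UTF-16 code units -> 3 items,
    // because the clef is split into two surrogate code units.
    println!("{}", token.encode_utf16().count());
}
```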
@ehuss what do you think about these design choices?
#1081 has been stuck for a while, so I tried implementing my own version.
Preview the search functionality online
Inspired by #1496.
Major implementation steps:
- `elasticlunr-rs`'s indexing implementation (detail)

Unresolved questions:

- Should the `search-non-english` feature be enabled by default? (Not likely, since it causes severe binary size bloat)