rust-lang / mdBook

Create book from markdown files. Like Gitbook but implemented in Rust
https://rust-lang.github.io/mdBook/

Non-English search support #2393

Open Sunshine40 opened 2 months ago

Sunshine40 commented 2 months ago

#1081 has been stuck for a while, so I tried implementing my own version.

Preview the search functionality online

Inspired by #1496.

Major implementation steps:

Unresolved questions:

duskmoon314 commented 1 month ago

Another solution is to use fzf, as in #2052. I think that is simpler than adding a bunch of language-specific JS. But I haven't compared the search results to see whether it is a good solution.

Sunshine40 commented 1 month ago

Update

I've fixed the major problems mentioned before.

Now the teaser works well with CJK text, which has no spaces to act as word separators.

(screenshot)

And it can also return the correct result for "multiple words joined together as a single keyword", like this:

(screenshot)

I added keyword highlighting in the breadcrumbs part, in case the keywords don't occur in the document body at all:

(screenshot)

(Opinionated) Line breaks are preserved in teasers, but indentation is removed, because indentation is hard to get right.

All of the above is the behavior of the "fallback" strategy, which is currently enabled only if the book's book.language is set and is not English.
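For illustration, here is a minimal book.toml that would opt into the fallback strategy under that rule (zh is just an example language code):

```toml
[book]
title = "My Book"
# Any explicit non-English language enables the fallback search strategy.
language = "zh"

[output.html.search]
# Teaser length budget, discussed below (mdBook's default is 30).
teaser-word-count = 30
```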

Sunshine40 commented 1 month ago

How does the "fallback" strategy work?

The main reason Chinese search didn't work before is a combination of two facts:

  1. The search-index building process of mdBook's HTML renderer used the space character as the only word separator, and deliberately stripped anything other than the 26 English letters (case-insensitive) and the digits 0–9 (see the sketch after this list).
  2. Even if the search index were built with Chinese word splitting, one could not search for a series of words, i.e. do "phrase searching", without adding spaces between the words. And with spaces added, the words are no longer required to appear one right after another, which is rarely what the user wants (at least among Chinese users).
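A minimal sketch of fact 1 (just the described behavior, not the actual mdBook code): splitting on spaces and stripping everything outside ASCII letters and digits leaves nothing to index for Chinese text.

```rust
/// Sketch of the old behavior: split on spaces, then strip everything
/// that is not an ASCII letter or digit.
fn old_tokenize(text: &str) -> Vec<String> {
    text.split(' ')
        .map(|word| {
            word.chars()
                .filter(|c| c.is_ascii_alphanumeric())
                .collect::<String>()
                .to_lowercase()
        })
        .filter(|w| !w.is_empty())
        .collect()
}

fn main() {
    assert_eq!(old_tokenize("Hello World"), ["hello", "world"]);
    // Chinese text yields no tokens at all, so it can never be found.
    assert!(old_tokenize("安全抽象").is_empty());
}
```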

So I put my effort into implementing "phrase search", which removes the need for any "Chinese word splitting" algorithm.

Tokenization

Text is divided into 4 categories:

Ideographs

When indexed, each character is treated as a separate word.

When counted during teaser generation, 2 characters are counted as 1 word.

Ideographic scripts like Chinese have such a wide range of characters that even single-character index lookups narrow down the results enough for further processing by offline JS.

Hangul syllables are technically not ideographs, but they share similar characteristics.
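As a sketch of those two rules (the character-range test is an illustrative assumption, not the patch's exact tables):

```rust
/// Rough ideograph/Hangul test (an assumption for illustration only).
fn is_ideographic(c: char) -> bool {
    matches!(c, '\u{4E00}'..='\u{9FFF}' | '\u{AC00}'..='\u{D7AF}')
}

/// Indexing: each ideographic character is its own token.
fn ideograph_tokens(text: &str) -> impl Iterator<Item = char> + '_ {
    text.chars().filter(|&c| is_ideographic(c))
}

/// Teaser budget: two ideographic characters count as one "word",
/// so a 30-word budget fits roughly 60 CJK characters.
fn ideograph_word_count(text: &str) -> usize {
    (ideograph_tokens(text).count() + 1) / 2
}
```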

Emoji

Emoji Modifier Sequences and Zero Width Joiners are handled so that an emoji icon is treated as 1 word.
(This is currently not working due to upstream issues)
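One way to get that grouping (an assumption on my part; the patch may use a different mechanism) is extended grapheme clusters, e.g. via the unicode-segmentation crate:

```rust
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    // woman + ZWJ + woman + ZWJ + girl renders as a single icon.
    let family = "👩\u{200D}👩\u{200D}👧";
    // Extended grapheme clusters group the whole sequence into one "word"...
    assert_eq!(family.graphemes(true).count(), 1);
    // ...even though it is 5 scalar values and 8 UTF-16 code units.
    assert_eq!(family.chars().count(), 5);
    assert_eq!(family.encode_utf16().count(), 8);
}
```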

Non Word

These are not indexed, and not counted as words during teaser generation.

Default

These are separated by anything that is not Default text, so 中文English混合 will be tokenized into

中, 文, English, 混, 合

even though there are no spaces between them.
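Condensed into a sketch (the classification is approximate, and the Emoji category is omitted for brevity):

```rust
enum Category {
    Ideograph, // one token per character
    NonWord,   // not indexed, acts only as a separator
    Default,   // split on anything that is not Default
}

fn category(c: char) -> Category {
    match c {
        // Approximation: CJK Unified Ideographs plus Hangul syllables.
        '\u{4E00}'..='\u{9FFF}' | '\u{AC00}'..='\u{D7AF}' => Category::Ideograph,
        c if c.is_alphanumeric() => Category::Default,
        _ => Category::NonWord,
    }
}

fn tokenize(text: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut current = String::new();
    for c in text.chars() {
        match category(c) {
            // An ideograph is its own token and ends any Default run.
            Category::Ideograph => {
                if !current.is_empty() {
                    tokens.push(std::mem::take(&mut current));
                }
                tokens.push(c.to_string());
            }
            Category::Default => current.push(c),
            Category::NonWord => {
                if !current.is_empty() {
                    tokens.push(std::mem::take(&mut current));
                }
            }
        }
    }
    if !current.is_empty() {
        tokens.push(current);
    }
    tokens
}

fn main() {
    assert_eq!(tokenize("中文English混合"), ["中", "文", "English", "混", "合"]);
}
```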

Teaser generation

This is a huge pain, so I'll keep it brief.

Filtering

For "phrase searching", one thing worth noting is that the result returned by elasticlunr might not be valid, for example with keyword 安全抽象 (auto-splitted into 安, 全, 抽, 象) would match:

……部……取……平

But it's obviously not what we want.

So the new search strategy only uses elasticlunr as a first pass of filtering.

It then applies extra regex-based filtering to the returned results.
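A sketch of that second pass, using Rust's regex crate for illustration (the real filtering runs in browser-side JS):

```rust
use regex::Regex;

/// Keep only candidates where the tokens occur in order as one phrase;
/// `\s*` between tokens tolerates line breaks inside a phrase.
fn phrase_filter<'a>(tokens: &[&str], candidates: Vec<&'a str>) -> Vec<&'a str> {
    let pattern = tokens
        .iter()
        .map(|t| regex::escape(t))
        .collect::<Vec<_>>()
        .join(r"\s*");
    let re = Regex::new(&pattern).unwrap();
    candidates.into_iter().filter(|doc| re.is_match(doc)).collect()
}

fn main() {
    let tokens = ["安", "全", "抽", "象"];
    let hits = phrase_filter(&tokens, vec![
        "……安部……全取……抽象平", // scattered: an elasticlunr hit, rejected here
        "内存安全抽象是……",      // contiguous phrase: kept
    ]);
    assert_eq!(hits, vec!["内存安全抽象是……"]);
}
```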

Highlighting

Only the above-mentioned Default text is highlighted as whole words; ideograph-like and non-word text is highlighted as-is.
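As a sketch, assuming a regex-based highlighter (\b stands in for the category-based word-boundary test):

```rust
use regex::Regex;

/// Default tokens are highlighted as whole words; everything else as-is.
fn highlight_pattern(token: &str) -> String {
    let is_default = token.chars().all(|c| c.is_ascii_alphanumeric());
    if is_default {
        format!(r"\b{}\b", regex::escape(token)) // whole word only
    } else {
        regex::escape(token) // ideographs etc.: match anywhere
    }
}

fn main() {
    let re = Regex::new(&highlight_pattern("cat")).unwrap();
    assert!(re.is_match("a cat sat"));
    assert!(!re.is_match("concatenate")); // no highlight inside a word
    let re = Regex::new(&highlight_pattern("安")).unwrap();
    assert!(re.is_match("安全")); // ideograph is highlighted as-is
}
```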

Display range selection

One key difference from the existing implementation is that the displayed range might not be contiguous: there might be …… in the middle of it.

Another difference, which is opinionated, is that I chose to force the teaser to include each matched keyword at least once, so with the keyword hello world, the following document body:

Hello hello and hello (10000 words omitted) hello (10000 words omitted) world. (10000 words omitted) world better.

would get a teaser like:

Hello hello and hello (14 words omitted) …… (11 words omitted) world.……

By the way, I decided that it's quite pointless to show only half a clause (a sentence is composed of clauses) containing a highlighted keyword. As a result, if the book's output.html.search.teaser-word-count is too small and there are too many matched keywords, the limit might be exceeded.
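A heavily simplified sketch of the selection rule (the real logic works on clauses and the word budget rather than a fixed radius):

```rust
/// Take a window around the first occurrence of each keyword, merge
/// overlapping windows, and join the surviving pieces with "……".
fn teaser(words: &[&str], keywords: &[&str], radius: usize) -> String {
    let mut ranges: Vec<(usize, usize)> = Vec::new();
    for kw in keywords {
        if let Some(i) = words.iter().position(|w| w.eq_ignore_ascii_case(kw)) {
            ranges.push((i.saturating_sub(radius), (i + radius + 1).min(words.len())));
        }
    }
    ranges.sort();
    let mut merged: Vec<(usize, usize)> = Vec::new();
    for (start, end) in ranges {
        match merged.last_mut() {
            // Overlapping or touching windows collapse into one range.
            Some((_, e)) if start <= *e => *e = (*e).max(end),
            _ => merged.push((start, end)),
        }
    }
    merged
        .iter()
        .map(|&(s, e)| words[s..e].join(" "))
        .collect::<Vec<_>>()
        .join(" …… ")
}

fn main() {
    let body: Vec<&str> =
        "Hello hello and hello x x x x x world better".split(' ').collect();
    // Prints: Hello hello and …… x x world better
    println!("{}", teaser(&body, &["hello", "world"], 2));
}
```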

Sunshine40 commented 1 month ago

Why doesn't Emoji work?

mdBook uses elasticlunr-rs to generate the search index, which is then consumed by elasticlunr.js.

But there's a flaw in elasticlunr-rs's implementation which makes it inconsistent with the original JS library:

elasticlunr-rs/src/inverted_index.rs
Lines 40 to 42 in 29d97e4

```rust
let mut iter = token.chars();
if let Some(character) = iter.next() {
    let mut item = self
```

During index building, elasticlunr-rs iterates over the token &str's content as Unicode scalar values (Rust chars).

While the JS library does it this way:

```js
elasticlunr.InvertedIndex.prototype.addToken = function (token, tokenInfo, root) {
  var root = root || this.root,
      idx = 0;

  while (idx <= token.length - 1) {
    var key = token[idx];
```
A JS string is actually iterated over in UTF-16 code units, which are whole characters for English, most alphabetic text, and common Chinese characters, but not for emoji or rare Chinese characters, which take two code units (a surrogate pair).
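The mismatch is easy to demonstrate from the Rust side:

```rust
fn main() {
    // '搜' (a common Chinese character) is in the BMP:
    // 1 scalar value and 1 UTF-16 code unit, so both sides agree.
    assert_eq!("搜".chars().count(), 1);
    assert_eq!("搜".encode_utf16().count(), 1);

    // '🦀' is outside the BMP: 1 scalar value but a 2-unit surrogate pair.
    // elasticlunr-rs walks its index one scalar value at a time (one step),
    // while elasticlunr.js walks it one UTF-16 unit at a time (two steps),
    // so the index built in Rust doesn't line up with the JS lookup.
    assert_eq!("🦀".chars().count(), 1);
    assert_eq!("🦀".encode_utf16().count(), 2);
}
```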

And currently mdBook cannot handle these, with or without my patch.

Sunshine40 commented 1 month ago

@ehuss what do you think about these design choices?