mattico / elasticlunr-rs

A partial port of elasticlunr to Rust. Intended to be used for generating compatible search indices.
Apache License 2.0

Index created by elasticlunr-rs doesn't work with elasticlunr.js for characters that can't be represented by a single UTF-16 Code Unit #53

Open Sunshine40 opened 3 months ago

Sunshine40 commented 3 months ago

https://github.com/mattico/elasticlunr-rs/blob/29d97e4c8e91bb0d1813716fb2d1575066344d76/src/inverted_index.rs#L40-L42

During index building, elasticlunr-rs iterates over the token (a &str) one Unicode scalar value (char) at a time.

The JS library, however, does it this way:

elasticlunr.InvertedIndex.prototype.addToken = function (token, tokenInfo, root) {
  var root = root || this.root,
      idx = 0;

  while (idx <= token.length - 1) {
    var key = token[idx];

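The JS string is indexed by UTF-16 code unit, not by character. One code unit covers a whole character for English, most alphabetic scripts, and common Chinese characters, but most emoji and rare Chinese characters need two code units (a surrogate pair). For those characters the index built by elasticlunr-rs has one key where elasticlunr.js expects two, so the lookup fails.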
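The mismatch can be reproduced with the standard library alone. The following is a minimal sketch, not code from elasticlunr-rs: the helper names `scalar_value_keys` and `utf16_code_unit_keys` are hypothetical and only illustrate the two ways of splitting a token into keys.

```rust
/// Hypothetical helper: splits a token the way a per-`char` loop does,
/// yielding one key per Unicode scalar value.
fn scalar_value_keys(token: &str) -> Vec<char> {
    token.chars().collect()
}

/// Hypothetical helper: splits a token the way `token[idx]` does in
/// JavaScript, yielding one key per UTF-16 code unit.
fn utf16_code_unit_keys(token: &str) -> Vec<u16> {
    token.encode_utf16().collect()
}

fn main() {
    // "rust" and "中文" are entirely in the Basic Multilingual Plane;
    // "😀" (U+1F600) and "𠀀" (U+20000) are not.
    for token in ["rust", "中文", "😀", "𠀀"] {
        println!(
            "{:?}: {} scalar value(s), {} UTF-16 code unit(s)",
            token,
            scalar_value_keys(token).len(),
            utf16_code_unit_keys(token).len()
        );
    }
}
```

For the first two tokens both counts agree, so either way of splitting produces the same keys. For the last two, the scalar-value split yields one key while the UTF-16 split yields two, which is the depth the JS loop would walk.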


Related issue with mdBook.