olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License

How to diagnose why results do not match expectations? Chinese search term used. #476

Open judychu opened 3 years ago

judychu commented 3 years ago

First of all, I'm using the lunr-languages plugin, so the behavior might differ from plain lunr. All documents, the index, and the search terms are in Traditional Chinese.

I have successfully created an index with the expected tokens. Two documents have a field containing the same term, but the search result only returns document B, not document A. Here is an example:

Part of the Index

"version": "2.3.9",
      "fields": [
        "name"
      ],
      "fieldVectors": [
        [ // Assume only 2 documents added. 
          "name/roasted-chicken",
          [
            3,
            12.04,
            4,
            12.04,
            5,
            12.04,
            6,
            12.04
          ]
        ],
        [
          "name/lemon-salt-chicken",
          [
            11,
            10.923,
            12,
            10.923,
            13,
            10.923,
            14,
            10.923,
            15,
            10.923
          ]
        ]
      ],
      "invertedIndex": [
        [ // Some tokens are trimmed here
          "烤雞",
          {
            "_index": 4,
            "name": {
              "roasted-chicken": {
                "position": [
                  [
                    2,
                    2
                  ]
                ]
              }
            }
          }
        ],
        [
          "雞",
          {
            "_index": 14,
            "name": {
              "lemon-salt-chicken": {
                "position": [
                  [
                    5,
                    1
                  ]
                ]
              }
            }
          }
        ],
      ],
      "pipeline": [
        "stemmer"
      ]
    }

The name fields of both document A (roasted-chicken) and document B (lemon-salt-chicken) contain the Chinese term "雞" (which means "chicken" in English); however, only document B is returned as a result:

ref: "lemon-salt-chicken"
score: 10.923
matchData: { 雞: { name: { "position": [ [5,1] ] } } }
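
One way to check which documents a token is actually attached to is to read the serialized index directly. The sketch below uses only the two inverted-index entries shown in the excerpt above; `refsForToken` is a hypothetical helper for illustration, not part of lunr. In this excerpt "雞" is stored only for lemon-salt-chicken, while roasted-chicken is stored under the longer token "烤雞", so an exact-term lookup of "雞" reaches only the former.

```javascript
// Excerpt of the serialized index from the post (only the two tokens shown there).
const serializedIndex = {
  invertedIndex: [
    ["烤雞", { _index: 4, name: { "roasted-chicken": { position: [[2, 2]] } } }],
    ["雞", { _index: 14, name: { "lemon-salt-chicken": { position: [[5, 1]] } } }],
  ],
};

// Hypothetical helper (not a lunr API): list the document refs that the
// serialized index stores for an exact token in a given field.
function refsForToken(index, token, field) {
  const entry = index.invertedIndex.find(([term]) => term === token);
  if (!entry) return [];
  const postings = entry[1][field] || {};
  return Object.keys(postings);
}

console.log(refsForToken(serializedIndex, "雞", "name"));   // [ 'lemon-salt-chicken' ]
console.log(refsForToken(serializedIndex, "烤雞", "name")); // [ 'roasted-chicken' ]
```

If the same check on the full index gives the same picture, the difference would come from how the name of document A was split into tokens at indexing time rather than from a low score.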

And here is my reference code in Gatsby's gatsby-node.js:

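const index = lunr(function () {
  // Register the Chinese language plugin from lunr-languages.
  this.use(lunr.zh);

  this.ref(`slug`);
  this.field(`name`, { boost: 10 });
  // Store token positions so they show up in matchData.
  this.metadataWhitelist = ["position"];
  for (const doc of documents) {
    this.add(doc);
  }
});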

search.js

const index = Index.load(data.RecipeIndex);
let rawsearch = index.search(q);

I know it's quite difficult to troubleshoot non-Latin languages, but my only questions are:

  1. What do the numbers under fieldVectors in the index mean? Are they related to relevance?
  2. Any hints for finding out why document A is not returned? I guess it's related to a low score, but I don't know how to figure it out.
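
A possible reading of the fieldVectors arrays, offered as an assumption about the lunr 2.x serialization format that should be checked against the lunr source: each array is a flat list of pairs, where the first number of a pair is a term's `_index` from invertedIndex and the second is that term's weight in the field. The sketch below decodes the two vectors from the excerpt under that assumption; `decodeVector` and `weightOf` are hypothetical helpers, not lunr APIs.

```javascript
// Field vectors copied from the index excerpt in the post.
const fieldVectors = [
  ["name/roasted-chicken", [3, 12.04, 4, 12.04, 5, 12.04, 6, 12.04]],
  ["name/lemon-salt-chicken", [11, 10.923, 12, 10.923, 13, 10.923, 14, 10.923, 15, 10.923]],
];

// Hypothetical helper: split the flat array into (termIndex, weight) pairs.
// Assumption: even positions hold term `_index` values, odd positions hold weights.
function decodeVector(flat) {
  const pairs = [];
  for (let i = 0; i + 1 < flat.length; i += 2) {
    pairs.push({ termIndex: flat[i], weight: flat[i + 1] });
  }
  return pairs;
}

// Weight of a given term `_index` in a given field vector, or null if absent.
function weightOf(vectors, fieldRef, termIndex) {
  const entry = vectors.find(([ref]) => ref === fieldRef);
  if (!entry) return null;
  const pair = decodeVector(entry[1]).find((p) => p.termIndex === termIndex);
  return pair ? pair.weight : null;
}

console.log(weightOf(fieldVectors, "name/lemon-salt-chicken", 14)); // 10.923
console.log(weightOf(fieldVectors, "name/roasted-chicken", 14));    // null
```

Under this reading, `_index` 14 (the token "雞") carries a weight only in the lemon-salt-chicken vector, and that weight equals the reported score of 10.923. The roasted-chicken vector has no entry for 14 at all, which points to a missing token rather than a low score.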

Any response would be much appreciated! Thanks!