olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License

Lunr with a large index (800,000 items) #222

Open astromme opened 8 years ago

astromme commented 8 years ago

I'm trying to use lunr to index CCEDICT in the browser. This takes over 15 seconds. I've attached a CPU Profile, it looks like most of the time is spent in lunr.SortedSet.add with a bunch in lunr.TokenStore.add and a big chunk of garbage collection. Is there anything I can do to speed up my indexing? Thanks!

My indexing code:

  var lineRegex = /(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+\/(.*)\/\s*$/;
  function parseCCEDICT(data) {
    dictionaryStore = {};
    dictionaryIndex = lunr(function () {
      this.field('simplified')
      this.field('traditional')
      this.field('pronunciation')
      this.field('definitions')
      this.ref('id')
    })

    var lines = data.split('\n');
    for (var i=0; i<lines.length; i++) {
      var line = lines[i];
      if (line.startsWith('#') || line === '') {
        continue; // skip comments and blanks
      }
      var match = lineRegex.exec(line);
      if (match !== null) {
        var entry = {
            simplified: match[1],
            traditional: match[2],
            pronunciation: match[3],
            definitions: match[4].split('/'),
            id: i
        }
        dictionaryIndex.add(entry);
        dictionaryStore[i] = entry;
      } else { console.log("Invalid line format for line: " + line); }
    }
  }

CCEDICT: http://www.mdbg.net/chindict/chindict.php?page=cedict

CPU-20160727T211933.cpuprofile.txt

olivernn commented 8 years ago

Thanks for the detailed issue!

I'm sure 15 seconds waiting to index in the browser seems like a lifetime, especially if this is done on the UI thread. That said, if the profile you included in the issue is for the full 800K documents, that works out at 22K 'inserts' per second (the CPU profile seemed to indicate indexing took ~35s). I don't think that is too bad, even if it is all in memory.

The simplest thing you can do is not do the indexing in the browser at all: assuming the data is fairly static, you can pre-build the index and then just load the serialised index. This doesn't make the indexing any quicker, it just caches the result. If you can't pre-build the index, then you can (and probably should) perform the indexing in a worker thread if possible.
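
Something along these lines, for example - just a sketch, and the worker file name, message format and lunr.js path are made up for illustration:

indexing-worker.js

// Runs off the main thread: builds the index from the entries it is sent
// and posts the serialised index back.
importScripts('lunr.js');

onmessage = function (e) {
  var idx = lunr(function () {
    this.field('simplified');
    this.field('traditional');
    this.field('pronunciation');
    this.field('definitions');
    this.ref('id');
  });

  // e.data is assumed to be the array of parsed CCEDICT entries
  e.data.forEach(function (entry) {
    idx.add(entry);
  });

  postMessage(JSON.stringify(idx));
};

On the main thread:

var worker = new Worker('indexing-worker.js');

worker.onmessage = function (e) {
  var dictionaryIndex = lunr.Index.load(JSON.parse(e.data));
  console.log(dictionaryIndex.search('hello'));
};

worker.postMessage(entries); // entries parsed from the CCEDICT file beforehand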

Okay, with the two 'easy' options out of the way, that leaves us with doing some real work. The CPU profile indicates that lunr.TokenStore#add is dominating the time. If you look at the source it is relatively simple: the token store is just a trie, and for each token the characters are iterated and inserted into the tree structure.

I can think of a couple of things to possibly make this method quicker:

  1. Remove the recursion; there is a non-trivial overhead in calling functions, and this would be the first thing I would try
  2. Stop keeping track of the number of tokens stored in the tree
  3. Cache the lookup of root[key]; it's accessed twice on every function call and the object might end up containing a lot of keys

I would imagine that removing the recursion would be the biggest gain here. That said, I'm doubtful (though hopeful) that there is much scope for improvement; it looks like each individual call to the function is taking around 0.1ms.
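
For reference, a non-recursive insert could look roughly like the sketch below. This is a simplified plain trie insert, not lunr's actual TokenStore internals; the node shape ({ docs, children }) and the function name are assumptions for illustration:

// Simplified, non-recursive trie insert: walk the token one character at a
// time, creating child nodes as needed, then attach the document at the
// final node. The node shape here is an assumption, not lunr's real one.
function addToken (root, token, doc) {
  var node = root;

  for (var i = 0; i < token.length; i++) {
    var char = token.charAt(i);

    if (!node.children[char]) {
      node.children[char] = { docs: {}, children: {} };
    }

    node = node.children[char];
  }

  node.docs[doc.ref] = doc;
}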

Let me know how those suggestions above work out, and if you do see some improvements then please submit a PR.

As an aside, thanks for introducing me to an interesting data set that I can use for stress testing lunr!

astromme commented 8 years ago

Thanks for the suggestions! Yep, the profile was for building the entire index, plus a few seconds overhead for loading the original dict file into memory.

For others following along in the future, I used node.js to create and dump my index; see the code at the end of this comment. This resulted in a 61 MB index.json and an 18 MB store.json, compressed to 9.5 MB and 4.5 MB respectively with gzip. The original CCEDICT is 8.4 MB uncompressed, 3.3 MB compressed.

Reading back in the index.json and setting up a Lunr object on my MacBook 12 (1.2 GHz, 2016) takes 5.6 seconds. Vastly better than the 35 seconds, though I would love to find a way to further reduce the time.

dump.js

var fs = require('fs');
var lunr = require('./lunr.js');
var path = require('path');

var filePath = path.join(__dirname, 'dicts/cedict_1_0_ts_utf-8_mdbg.txt');

fs.readFile(filePath, {encoding: 'utf-8'}, function(err,data){
    if (!err){
      var dict = parseCCEDICT(data);
      console.log("dumping dictionary");
      fs.writeFile("index.json", JSON.stringify(dict.index), function(err) {
          if(err) {
              return console.log(err);
          }
          console.log("wrote index.json");
      });
      fs.writeFile("store.json", JSON.stringify(dict.store), function(err) {
          if(err) {
              return console.log(err);
          }
          console.log("wrote store.json");
      });
    }else{
      console.log(err);
    }

});

var lineRegex = /(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+\/(.*)\/\s*$/;

function parseCCEDICT(data) {
  dictionaryStore = {};
  dictionaryIndex = lunr(function () {
    this.field('simplified')
    this.field('traditional')
    this.field('pronunciation')
    this.field('definitions')
    this.ref('id')
  })

  var lines = data.split('\n');
  for (var i=0; i<lines.length; i++) {
    if (i % 10000 == 0) {
      console.log("finished with " + i);
    }
    var line = lines[i];
    if (line.startsWith('#') || line === '') {
      continue; // skip comments and blanks
    }
    var match = lineRegex.exec(line);
    if (match !== null) {
      var entry = {
          simplified: match[1],
          traditional: match[2],
          pronunciation: match[3],
          definitions: match[4].split('/'),
          id: i
      }
      dictionaryIndex.add(entry);
      dictionaryStore[i] = entry;
    } else { console.log("Invalid line format for line: " + line); }
  }

  return { index: dictionaryIndex, store: dictionaryStore };
}

load.js (first you need to gzip the dumped index file)

var fs = require('fs');
var lunr = require('../firebase/public/lunr.js');
var path = require('path');
var zlib = require('zlib');

index = {};

fs.readFile('index.json.gz', function(err,data){
    if (!err){
      zlib.gunzip(data, (err, buffer) => {
        if (!err) {
          console.log("ungzipped");
          var indexDump = JSON.parse(buffer);
          index = lunr.Index.load(indexDump);
          console.log(index.search("hello"));
          console.log("index complete");
        } else {
          console.log("error ungzipping");
        }
      });

    }else{
      console.log(err);
    }

});

meltuhamy commented 7 years ago

You could also potentially use requestIdleCallback when building the index so that the main UI isn't blocked: https://developers.google.com/web/updates/2015/08/using-requestidlecallback
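
For instance, a rough sketch of feeding the documents to the index in small batches during idle time (the function name and batching approach are just illustrative, not part of lunr):

// Rough sketch: add entries to the index whenever the browser is idle, so
// long-running indexing doesn't block the UI thread.
function indexDuringIdleTime (idx, entries, done) {
  var i = 0;

  function work (deadline) {
    // keep indexing while there is idle time left in this frame
    while (i < entries.length && deadline.timeRemaining() > 0) {
      idx.add(entries[i]);
      i++;
    }

    if (i < entries.length) {
      requestIdleCallback(work);
    } else {
      done(idx);
    }
  }

  requestIdleCallback(work);
}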

olivernn commented 7 years ago

@meltuhamy requestIdleCallback looks pretty cool, thanks for the tip.

nkuehn commented 7 years ago

If you can live with (or work around) the browser restrictions, a shared web worker could be interesting for you ( https://developer.mozilla.org/en/docs/Web/API/SharedWorker ) - kind of a re-used thread shared across pages of the same domain and protocol. That way the browser would not re-load the index on every page switch.
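
Roughly something like the sketch below - the file name and message protocol are made up for illustration; the idea is that the worker loads the pre-built index once and every page just sends it queries:

search-worker.js

importScripts('lunr.js');

var index = null;

onconnect = function (e) {
  var port = e.ports[0];

  port.onmessage = function (msg) {
    if (msg.data.type === 'load' && index === null) {
      // load the serialised index once; later pages reuse it
      index = lunr.Index.load(msg.data.serialisedIndex);
    } else if (msg.data.type === 'search') {
      port.postMessage(index.search(msg.data.query));
    }
  };
};

On each page:

// the first page to connect should send { type: 'load', serialisedIndex: ... }
var worker = new SharedWorker('search-worker.js');
worker.port.onmessage = function (e) { console.log(e.data); };
worker.port.start();
worker.port.postMessage({ type: 'search', query: 'hello' });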

olivernn commented 7 years ago

@nkuehn that sounds quite promising.

Additionally, I'm working on a binary encoding for the index which I'm hoping will reduce the size of the index that needs to be sent across the wire.

This will also likely require more modern browsers; specifically, it will require support for ArrayBuffer and friends.

nkuehn commented 7 years ago

Binary sounds interesting, too.
But I would test thoroughly whether it really brings the improvements you're hoping for, especially after gzipping: gzipped sizes are often pretty comparable, so you may not have gained much while actually adding overhead, since it's not possible to implement faster deserialization in JS than the browser's native JSON deserialization.

And in any case you need to strip down the actual data structure to the bare minimum.