olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License
8.91k stars 546 forks source link

lots of unnecessary objects in invertedIndex (40% of memory) #316

Open d4l3k opened 6 years ago

d4l3k commented 6 years ago

I was trying to debug some large memory usage issues with lunr.js. I found that the lunr invertedIndex was roughly 138MB. By removing keys with empty objects I was able to cut 138MB down to 82MB which is a 40% savings!

2017-11-17-211858_616x576_scrot

2017-11-17-212119_1178x251_scrot

2017-11-17-212103_1181x252_scrot

Code to prune the index without breaking it using Proxy objects:

function pruneIndex (index) {
  const empty = {}
  const handler = {
    get: function (target, name, receiver) {
      return target[name] || empty
    }
  }

  for (const key of Object.keys(index.invertedIndex)) {
    const val = index.invertedIndex[key]
    for (const field of Object.keys(val)) {
      if (field === '_index') {
        continue
      }

      if (Object.keys(val[field]).length === 0) {
        delete val[field]
      }
    }
    index.invertedIndex[key] = new Proxy(val, handler)
  }
}

https://github.com/nwhacks/nwhacks2018_static/blob/master/components/select-hackers/select-hackers.js#L272

Presumably this would be a good idea to add into lunr.js since there's significant savings.

d4l3k commented 6 years ago

This is on v2.1.4 btw.

hoelzro commented 6 years ago

I'm guessing this is because an object is created for metadata, but that's not needed if builder.metadataWhitelist is empty: https://github.com/olivernn/lunr.js/blob/2dc9c6c6c41b1f5850f2bed0a82d8cd45835d166/lib/builder.js#L161-L176

olivernn commented 6 years ago

Yep, those objects are where metadata is stored for each term.

There is certainly scope for some substantial memory savings for in memory indexes here. The Proxy object solution is cool, but I'd like to keep ES5 compatibility for the time being.

There are a couple of approaches that might work here. The simplest to implement (I think) would be to re-use the same empty object if there is no metadata for an index. This would only require changes to the builder. Alternatively, instead of keeping an empty object, it could be replaced with null or similar, this would also require changes to the index.

d4l3k commented 6 years ago

I'm not sure if the memory usage is caused by the empty object, but more the number of duplicate key strings. I also experimented with just removing the empty keys and in the lunr.js code adding || {} anywhere it was used. Seemed to work fine, but I don't know the code base well enough to avoid any edge cases

hoelzro commented 6 years ago

@d4l3k Would you mind posting the script and data you're using to get these numbers? I'm curious to check it out!

d4l3k commented 6 years ago

I linked the script above https://github.com/nwhacks/nwhacks2018_static/blob/master/components/select-hackers/select-hackers.js

And I'm afraid I can't provide the data. It's the registration data for https://www.nwhacks.io/

drobnikj commented 4 years ago

Same issue here. It would be great to fix this. I decreased my container memory with pruneIndex function a lot.