High memory usage, because indices hold entire document

tsturzl commented 7 years ago

I'm digging through the source to address an issue we're facing with index performance (crippling CPU usage for upserts). I noticed that you are storing the entire document in the index rather than a reference to the data on disk (or in another storage API). This seems incredibly memory-expensive: one would expect the index to hold just the indexed value (as the key) and a reference to the document (as the value), rather than the entire document as the value.

This means it doesn't offer what you'd expect from an embedded database: the database size is limited by the amount of memory you have (or are willing to dedicate to your database). This seems really counterintuitive. Why not have the storage driver be a KV store, with the key acting as a reference to the document, which would be stored as the value (as JSON)? I don't expect this to have a huge performance impact, since the OS should be caching the file. Sure, storing the entire document in the index IS going to be faster every single time, but it's NOT practical in most use cases. Of course, none of this is easy without a proper memory-mapped file, but most of your storage backends (other than file storage) would basically handle this for you.
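To make the proposal concrete, here is a minimal sketch of an index that stores only document ids, with the documents themselves kept as JSON in a key-value store. Every name in it (`KVStore`, `ReferenceIndex`, `findByCity`) is hypothetical and does not come from NeDB's source, and the `Map` inside `KVStore` is only an in-memory stand-in for a backend that would keep the values on disk.

```javascript
// Hypothetical sketch: none of these names are part of NeDB.

// Stand-in for a key-value storage backend. A Map keeps the example runnable;
// the assumption is that a real backend would hold the JSON on disk instead.
class KVStore {
  constructor() {
    this.data = new Map();
  }
  put(id, doc) {
    this.data.set(id, JSON.stringify(doc));
  }
  get(id) {
    const raw = this.data.get(id);
    return raw === undefined ? undefined : JSON.parse(raw);
  }
}

// Index that maps an indexed field value to document ids, never to documents.
// A plain Map is used for brevity; it supports exact-match lookups only.
class ReferenceIndex {
  constructor(fieldName) {
    this.fieldName = fieldName;
    this.idsByValue = new Map(); // field value -> Set of document ids
  }
  insert(id, doc) {
    const value = doc[this.fieldName];
    if (!this.idsByValue.has(value)) {
      this.idsByValue.set(value, new Set());
    }
    this.idsByValue.get(value).add(id);
  }
  lookup(value) {
    return [...(this.idsByValue.get(value) || [])];
  }
}

const store = new KVStore();
const byCity = new ReferenceIndex('city');

const docs = [
  { _id: 'a1', name: 'Ann', city: 'Oslo' },
  { _id: 'b2', name: 'Bob', city: 'Lima' },
  { _id: 'c3', name: 'Cy', city: 'Oslo' },
];
for (const doc of docs) {
  store.put(doc._id, doc);
  byCity.insert(doc._id, doc);
}

// A query resolves ids from the index, then fetches each document from the store.
function findByCity(city) {
  return byCity.lookup(city).map((id) => store.get(id));
}
```

With this layout the index's memory footprint grows with the number of indexed values and ids rather than with the size of the documents; the cost is one extra store read per matching document.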

I'd say most people who use this have no idea that this is happening, hence the multiple issues filed complaining about incredibly high memory usage for large datasets: the entire dataset is stored in memory the moment you create a single index on the datastore. You should either fix this or mention it clearly in the README.

JMLX42 commented 7 years ago

This is a major issue. How can we fix this?

marcusjwhelan commented 7 years ago

@promethe42 Take a look at TeDB, created by myself and @tsturzl. Read through the documentation for instructions on its usage and capabilities. It has not been benchmarked yet.

JMLX42 commented 7 years ago

Thanks @marcusjwhelan. We've replaced NeDB with our own fork of LinvoDB3:

https://github.com/aerys/linvodb3

We're using Google's LevelDB as a backend. The DB is not fully in memory, and performance is acceptable. We've also added support for Mongoose via a driver:

https://github.com/aerys/mongoose-linvodb3

louischatriot / nedb

High memory usage, because indices hold entire document #506