mafintosh / hyperdb

Distributed scalable database
MIT License

Writing large quantities of data into Hyperdb can be incredibly slow #74

Open e-e-e opened 6 years ago

e-e-e commented 6 years ago

I have not looked at this since v3.0.0-0 was released, but I don't think this issue has been resolved. Looking at the code now, it looks like the small cache that was previously present has been removed.

As discussed on gitter last month:

I did a bit of profiling to see where the bottleneck is when importing data into hyperdb - it looks like it's raf.read().

 ticks parent  name
 49282   41.9%  node::Read(v8::FunctionCallbackInfo<v8::Value> const&)
 49282  100.0%    v8::internal::Builtin_HandleApiCall(int, v8::internal::Object**, v8::internal::Isolate*)
 49257   99.9%      LazyCompile: *fs.read fs.js:645:19
 49257  100.0%        LazyCompile: *RandomAccessFile._read /Users/x/hyper-graph-cli/node_modules/random-access-file/index.js:92:45
 48824   99.1%          LazyCompile: *Storage.getNode /Users/x/hyper-graph-cli/node_modules/hypercore/lib/storage.js:108:38
 48466   99.3%            LazyCompile: *Storage.dataOffset /Users/x/hyper-graph-cli/node_modules/hypercore/lib/storage.js:70:41

After further analysis it looks like it's caused by a cache overflow issue, but I think an unavoidable one unless you store all of hyperdb in memory (which is probably not a good idea) or do something super clever. The hyperdb writer already has a cache - https://github.com/mafintosh/hyperdb/blob/master/lib/writer.js#L25-L26. But when putting in new entries, it needs to get information about the previous nodes to construct the trie. On a large db this quickly results in evictions from the cache, as each new key has to fetch different nodes from the hypercore.
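
For context, here is a minimal sketch of the kind of write-heavy run that surfaces this, using hyperdb's documented put API. The storage path, key count, and key/value shapes are made up; running the script under node --prof and then node --prof-process produces a tick profile like the one quoted above.

```js
// Hypothetical write benchmark: time N sequential puts into a fresh hyperdb.
// Run as `node --prof bench-writes.js`, then `node --prof-process isolate-*.log`
// to get a tick profile like the one above.
var hyperdb = require('hyperdb')

var db = hyperdb('./bench.db', { valueEncoding: 'utf-8' }) // path is arbitrary
var NUM_KEYS = 100000

db.on('ready', function () {
  var start = Date.now()
  put(0)

  function put (i) {
    if (i === NUM_KEYS) {
      console.log('wrote %d keys in %d ms', NUM_KEYS, Date.now() - start)
      return
    }
    db.put('key/' + i, 'value-' + i, function (err) {
      if (err) throw err
      put(i + 1)
    })
  }
})
```

As the db grows, each put has to walk more of the existing trie, which is what drives the Storage.getNode / raf.read() calls dominating the profile.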

mafintosh commented 6 years ago

If you still have it, could you add the benchmark you used to write the data, so we can rerun it in the future (once this becomes a priority) and also review it?

mjp0 commented 6 years ago

I'm not sure how the issue manifested for @e-e-e, but running the benchmarking tool from pull request https://github.com/mafintosh/hyperdb/pull/83, I get very slow performance compared to leveldb.

Some examples:

# batch write numKeys: 10 with hyperdb
ok ~5.38 ms (0 s + 5377136 ns)

# batch write numKeys: 10 with leveldb
ok ~816 μs (0 s + 815669 ns)

# batch write numKeys: 10000 with hyperdb
ok ~507 ms (0 s + 507371342 ns)

# batch write numKeys: 10000 with leveldb
ok ~29 ms (0 s + 29185328 ns)

# single write numKeys: 100000 with hyperdb
ok ~58 s (57 s + 837967694 ns)

# single write numKeys: 100000 with leveldb
ok ~1.99 s (1 s + 989761075 ns)

The performance is "fast enough" for small, occasional updates but starts to break down rapidly with larger batches. It's not an entirely fair comparison, since leveldown uses C++ under the hood while hyperdb runs purely in JS, but JS shouldn't be that much slower.

Any ideas whether this is due to the JS VM or something in the code?
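
For anyone trying to reproduce the gap without checking out the PR, a rough sketch of the comparison follows. It assumes hyperdb's batch accepts level-style { type: 'put', key, value } entries (check the hyperdb README for the exact entry shape); the paths, key count, and values are made up.

```js
// Hypothetical batched-write comparison: hyperdb vs leveldb (via the `level` module).
// Paths, key count, and values are arbitrary.
var hyperdb = require('hyperdb')
var level = require('level')

var NUM_KEYS = 10000
var ops = []
for (var i = 0; i < NUM_KEYS; i++) {
  // Entry shape assumed to work for both modules; see each module's docs for the exact format.
  ops.push({ type: 'put', key: 'key/' + i, value: 'value-' + i })
}

var hyper = hyperdb('./bench-hyper.db', { valueEncoding: 'utf-8' })
var lvl = level('./bench-level.db')

var start = Date.now()
hyper.batch(ops, function (err) {
  if (err) throw err
  console.log('hyperdb batch: %d ms', Date.now() - start)

  start = Date.now()
  lvl.batch(ops, function (err) {
    if (err) throw err
    console.log('leveldb batch: %d ms', Date.now() - start)
  })
})
```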

mafintosh commented 6 years ago

We've only been focused on making reads fast (there is still work to do there), so there should be some easy wins left in writes. Note that because of the crypto needed to make this p2p, we'll probably never be as fast as level. Having the bench is great though :)
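
As a rough illustration of the crypto cost mentioned here: as far as I know each append is signed with ed25519 via sodium, and even the native binding has a measurable per-write price. The sketch below is not hypercore's actual write path, just a micro-benchmark of the signing primitive; message size and iteration count are arbitrary.

```js
// Micro-benchmark of ed25519 signing with sodium-native (not hypercore's actual
// write path; just the primitive it relies on). Sizes and counts are arbitrary.
var sodium = require('sodium-native')

var publicKey = Buffer.alloc(sodium.crypto_sign_PUBLICKEYBYTES)
var secretKey = Buffer.alloc(sodium.crypto_sign_SECRETKEYBYTES)
sodium.crypto_sign_keypair(publicKey, secretKey)

var signature = Buffer.alloc(sodium.crypto_sign_BYTES)
var message = Buffer.alloc(64) // stand-in for the data being signed per write

var N = 10000
var start = Date.now()
for (var i = 0; i < N; i++) {
  sodium.crypto_sign_detached(signature, message, secretKey)
}
console.log('%d signatures in %d ms', N, Date.now() - start)
```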

mjp0 commented 6 years ago

@mafintosh awesome to hear :+1: I suspected that crypto was probably affecting the results, and at least based on a very quick look, there's nothing in the code turning it off.