martinsumner / leveled

A pure Erlang Key/Value store - based on an LSM-tree, optimised for HEAD requests
Apache License 2.0

ETS and Snapshots #34

Closed by martinsumner 6 years ago

martinsumner commented 7 years ago

The Ledger Cache is kept small (up to about 2000 entries) as when a clone is required, this cache goes through ets:to_list to allow a snapshot to exist in the clone.

For short-lived 2i queries, this is probably a poor trade off. It would probably be better to run the query against the ETS table and push the results downstream to the clone, rather than push the whole cache.

DeadZen commented 7 years ago

What about a generated module? Could generate the entries and some query functions to scan them and delete the module when the cache expires perhaps.
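A minimal sketch of the generated-module idea, with hypothetical names throughout: the cached entries are embedded as a literal in a throwaway module compiled with `compile:forms/1`, so scans read constant data, and the module is purged when the cache expires. This is only an illustration of the technique, not anything from goldrush or leveled.

```erlang
-module(cache_mod_demo).
-export([build/2, drop/1]).

%% Compile a throwaway module Name exporting entries/0, which returns
%% the snapshot list embedded as a literal in the module's constant pool.
build(Name, Entries) ->
    Source = lists:flatten(
        io_lib:format(
            "-module(~w). -export([entries/0]). entries() -> ~w.",
            [Name, Entries])),
    {ok, Tokens, _} = erl_scan:string(Source),
    Forms = [begin {ok, F} = erl_parse:parse_form(Ts), F end
             || Ts <- split_forms(Tokens)],
    {ok, Name, Bin} = compile:forms(Forms),
    {module, Name} = code:load_binary(Name, "nofile", Bin),
    ok.

%% Split a flat token list into one token list per form (each ends in dot).
split_forms([]) ->
    [];
split_forms(Tokens) ->
    {Form, Rest0} =
        lists:splitwith(fun({dot, _}) -> false; (_) -> true end, Tokens),
    case Rest0 of
        [{dot, _} = D | Rest] -> [Form ++ [D] | split_forms(Rest)];
        [] -> [Form]
    end.

%% Unload and purge when the cache expires.
drop(Name) ->
    code:purge(Name),
    code:delete(Name).
```

After `build(my_cache_123, [{k1, v1}, {k2, v2}])`, a query function can call `my_cache_123:entries()` and scan the list without any per-query copy; `drop/1` removes the module once the cache is flushed.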

martinsumner commented 7 years ago

Do you have any pointers to an example of doing that sort of thing? I'd like to try and understand the idea better.

DeadZen commented 7 years ago

Sure, I have a library that does part of it: http://github.com/DeadZen/goldrush. It should not be too hard to make it store and scan data internally; I would have to make a PoC.

DeadZen commented 7 years ago

I'll work on it now, can't do anything else with AWS down atm .. ;p

martinsumner commented 7 years ago

OK. I'm still a bit unsure what you're suggesting, so I'm intrigued.

Just to be clear: the snapshot is taken of the cache for the 2i query so that the query will provide a view of the database at the point in time the query was initiated. I don't need to stream any new events that arrive after initiation of the query into the snapshot.

The Ledger Cache is an ets ordered set, and it gets flushed occasionally (after it has pushed itself to the Penciller). So by extracting from the set at the point of the query, the Bookie is free to flush the set, without any concern that there may be a clone pointing at the ets table expecting it to still be there.

The extraction at the moment is done by calling tab2list, and this takes about 1ms. However, I will have N/3 nodes doing this concurrently - so if there are a lot of 2i queries, all those 1ms CPU hits will add up. All I was intending to do was to iterate over the ets ordered set to extract only the results relevant to the query at the snapshot, on the basis that a few calls to ets:next will be cheaper than an ets:tab2list.
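The iteration described above can be sketched as a range scan over an `ordered_set` table using `ets:next/2`, collecting only keys within the query bounds rather than copying the whole table with `ets:tab2list/1`. Module and function names here are hypothetical, not leveled's own:

```erlang
-module(ets_range_demo).
-export([range/3]).

%% Collect all objects with StartKey =< Key =< EndKey from an ordered_set
%% table, walking the tree with ets:next/2 instead of copying everything.
range(Tab, StartKey, EndKey) ->
    First = case ets:member(Tab, StartKey) of
                true  -> StartKey;
                false -> ets:next(Tab, StartKey)
            end,
    collect(Tab, First, EndKey, []).

%% The '$end_of_table' clause must come first: for tuple keys the atom
%% would otherwise compare as less than EndKey and loop forever.
collect(_Tab, '$end_of_table', _EndKey, Acc) ->
    lists:reverse(Acc);
collect(_Tab, Key, EndKey, Acc) when Key > EndKey ->
    lists:reverse(Acc);
collect(Tab, Key, EndKey, Acc) ->
    [Obj] = ets:lookup(Tab, Key),
    collect(Tab, ets:next(Tab, Key), EndKey, [Obj | Acc]).
```

For example, with `Tab = ets:new(t, [ordered_set])` and `ets:insert(Tab, [{1,a},{2,b},{5,c}])`, `range(Tab, 1, 3)` touches only two keys, whereas `ets:tab2list/1` would copy all three regardless of the query bounds.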

DeadZen commented 7 years ago

OK, I was actually going to ask whether that was the case with needing to stream new events. That's good for sure.

martinsumner commented 7 years ago

https://github.com/martinsumner/leveled/pull/39

Will close this for now. Hopefully over the next few days I will have some proper basho_bench results to compare with leveldb.

martinsumner commented 7 years ago

In single-threaded tests, this does OK. But less so when running multiple parallel queries.

There is some possible concern about the cost of starting a query - i.e. what is the cost of a query on a near empty database with limited results to return? Is there too big an overhead of snapshotting?

DeadZen commented 7 years ago

How do I reproduce this test?

martinsumner commented 7 years ago

There's a basho_bench script I'm building for 2i. It is a work in progress at the moment. I will upload it tomorrow evening.

martinsumner commented 7 years ago

Here's the current state of the script (sorry, I forgot about posting this):

https://github.com/martinsumner/basho_bench/blob/mas-nhsload/src/basho_bench_driver_riakc_pb.erl

I'm going to do some more work on this Thursday.

There is a bunch of optimisation work I've undertaken which is all merged into this branch:

https://github.com/martinsumner/leveled/tree/mas-pushmem-i46/src

This has really been about improving the performance, not under query load, but under the update-related load when adding more indexes.

The big differences come from this commit:

https://github.com/martinsumner/leveled/commit/2b0ec1d9cce9718283a2423265f8145275a9819f#diff-25d902c24283ab8cfbac54dfa101ad31

There is a bit of a convoluted history around what made the difference. There are two parts - one allowing for slots that contain index entries to have twice as many keys (which reduces the pace of merge activity in the tree), and secondly implementing that in a way which reduces needless object copying by the beam.

Part of the issue is that I'd made a rookie mistake in assuming that all large objects are passed as references, not just large binaries. So there has been a general history of performance-related issues from message passing large objects between processes in leveled.

Prior to this, running a 6-hour test with multiple 2i index updates for every object led to the beam occupying 40% of RAM by the end of the test. Now the same test ends with the beam using just 10% of RAM - with all that saved memory being used by the page cache, and presumably CPU time saved in garbage collection.

martinsumner commented 7 years ago

I've rolled through a number of ideas to try and improve this, and they all hit the buffers at some stage.

One thing to note is that although the manifest is small - it must be copied very aggressively for 2i queries. When volume testing at 200 2i queries per second, there are nearly 5000 manifest copies being generated every second. Each individual version of the manifest will be copied thousands of times.

Although the manifest is small, this seems wasteful. Instead, for the first snapshot for any given process a separate worker should be started - and subsequent snapshots for that manifest should use the same worker.
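One way to sketch the proposed sharing, under the assumption of a registry keyed by manifest sequence number (all names here are hypothetical): the first snapshot for a given manifest version spawns a worker holding the one manifest copy, and subsequent snapshots for that version reuse the worker rather than triggering a fresh copy.

```erlang
-module(manifest_share_demo).
-export([get_worker/2, worker_loop/1]).

%% Registry is a map of ManifestSQN -> worker pid, e.g. kept in the
%% Penciller's state. fetch_manifest/1 stands in for however the real
%% manifest copy would be obtained.
get_worker(SQN, Registry) ->
    case maps:find(SQN, Registry) of
        {ok, Pid} ->
            %% Reuse the existing worker: no fresh manifest copy is made.
            {Pid, Registry};
        error ->
            Pid = spawn_link(?MODULE, worker_loop, [fetch_manifest(SQN)]),
            {Pid, maps:put(SQN, Pid, Registry)}
    end.

%% The worker owns the single manifest copy and answers query requests
%% from any number of snapshots of that manifest version.
worker_loop(Manifest) ->
    receive
        {query, From, Ref, QueryFun} ->
            From ! {Ref, QueryFun(Manifest)},
            worker_loop(Manifest)
    end.

fetch_manifest(_SQN) ->
    placeholder_manifest.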

martinsumner commented 7 years ago

The change to share manifests across snapshots didn't make a huge difference the first time I tested it, but I was busy rolling a lot of changes at the time.

I will re-test this. I'm also considering a broader overhaul of actor organisation within the Penciller. Perhaps L0 and the manifest should live under a dedicated actor for handling those things, and all queries (including mainstream non-clone queries) would be handled by runners which only hold the level-zero index. So there would be a permanent runner (using runner as another actor name, as in the Bookie's runner) for non-snapshot requests, and runner clones for query requests. The actor holding the L0 state would have to support snapshotting of requests and handle multiple versions (of both L0 and the manifest).

martinsumner commented 7 years ago

Tried to have a background accumulator of the penciller memory which could be used in queries ... but this didn't help.

https://github.com/martinsumner/leveled/tree/mas-pmemacc-i34/src

A quick volume test with 2i showed a significant reduction in throughput when compared to master.

martinsumner commented 7 years ago

I'm of the opinion that I should defer work on this until we can test Leveled end-to-end with Riak in OTP 20.0 (and perhaps 20.0 ++ the I/O scheduler changes). Time might be a healer.

Actual testing hasn't highlighted this as a major issue. Also, despite lots of thinking, I'm devoid of any ideas to resolve it that don't have nasty side effects.

I'm moving this to the back-burner for now

martinsumner commented 6 years ago

When testing in OTP 20/21 there appear to be significant speed-ups, especially when running ct tests.

In the target environment (NHS/Riak) things are ultimately slowed by disk - so enhancements here don't appear to be relevant. Also 2i query response times show no issues.

Won't fix - for now.