martinsumner / leveled

A pure Erlang Key/Value store - based on an LSM-tree, optimised for HEAD requests
Apache License 2.0

Supervision of Clerks, Supervision in General #29

Closed: martinsumner closed this issue 3 years ago

martinsumner commented 7 years ago

During early volume tests there was an issue whereby the Penciller's Clerk would take the same work twice. It would do the work the second time as expected, but when the Penciller called back to prompt deletions, the dictionary for that Manifest SQN had already been emptied.

This caused the Penciller's Clerk to crash. As this is not a supervised process, the vnode carried on working, but without merging new entries. This worked fine until the Penciller process hit high memory watermarks and crashed. The absence of the Penciller process caused the Bookie to crash on the next call, and then this crashed the vnode process.

Riak Core then worked as expected: the vnode restarted, which restarted the Bookie, which reloaded the lost Penciller state from the Ledger - and everything went back to normal.

[Image: summary]

In this test two vnodes were impacted at the start of the test, and the impact of the restart can be seen at the end.

This was initially resolved by placing a soft lock in the Penciller, via the State#state.work_ongoing boolean - so that while work is ongoing the Penciller should never tee up more work.

However, this has still happened once since then.

There are a number of things to fix:

- The problem could be made to go away by swapping the dict:fetch for a dict:find (the difference is sketched below), but I think it is right for this to crash, as it is an unexpected event.
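
For illustration, a minimal sketch of that difference, using hypothetical keys and file lists - dict:fetch/2 raises an exception when the key is absent, whereas dict:find/2 returns a value that can be matched on:

```erlang
%% A minimal sketch (hypothetical data), where the manifest SQN is a
%% dict key mapping to files pending deletion.
D = dict:store(1, [file_a, file_b], dict:new()),

%% dict:fetch/2 raises an exception when the key is absent - so a
%% repeated prompt for an already-emptied manifest SQN crashes the
%% Clerk:
%%   dict:fetch(2, D)

%% dict:find/2 returns the atom 'error' instead, which the caller
%% could match on and carry on:
FilesToDelete =
    case dict:find(2, D) of
        {ok, Files} -> Files;
        error -> []
    end.
```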

martinsumner commented 7 years ago

With regard to the clerks, it is not clear why they are permanent actors. They don't hold any significant state beyond the current job they're running - and they require prompting to do any work.

So I think it would be worth looking at starting and stopping clerks on a job-by-job basis.
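
A hedged sketch of what that might look like (hypothetical names, not leveled's actual clerk API) - spawn a linked, short-lived clerk per merge job, and let it terminate normally when the job completes:

```erlang
%% Hypothetical sketch: since a clerk holds no state beyond its
%% current job, it could live only for the duration of that job.
run_job(Penciller, Work) ->
    %% linked, so a crash propagates to the Penciller rather than
    %% leaving an unsupervised, silently-dead clerk behind
    spawn_link(
        fun() ->
            %% do_merge_work/1 is a placeholder for the merge logic
            Result = do_merge_work(Work),
            %% prompt the Penciller, then terminate normally
            Penciller ! {merge_complete, Result}
        end).
```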

martinsumner commented 6 years ago

To try to understand better what to do here, I thought it would be worth looking at how other Riak backends work in terms of supervision - and what I can learn from them.

martinsumner commented 6 years ago

Bitcask

Starting the bitcask app starts a supervisor which supervises two workers - a merge worker and a merge delete worker. When bitcask wants to prompt a merge, these workers are called via their locally registered names.
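
In OTP terms that shape is roughly as follows (a sketch, not bitcask's actual supervisor module; the workers register their names locally in their own start_link):

```erlang
%% Sketch of a one_for_one supervisor with two permanent workers.
init([]) ->
    MergeWorker =
        {bitcask_merge_worker,
         {bitcask_merge_worker, start_link, []},
         permanent, 5000, worker, [bitcask_merge_worker]},
    MergeDeleteWorker =
        {bitcask_merge_delete,
         {bitcask_merge_delete, start_link, []},
         permanent, 5000, worker, [bitcask_merge_delete]},
    {ok, {{one_for_one, 10, 10}, [MergeWorker, MergeDeleteWorker]}}.
```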

The actual bitcask work itself doesn't depend on starting a process - there is no bitcask server. On startup no pid is generated: the files are opened and the key map is built up by a function which creates the state of the bitcask instance, a reference is created for that instance, and the state is stored against the reference. The reference is passed back to the caller of open, and that reference is then used by the parent process when calling other functions. For example, a get request is passed the reference, and the state is fetched from the reference to fulfil the get. The work to perform the GET is managed by the process calling GET, not passed to a bitcask worker process.
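
The calling pattern looks roughly like this (a sketch using the public bitcask API, error handling omitted):

```erlang
%% open/2 returns a reference, not a pid; the calling process then
%% does the work itself on each request.
Ref = bitcask:open("/tmp/bitcask_data", [read_write]),
ok = bitcask:put(Ref, <<"key">>, <<"value">>),
{ok, <<"value">>} = bitcask:get(Ref, <<"key">>),
ok = bitcask:close(Ref).
```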

Files that are opened by bitcask have a filestate record that contains their file descriptor. The filestate records are stored within the state record, which is in turn attached to the reference. So unlike leveled there is not a mapping of process to file. This means that for snapshots, the snapshot just calls open and opens up all the files again (i.e. a snapshot will have a new set of file handles).

So bitcask doesn't really have a supervision tree, because, other than the merge workers, there are no processes to supervise.

martinsumner commented 6 years ago

Also with bitcask: when multiple bitcask databases are in action on the same node, it looks like this is handled by starting the bitcask application only once - and swallowing the already_started condition when it occurs: https://github.com/basho/bitcask/blob/develop/src/bitcask.erl#L1143-L1152. I assume, therefore, that on each node there will be only one merge_worker and one merge_delete_worker started by the supervisor (so multiple vnodes cannot perform merges at the same time).
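
The pattern in the linked code is roughly this (a paraphrase, not a verbatim copy):

```erlang
%% Starting the application is made idempotent, so each additional
%% vnode reuses the merge workers already running on the node.
ok = case application:start(bitcask) of
         ok -> ok;
         {error, {already_started, bitcask}} -> ok
     end.
```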

The merge_worker appears to have a queue. When riak_kv_bitcask_backend gets a callback to merge, it checks the status of the queue (https://github.com/basho/riak_kv/blob/2.1.7/src/riak_kv_bitcask_backend.erl#L469-L483), and only checks for a required merge if the queue is empty. When an actual merge is requested, a worker pid is spawned to perform the merge, and any other work will be queued if a worker PID has already been spawned for some outstanding work.
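
In outline, the gating looks something like this (bitcask:needs_merge/1 is the real API; the queue helpers here are hypothetical):

```erlang
%% Sketch of the gating described above: only check whether a merge
%% is required when the merge queue is empty.
maybe_merge(Ref) ->
    case merge_queue_length() of          % hypothetical helper
        0 ->
            case bitcask:needs_merge(Ref) of
                {true, Files} -> queue_merge(Ref, Files);  % hypothetical
                false -> ok
            end;
        _ ->
            ok  % work already queued; don't pile more on
    end.
```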

Folds for bitcask effectively re-open bitcask from scratch in read only mode, generating a new reference: https://github.com/basho/riak_kv/blob/2.1.7/src/riak_kv_bitcask_backend.erl#L348.

Closing a bitcask just:

1 - erases the state from the reference
2 - closes the write file (if one was opened for writing)
3 - closes all the read files (note that files in snapshots have separate file handles from the actual vnode bitcask instance).

martinsumner commented 6 years ago

HanoiDB

HanoiDB starts a new gen_server when hanoidb:open is called (and the riak backend code for hanoidb uses open rather than the open_link alternative - https://github.com/basho-labs/riak_kv_hanoidb_backend/blob/master/src/riak_kv_hanoidb_backend.erl#L111).

When a file is being written, a new process is started to write the file. When a snapshot is started a new fold_worker is started (plain_fsm). The writer is started and linked to the worker process that started it (not via a supervisor). The fold_worker sets up a monitor of the hanoidb gen_server process that spawned it.

So HanoiDB doesn't use supervisors; it does have multiple worker processes, but relies on monitors and links between those processes where necessary.
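
A minimal sketch of that style (hypothetical names throughout): the writer is linked to its parent, while the fold worker monitors the server that spawned it:

```erlang
%% Sketching links and monitors in place of a supervision tree.
start_writer(Job) ->
    %% linked: if the parent dies the writer dies too (and vice versa)
    spawn_link(fun() -> write_new_file(Job) end).

start_fold_worker(ServerPid, Fold) ->
    spawn(fun() ->
              %% monitor the server that spawned us, so this worker
              %% can stop cleanly if that server goes away
              MRef = erlang:monitor(process, ServerPid),
              fold_loop(MRef, Fold)
          end).
```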

martinsumner commented 6 years ago

Leveled

Just some notes on leveled, wrt supervisors:

The temptation at the moment is just to substitute start_link for start, and then build better handling for different file corruption scenarios into the cdb/sst code itself.
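
The difference, sketched with a hypothetical my_worker gen_server module:

```erlang
start_worker(Args) ->
    %% gen_server:start/3 gives an unlinked worker: if it crashes
    %% later, the opener only notices when a subsequent call fails.
    {ok, _Unlinked} = gen_server:start(my_worker, Args, []),
    %% gen_server:start_link/3 links the worker to its parent, so a
    %% crash propagates immediately (or can be trapped and handled).
    {ok, Linked} = gen_server:start_link(my_worker, Args, []),
    Linked.
```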

martinsumner commented 6 years ago

This has now switched to using start_link rather than start (except when starting clones, which are not linked).

https://github.com/martinsumner/leveled/pull/150

martinsumner commented 3 years ago

This has now been demonstrated to be a robust approach. So although this might not be the idealised setup for an Erlang project, there is no motivation to address this as technical debt.