sociomantic-tsunami / dmqproto

Distributed Message Queue protocol definition, client, fake node, and tests
Boost Software License 1.0

Consume: prevent loss of data in write buffers / pending batches #16

Open · gavin-norman-sociomantic opened 6 years ago

gavin-norman-sociomantic commented 6 years ago

The neo Consume request has the facility for the client to initiate a clean shutdown, ensuring that all records that have already been popped from the channel are flushed to the client before the request ends.

There are, however, other situations where records can be lost:

We should consider ways to fix these cases.

(Created here, not in dmqnode, as the fix will likely require both changes to the storage engine and accompanying protocol changes.)

gavin-norman-sociomantic commented 6 years ago

One idea would be to modify the process of popping a record from the storage engine. Rather than a destructive pop, it could be a two-step process:

  1. "Reserve" the record, so that no other requests can access it.
  2. When finished, tell the storage engine to either remove the record from the channel (if the request succeeded) or to mark it as unreserved (if the request failed).

That way, the ordering of records in the channel would not be altered.
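The two-step pop above can be sketched as a reserve/commit/rollback API. This is a minimal illustration, not the dmqnode storage engine; all names (`Channel`, `reserve`, `commit`, `rollback`) are hypothetical. The key property is that a failed request returns the record to its original position in the channel:

```python
import itertools


class Channel:
    """Illustrative sketch of a two-phase pop: a record is first
    reserved so no other request can see it, then either removed
    (request succeeded) or unreserved (request failed), leaving
    the channel's record order intact."""

    _handles = itertools.count()

    def __init__(self):
        # Each entry is [handle, value, reserved?], in arrival order.
        self._records = []

    def push(self, value):
        self._records.append([next(self._handles), value, False])

    def reserve(self):
        """Step 1: hand out the oldest unreserved record, or None."""
        for rec in self._records:
            if not rec[2]:
                rec[2] = True
                return rec[0], rec[1]
        return None

    def commit(self, handle):
        """Step 2a: the request succeeded -- remove the record."""
        self._records = [r for r in self._records if r[0] != handle]

    def rollback(self, handle):
        """Step 2b: the request failed -- make the record visible
        again, still in its original position."""
        for rec in self._records:
            if rec[0] == handle:
                rec[2] = False
```

For example, reserving a record and rolling it back means the next `reserve` hands out the same record again, so nothing is lost or reordered if a Consume request dies mid-flight.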

david-eckardt-sociomantic commented 6 years ago

> the fix will likely require both changes to the storage engine and accompanying protocol changes

To end Consume requests when a node is shut down, a protocol change is not necessary: flushing the pending batch and sending Stopped is sufficient, and the client will handle that accordingly.
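The flush-then-Stopped sequence can be sketched as below. The `ConsumeRequest` class and its field names are illustrative stand-ins, not the real dmqnode request handler; the point is only the ordering: buffered records go out before the request is ended.

```python
from dataclasses import dataclass, field


@dataclass
class ConsumeRequest:
    """Minimal stand-in for a node-side Consume request handler
    (names are illustrative, not the dmqnode API)."""
    sent: list = field(default_factory=list)       # records delivered to client
    pending_batch: list = field(default_factory=list)
    stopped: bool = False

    def flush_and_stop(self):
        # First flush the pending batch so buffered records are not lost...
        if self.pending_batch:
            self.sent.extend(self.pending_batch)
            self.pending_batch.clear()
        # ...then end the request with the existing Stopped message,
        # which the client already treats as a clean end.
        self.stopped = True
```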

Currently there is no support for waiting for active requests to end cleanly.

An easier way of accomplishing this could be to use a timeout rather than waiting for requests to report that they have ended. On SIGINT or SIGTERM, active requests could be notified somehow; the node then sleeps for, say, a second and finally proceeds with shutting down.
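A sketch of that timeout-based shutdown, under the assumption that requests expose some notification hook (the `FakeConsume` class and `shutdown` function here are purely illustrative):

```python
import time


class FakeConsume:
    """Stand-in for an active Consume request (illustrative only)."""

    def __init__(self, finishes_on_notify):
        self.finishes_on_notify = finishes_on_notify
        self.finished = False

    def notify_shutdown(self):
        # A real request would flush its pending batch and send
        # Stopped here; this fake just records whether it finished.
        if self.finishes_on_notify:
            self.finished = True


def shutdown(requests, grace_seconds=1.0, poll=0.02):
    """Notify all active requests, wait up to a grace period for
    them to finish, then proceed with shutdown regardless.
    Returns any requests that did not finish in time."""
    for r in requests:
        r.notify_shutdown()
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline and not all(r.finished for r in requests):
        time.sleep(poll)
    return [r for r in requests if not r.finished]
```

The design trade-off is as described above: the node never blocks indefinitely on a stuck request, at the cost of possibly cutting off a slow one when the grace period expires.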

"Reserve" the record, so that no other requests can access it.

This could be done at the storage engine level: on shutdown, the storage engine would create the batch and write it to a dump file on disk if it hasn't been sent. On the other hand, in many cases (that is, when there are sufficiently many records in the storage) the batch is removed from the storage immediately before sending, so that wouldn't help if the shutdown intercepts sending; and there is no feedback from MessageSender about when the data have actually been sent -- they can still be buffered. It sounds like a very tricky feature to implement.
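The dump-file half of this idea is straightforward on its own; a minimal sketch follows, with an illustrative JSON format and function names (the real batches are binary, and the caveat stands: without send feedback the node cannot know which buffered records actually reached the client):

```python
import json
import os


def dump_unsent_batch(batch, path):
    """On shutdown, persist a batch that was already removed from
    storage but not yet sent, so it survives the restart."""
    if batch:
        with open(path, "w") as f:
            json.dump(batch, f)


def load_unsent_batch(path):
    """On startup, recover any dumped batch (and consume the dump
    file) so its records can be re-queued ahead of newer ones."""
    if not os.path.exists(path):
        return []
    with open(path) as f:
        records = json.load(f)
    os.remove(path)
    return records
```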

david-eckardt-sociomantic commented 6 years ago

> active requests could be notified somehow

That could be a listener event, raised by the storage engine when it shuts down (Nemanja's idea).

gavin-norman-sociomantic commented 6 years ago

> This could be done at the storage engine level: on shutdown, the storage engine would create the batch and write it to a dump file on disk if it hasn't been sent. On the other hand, in many cases (that is, when there are sufficiently many records in the storage) the batch is removed from the storage immediately before sending, so that wouldn't help if the shutdown intercepts sending; and there is no feedback from MessageSender about when the data have actually been sent -- they can still be buffered. It sounds like a very tricky feature to implement.

Ahh yes, I was only thinking about the in-memory queue. Having to also handle the disk queue would indeed make this more complicated.