faithanalog opened 4 months ago
The flush timeout does still play a role in preventing job pileup. For replay to work, we keep a list of all jobs since the last ack'd flush, and if a downstairs goes away and comes back, we replay all the work since that last confirmed flush. The frequent flushes let us discard that older work rather than keep it all in memory.
So we would need to find a new way for replay to work correctly if we want to increase the flush timeout.
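For concreteness, here is a minimal sketch of that bookkeeping: jobs accumulate until a flush has been acked by every downstairs, at which point everything up to and including that flush can be dropped. The types and names are hypothetical, not the actual upstairs structures.

```rust
// Hypothetical sketch of the replay bookkeeping described above;
// not the real Crucible types.
use std::collections::VecDeque;

#[derive(Debug)]
enum Job {
    Write { id: u64 },
    Flush { id: u64 },
}

#[derive(Default)]
struct ReplayBuffer {
    // Every job issued since the last flush that all downstairs acked.
    pending: VecDeque<Job>,
}

impl ReplayBuffer {
    // Remember a job so it can be resent if a downstairs reconnects.
    fn record(&mut self, job: Job) {
        self.pending.push_back(job);
    }

    // Once a flush with `flush_id` is acked by every downstairs, all work up
    // to and including that flush is durable and no longer needs replaying.
    fn flush_acked_by_all(&mut self, flush_id: u64) {
        while let Some(job) = self.pending.front() {
            let done = matches!(job, Job::Flush { id } if *id == flush_id);
            self.pending.pop_front();
            if done {
                break;
            }
        }
    }

    // A reconnecting downstairs gets everything still in the buffer.
    fn jobs_to_replay(&self) -> impl Iterator<Item = &Job> {
        self.pending.iter()
    }
}
```

The longer the flush timeout, the longer this buffer grows between trims, which is the memory concern mentioned above.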
Background:
Upstairs periodically sends flush commands to Downstairs, even if the guest is not asking for disk flushes. Originally this was every 5 seconds. I believe, though am not sure, that it was added to prevent jobs piling up in the upstairs queue.
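As a rough illustration of that behavior (not the actual upstairs code), the automatic flush is essentially a timer that injects a flush on the guest's behalf; `send_flush` and the task shape below are placeholders.

```rust
// Rough sketch of the automatic-flush idea: if the guest has not flushed
// within the timeout, the upstairs injects a flush itself.
use std::time::Duration;
use tokio::time::{interval, MissedTickBehavior};

async fn periodic_flush_task(flush_timeout: Duration) {
    let mut ticker = interval(flush_timeout);
    ticker.set_missed_tick_behavior(MissedTickBehavior::Delay);
    loop {
        ticker.tick().await;
        // The real upstairs would only do this if there is outstanding work
        // and the guest has not issued a flush of its own recently.
        send_flush().await;
    }
}

async fn send_flush() {
    // Placeholder for submitting an upstairs-generated flush to downstairs.
}
```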
We adjusted this to 0.5 seconds back when we were still using the SQLite backend: https://github.com/oxidecomputer/crucible/blob/8757b3fb55a1763382ad7111144bc60ec72af23d/upstairs/src/lib.rs#L9675
We adjusted to 0.5 because, at the time, an extent flush required a lot of work clearing old metadata contexts out of the SQLite database. That work scaled directly with the number of writes that had hit an extent since the last flush, and it was causing some pretty terrible latency bubbles that we wanted to avoid. Bryan did some testing and found that 0.5 seconds gave the best results ( https://github.com/oxidecomputer/crucible/issues/757 ). Note that this was also before fast-write-ack.
And it remains 0.5 seconds to this day:
https://github.com/oxidecomputer/crucible/blob/7d6c7e1e71d0b389999be06515db855bf273989e/upstairs/src/upstairs.rs#L396
This is configurable in the VCR, but Nexus is written to always pass None and accept our default.
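A hedged sketch of that override-or-default pattern, using a simplified stand-in for the real VCR/options types; only the 0.5-second default comes from the text above.

```rust
// Simplified stand-in for the VCR option plumbing; field and type names
// are illustrative, not the actual Crucible structs.
use std::time::Duration;

const DEFAULT_FLUSH_TIMEOUT: Duration = Duration::from_millis(500);

struct Opts {
    // None (what Nexus passes today) means "use the upstairs default".
    flush_timeout: Option<f32>,
}

fn effective_flush_timeout(opts: &Opts) -> Duration {
    opts.flush_timeout
        .map(Duration::from_secs_f32)
        .unwrap_or(DEFAULT_FLUSH_TIMEOUT)
}
```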
Why we might want to change it
Well, for one thing, we are not on the SQLite backend anymore, and our new backend has different flush performance characteristics. But also, we may be sending more fsyncs to ZFS than the guest actually cares about, and those have a cost.
Questions