graydon opened this issue 5 years ago (status: Open)
Thinking & researching a bit more about this: ASIO (annoyingly) seems to lack the ability to tell us whether it has IO work to do without actually doing a unit of it (I suspect this is due to its requirement to work atop Windows IOCP), and so we can't "yield from work when it's time to do IO". So I think we might need to do the simpler thing(s): an `io_context.poll_one()` call such that a long-running posted-work unit could yield after a real-time scheduling quantum. Then we can measure and warn on polls that last too long -- assuming that the IO-triggered ones will all naturally complete quickly, and the posted-work ones will artificially yield within a quantum. In `VIRTUAL_TIME` mode I guess we'll just schedule a clock-advance pseudo-event one quantum ahead, every time we yield.

Another place this general family of issues crops up is in the simulation code, where we used to run multiple applications on the same virtual clock but have (as of https://github.com/stellar/stellar-core/pull/1390) switched to using multiple virtual clocks and a secondary outer advance loop. I think this is probably the wrong approach; what I'd far prefer to see is a world with a single advancing clock per simulation and (again) a dual-queue structure, where the application has a notion of work-units-to-do and the ASIO queue makes only a single call per timeslice into the application to "do a unit of work".
(This would also let us more faithfully represent the sorts of simulation tasks that the various "crank-some" and "crank-until" helpers are trying to do -- we could accurately differentiate waiting for a barrier at the application level from waiting for a barrier at the IO level.)
Note: a fair amount of work has been done in this direction in PR https://github.com/stellar/stellar-core/pull/2501
Two remaining points came up while reviewing that:
This is a tracking bug for a handful of related issues, which may or may not get a unified treatment. Picking up from #591. Additional dependent bugs should be filed as necessary / as the issues are addressed.
Roughly speaking: we block the main thread too easily, which causes TCP connections to idle out / drop data, and/or the SCP state machine to take too long to track/participate in voting, at which point it feels that it's desynchronized and we fall back out to running catchup. There are a few egregious cases and a few approaches we might consider.
- `BucketListIsConsistentWithDatabase` is a blocking call of unbounded duration. At the very least this should be made into a `Work` class that can be stepped along.
- `Maintainer`
is theoretically controlled by dials the user can set, but they can be set to values that cause blocking, and are in general hard to set right.
- `LedgerTxnImpl::deleteObjectsModifiedOnOrAfterLedger` is unbounded and can cause catchup to stall out; it should be broken up into `Work`.
- `BucketApplicator::advance`
is fixed to the same batch size (`LEDGER_ENTRY_BATCH_COMMIT_SIZE`) as `LedgerTxnImpl`, but this might be too big if we're doing a lot of these in the middle of a catchup. Possibly a better design involves being able to ask the ASIO queue / `VirtualClock` whether we've run the current work unit for long enough that we've exhausted a virtual time slice and should yield -- ideally in some way that preserves as much determinism as possible (especially when running in `VIRTUAL_TIME` mode; e.g. it should answer "yes" if there's any other pending IO or enqueued work scheduled).
- `work/WorkScheduler.cpp`
currently calls `self->mApp.getClock().getIOContext().post()` directly, rather than interacting with the delayed-work queue. It might make sense to move some or all of `Work`'s work to the delayed-work queue.
- `VirtualClock::crank`
prioritizes up to one block (currently 100 elements) of work in the ASIO `io_service` internal event queue before adding everything enqueued in its own delayed-work queue (`mDelayedExecutionQueue`). This means that if a lot of work has built up on the delayed-work queue, it all gets executed at once, ahead of the next non-delayed work in ASIO -- not ideal. Better would be a design that considers work latency more explicitly, and ensures it only takes a bounded maximum time-slice away from latency-sensitive ASIO "actual IO" events to do any posted work ("delayed" or otherwise).
- `LedgerTxnImpl`
when talking to the database is fixed at `LEDGER_ENTRY_BATCH_COMMIT_SIZE` rows per batch; this is harder to change, not because the number is hard to change but because changing it doesn't change the synchronous nature of a commit, which semantically blocks its caller -- it happens midway through `EXTERNALIZE` and shouldn't be interleaved with other network-triggered events aside from incoming or outgoing buffer management. So again, this possibly suggests splitting network service into its own thread, but it says nothing about (and is in fact representative of!) the issue of SCP state transitions being real-time latency-sensitive.