graydon opened this issue 5 years ago (status: Open)
Thinking & researching a bit more about this: ASIO (annoyingly) seems to lack the ability to tell us whether it has IO work to do without actually doing a unit of it (I suspect this is due to its requirement to work atop Windows IOCP), and so we can't "yield from work when it's time to do IO". So I think we might need to do the simpler thing(s): an `io_context.poll_one()` call such that a long-running posted-work unit could yield after a real-time scheduling quantum. Then we can measure and warn on polls that last too long -- assuming that the IO-triggered ones will all naturally complete quickly, and the posted-work ones will artificially yield within a quantum. In `VIRTUAL_TIME` mode I guess we'll just schedule a clock-advance pseudo-event one quantum ahead, every time we yield.

Another place this general family of issues crops up is in the simulation code, where we used to run multiple applications on the same virtual clock but have (as of https://github.com/stellar/stellar-core/pull/1390) switched to using multiple virtual clocks and a secondary outer advance loop. I think this is probably the wrong approach; what I'd far prefer to see is a world with a single advancing clock per simulation and (again) a dual-queue structure, where the application has a notion of work-units-to-do and the ASIO queue makes only a single call per timeslice into the application to "do a unit of work".
(This would also let us more faithfully represent the sorts of simulation tasks that the various "crank-some" and "crank-until" helpers are trying to do -- we could accurately differentiate waiting for a barrier at the application level from waiting for a barrier at the IO level.)
Note: a fair amount of work has been done in this direction in PR https://github.com/stellar/stellar-core/pull/2501
Two remaining points came up while reviewing that:
This is a tracking bug for a handful of related issues, which may or may not get a unified treatment. Picking up from #591. Additional dependent bugs should be filed as necessary / as the issues are addressed.
Roughly speaking: we block the main thread too easily, which causes TCP connections to idle out / drop data, and/or the SCP state machine to take too long to track/participate in voting, at which point it feels that it's desynchronized and we fall back out to running catchup. There are a few egregious cases and a few approaches we might consider.
- `BucketListIsConsistentWithDatabase` is a blocking call of unbounded duration. At the very least this should be made into a `Work` class that can be stepped along.
- `Maintainer`
is theoretically controlled by dials the user can set, but they can be set to values that cause blocking, and are in general hard to set right.
- `LedgerTxnImpl::deleteObjectsModifiedOnOrAfterLedger` is unbounded and can cause catchup to stall out; it should be broken up into `Work`.
- `BucketApplicator::advance`
is fixed to the same batch size (`LEDGER_ENTRY_BATCH_COMMIT_SIZE`) as `LedgerTxnImpl`, but this might be too big if we're doing a lot of these in the middle of a catchup. Possibly a better design involves being able to ask the ASIO queue / `VirtualClock` whether we've run the current work unit for long enough that we've exhausted a virtual time slice and should yield -- ideally in some way that preserves as much determinism as possible (especially when running in `VIRTUAL_TIME` mode; e.g. it should answer "yes" if there's any other pending IO or enqueued work scheduled).
- `work/WorkScheduler.cpp`
currently calls `self->mApp.getClock().getIOContext().post()` directly, rather than interacting with the delayed-work queue. It might make sense to move some or all of `Work`'s work to the delayed-work queue.
- `VirtualClock::crank`
prioritizes up to one block (currently 100 elements) of work in the ASIO `io_service` internal event queue before adding everything enqueued in its own delayed-work queue (`mDelayedExecutionQueue`). This means that if a lot of work has built up on the delayed-work queue, it all gets executed at once, ahead of the next non-delayed work in ASIO -- not ideal. Better would be a design that considers work latency more explicitly, and ensures it only takes a bounded maximum time-slice away from latency-sensitive ASIO "actual IO" events to do any posted work ("delayed" or otherwise).
- `LedgerTxnImpl`
when talking to the database is fixed at `LEDGER_ENTRY_BATCH_COMMIT_SIZE` rows per batch; this is harder to change, not because the number is hard to change but because changing it doesn't change the synchronous nature of a commit, which semantically blocks its caller -- it happens midway through `EXTERNALIZE` and shouldn't be interleaved with other network-triggered events aside from incoming or outgoing buffer management. So again, this possibly suggests splitting network service into its own thread, but it says nothing about (and is in fact representative of!) the issue of SCP state transitions being real-time latency-sensitive.