stellar / stellar-core

Reference implementation for the peer-to-peer agent that manages the Stellar network.
https://www.stellar.org

too much main-thread blocking / starving / maybe we are using ASIO pretty badly #2304

Open graydon opened 5 years ago

graydon commented 5 years ago

This is a tracking bug for a handful of related issues, which may or may not get a unified treatment. Picking up from #591. Additional dependent bugs should be filed as necessary / as the issues are addressed.

Roughly speaking: we block the main thread too easily, which causes TCP connections to idle out / drop data, and/or the SCP state machine to take too long to track/participate in voting, at which point the node decides it's desynchronized and we fall back to running catchup. There are a few egregious cases and a few approaches we might consider.

graydon commented 5 years ago

Thinking & researching a bit more about this: ASIO (annoyingly) seems to lack the ability to tell us whether it's got IO work to do without actually doing a unit of it (I suspect this is due to its requirement to work atop Windows IOCP), and so we can't "yield from work when it's time to do IO". So I think we might need to do the simpler thing(s):

graydon commented 4 years ago

Another place this general family of issues crops up is in the simulation code, where we used to run multiple applications on the same virtual clock but have (as of https://github.com/stellar/stellar-core/pull/1390) switched to using multiple virtual clocks and a secondary outer advance loop. I think this is probably the wrong approach, and what I'd far prefer to see is a world with a single advancing clock per simulation and (again) a dual-queue structure where the application has a notion of work-units-to-do and the ASIO queue only makes a single call per timeslice to the application to "do a unit of work".

(This would also let us more-faithfully represent the sorts of simulation tasks that the various "crank-some" and "crank-until" helpers are trying to do -- we could accurately differentiate waiting for a barrier at the application level from waiting for a barrier at the IO level)
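To make the dual-queue idea concrete, here is a minimal sketch using only the standard library. It is not stellar-core code and does not use ASIO: `DualQueueScheduler`, `postIO`, `postWork`, `crank` and `demoTrace` are hypothetical names, and a plain `std::deque` stands in for the ASIO handler queue. The one property it tries to show is that each timeslice services pending IO first and then runs at most one application work unit, so a backlog of work can't starve IO.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical illustration only: none of these names exist in stellar-core
// or ASIO. The "IO queue" is a plain deque standing in for ASIO's handlers.
class DualQueueScheduler
{
  public:
    using Task = std::function<void()>;

    void postIO(Task t)
    {
        assert(t && "empty IO task");
        mIOQueue.push_back(std::move(t));
    }

    void postWork(Task t)
    {
        assert(t && "empty work task");
        mWorkQueue.push_back(std::move(t));
    }

    // One timeslice: run the IO handlers that were pending when the slice
    // began, then run at most ONE application work unit.
    // Returns the number of tasks executed.
    std::size_t crank()
    {
        std::size_t ran = 0;
        // Snapshot the count so handlers posted during this slice are
        // deferred to the next one.
        std::size_t const nIO = mIOQueue.size();
        for (std::size_t i = 0; i < nIO; ++i)
        {
            // Move the task out before running it, since it may post more.
            Task t = std::move(mIOQueue.front());
            mIOQueue.pop_front();
            t();
            ++ran;
        }
        if (!mWorkQueue.empty())
        {
            Task t = std::move(mWorkQueue.front());
            mWorkQueue.pop_front();
            t();
            ++ran;
        }
        return ran;
    }

    bool idle() const
    {
        return mIOQueue.empty() && mWorkQueue.empty();
    }

  private:
    std::deque<Task> mIOQueue;
    std::deque<Task> mWorkQueue;
};

// Queues three work units and two IO handlers; the first work unit also
// posts a new IO handler. Returns the order in which tasks actually ran.
std::vector<std::string> demoTrace()
{
    DualQueueScheduler sched;
    std::vector<std::string> trace;
    sched.postWork([&] {
        trace.push_back("w1");
        sched.postIO([&] { trace.push_back("io3"); });
    });
    sched.postWork([&] { trace.push_back("w2"); });
    sched.postWork([&] { trace.push_back("w3"); });
    sched.postIO([&] { trace.push_back("io1"); });
    sched.postIO([&] { trace.push_back("io2"); });
    while (!sched.idle())
    {
        sched.crank();
    }
    return trace;
}
```

Calling `demoTrace()` shows the interleaving: the IO handler posted by the first work unit runs before the second work unit, even though that work unit was queued earlier. In a simulation, the same split would let a helper distinguish "the work queue is empty" from "the IO queue is empty", which is the application-level vs IO-level barrier distinction mentioned above.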

graydon commented 4 years ago

Note: a fair amount of work has been done in this direction in PR https://github.com/stellar/stellar-core/pull/2501

Two remaining points came up while reviewing that: