slfritchie opened this issue 5 years ago
Want to note that, while there are workarounds for the specific case that @slfritchie mentions, the general issue of fan-in with the backpressure system remains.
Leaving as "needs discussion during sync" because @SeanTAllen wants to discuss it with @sylvanc.
@seantallen had asked me to create a ticket for discussing a "thundering herd" scheduling problem that I've seen at Wallaroo Labs. For the WL app I was working on, the Pony runtime system was AFAIK working correctly. But it would be nifty if the scheduler had some flexibility to adjust to actor communication patterns like this one.
Overview of the app
Wallaroo is a network stream processor: data comes in via TCP sockets, that data triggers computation, and then (usually) transformed data is sent out via TCP. The app I was working on looks something like this:
Input data arrives at multi-gigabit speeds via multiple sockets to multiple TCP actors. The parsed data contains several million events per second that are routed to analysis actors. There are a few hundred TCP actors.
The TCP actors perform some routing to send data to the appropriate analysis actor(s) in a "fan out" messaging pattern.
The analysis actors receive input data regularly but only send downstream data occasionally, e.g., once per minute when a processing time interval has elapsed. There are a few thousand analysis actors.
When the processing time interval has elapsed, all of the analysis actors need to send their summary stats downstream. Note that this is the only time they send stuff downstream, and that these actors create the "thundering herd" at these regular intervals.
All thundering herd members send their data out through a single TCP socket. This is a "fan in" messaging pattern. Each actor sends only one message downstream, each containing 100s of KBytes of data.
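For concreteness, here is a minimal Pony sketch of that shape. It is not Wallaroo code: the actor names (OutSink, AnalysisActor) and the hard-coded 4,000-actor count are placeholders, and all of the real parsing and routing is elided.

```pony
use "collections"

actor OutSink
  // Stands in for the single output TCPConnection actor that every
  // analysis actor funnels its once-per-interval summary through.
  new create() =>
    None

  be write(summary: String) =>
    None // the real sink would write the bytes to its TCP socket

actor AnalysisActor
  let _sink: OutSink

  new create(sink: OutSink) =>
    _sink = sink

  be event(data: String) =>
    None // fold the event into a running summary (elided)

  be flush() =>
    // The only downstream send this actor ever makes: once per interval,
    // a few hundred KB of summary data to the shared sink.
    _sink.write("summary bytes")

actor Main
  new create(env: Env) =>
    let sink = OutSink
    let analyzers = Array[AnalysisActor]
    for i in Range(0, 4_000) do
      analyzers.push(AnalysisActor(sink))
    end
    // When the interval elapses, every analyzer is told to flush at once;
    // this broadcast is the thundering herd, and the sink's mailbox fills
    // far faster than it can drain.
    for a in analyzers.values() do
      a.flush()
    end
```

When every flush() lands in the same short window, the single OutSink becomes the bottleneck described in the next section.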
Behavior of the app on an AWS r5.24xlarge virtual machine
The AWS r5.24xlarge instance type has 96 virtual CPUs, 48 of which are "real" (i.e., not HyperThread logical CPUs). Most of the time, the Pony runtime is working very well: parsing, routing, and processing millions of events per second.
Then the time window interval elapses, and the runtime's behavior adapts to the temporary new workload. Some of the below is confirmed fact, some is speculation.
Some analysis actors finish creating their stats summaries first and send a single message downstream.
The single TCP actor for the system's output becomes congested.
The TCP actor processes 100 messages in its mailbox, but with several thousand upstream actors sending messages, it's likely the mailbox contains more than 100 messages. The output TCP actor is considered under pressure by the scheduler. (NOTE: 100 is the scheduler batch size constant defined at https://github.com/ponylang/ponyc/blob/4293efc4e6301157a57ad6041b0b06296fdd7d63/src/libponyrt/sched/scheduler.c#L15)
Any analysis actor unlucky enough to send a message to the output TCP actor while it is under pressure is muted and no longer eligible to run.
Each muted & paralyzed analysis actor has lots of input messages being routed to it by upstream actors. The omniscient gods know that the analysis actor won't be sending any more messages downstream for another 60 seconds, but the scheduler pessimistically assumes that it will send more and thus prevents the actor from running any further.
Observations during runtime
During the normal workload, a tool like top or iostat will show that 48 CPU cores are busy running Pony code.
After the window interval elapses, the user% CPU reported drops to 1-2 CPU cores. This CPU utilization rate drop can continue (depending on input data) for 10+ seconds.
Discussion
Sean is interested in a discussion with the Pony community about how the Pony runtime scheduler might be altered to deal with this kind of fan-in messaging workload. The current scheduler behavior creates a ceiling on RAM used by the app; in practice I've seen process RSS size grow to about 20 GBytes. However, the r5.24xlarge VMs have over 768 GBytes of RAM available. It would be nice to be able to take advantage of more of that RAM. ^_^
Wallaroo Labs has a work-around: we've altered the application to avoid the thundering herd by sending individual messages to each analysis actor when it's time to send its summary stats downstream. This isn't a perfect solution, but it works well enough in practice so far ... until the input data changes shape enough to cause this slow-mo herd to thunder & stall also.
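For illustration, here is one hypothetical shape such a work-around could take; this is my sketch, not the actual Wallaroo change, and FlushCoordinator with its ack protocol is an invented name. The idea is to serialize the flushes: a coordinator prompts one analysis actor at a time and only moves on after that actor reports back, so summaries trickle into the shared sink instead of arriving all at once.

```pony
use "collections"

actor OutSink
  new create() =>
    None

  be write(summary: String) =>
    None // stands in for the single output TCP connection actor

actor AnalysisActor
  let _sink: OutSink

  new create(sink: OutSink) =>
    _sink = sink

  be flush(coordinator: FlushCoordinator) =>
    _sink.write("summary bytes")
    // Report back so the coordinator can prompt the next actor.
    coordinator.flushed()

actor FlushCoordinator
  let _analyzers: Array[AnalysisActor] val
  var _next: USize = 0

  new create(analyzers: Array[AnalysisActor] val) =>
    _analyzers = analyzers

  be start_flush() =>
    // Called when the processing interval elapses.
    _next = 0
    _prompt_next()

  be flushed() =>
    _prompt_next()

  fun ref _prompt_next() =>
    // Prompt at most one analyzer at a time; stop once all have flushed.
    try
      let a = _analyzers(_next)?
      _next = _next + 1
      a.flush(this)
    end

actor Main
  new create(env: Env) =>
    let sink = OutSink
    let analyzers = recover val
      let arr = Array[AnalysisActor]
      for i in Range(0, 4_000) do
        arr.push(AnalysisActor(sink))
      end
      arr
    end
    FlushCoordinator(analyzers).start_flush()
```

Whether the flushes are fully serialized (as above) or merely staggered in small groups is a tuning choice; either way the goal is to keep the output TCP actor's mailbox short enough that the scheduler never mutes thousands of upstream actors at once.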