slfritchie opened this issue 5 years ago
Want to note that, while there are workarounds for the specific case that @slfritchie mentions, the general issue of fan-in with the backpressure system remains.
Leaving as "needs discussion during sync" because @SeanTAllen wants to discuss it with @sylvanc.
@seantallen had asked me to create a ticket for discussing a "thundering herd" scheduling problem that I've seen at Wallaroo Labs. For the WL app I was working on, the Pony runtime system was AFAIK working correctly. But it would be nifty if the scheduler had some flexibility to adjust to actor communication patterns like this one.
Overview of the app
Wallaroo is a network stream processor: data comes in via TCP sockets, that data triggers computation, and then (usually) transformed data is sent out via TCP. The app I was working on looks something like this:
Input data arrives at multi-gigabit speeds via multiple sockets to multiple TCP actors. The parsed data contains several million events per second that are routed to analysis actors. There are a few hundred TCP actors.
The TCP actors perform some routing to send data to the appropriate analysis actor(s) in a "fan out" messaging pattern.
The analysis actors receive input data regularly but only send downstream data occasionally, e.g., once per minute when a processing time interval has elapsed. There are a few thousand analysis actors.
When the processing time interval has elapsed, all of the analysis actors need to send their summary stats downstream. Note that this is the only time they send stuff downstream, and that these actors create the "thundering herd" at these regular intervals.
All thundering herd members send their data out through a single TCP socket. This is a "fan in" messaging pattern. Each actor sends only one message downstream, each containing 100s of KBytes of data.
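For concreteness, here is a minimal Pony sketch of that shape. It is not Wallaroo code: the actor names (OutSink, AnalysisActor) and the hard-coded 4,000-actor count are placeholders, and all of the real parsing and routing is elided.

```pony
use "collections"

actor OutSink
  // Stands in for the single output TCPConnection actor that every
  // analysis actor funnels its once-per-interval summary through.
  new create() =>
    None

  be write(summary: String) =>
    None // the real sink would write the bytes to its TCP socket

actor AnalysisActor
  let _sink: OutSink

  new create(sink: OutSink) =>
    _sink = sink

  be event(data: String) =>
    None // fold the event into a running summary (elided)

  be flush() =>
    // The only downstream send this actor ever makes: once per interval,
    // a few hundred KB of summary data to the shared sink.
    _sink.write("summary bytes")

actor Main
  new create(env: Env) =>
    let sink = OutSink
    let analyzers = Array[AnalysisActor]
    for i in Range(0, 4_000) do
      analyzers.push(AnalysisActor(sink))
    end
    // When the interval elapses, every analyzer is told to flush at once;
    // this broadcast is the thundering herd, and the sink's mailbox fills
    // far faster than it can drain.
    for a in analyzers.values() do
      a.flush()
    end
```

When every flush() lands in the same short window, the single OutSink becomes the bottleneck described in the next section.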
Behavior of the app on an AWS r5.24xlarge virtual machine
The AWS r5.24xlarge instance type has 96 virtual CPUs, 48 of which are "real" (i.e., not HyperThread logical CPUs). Most of the time, the Pony runtime is working very well: parsing, routing, and processing millions of events per second.
Then the time window interval elapses, and the runtime's behavior adapts to the temporary new workload. Some of the below is confirmed fact, some is speculation.
Some analysis actors finish creating their stats summaries first and send a single message downstream.
The single TCP actor for the system's output becomes congested.
The TCP actor processes 100 messages in its mailbox, but with several thousand upstream actors sending messages, it's likely the mailbox contains more than 100 messages. The output TCP actor is considered under pressure by the scheduler. (NOTE: 100 is the scheduler batch size constant defined at https://github.com/ponylang/ponyc/blob/4293efc4e6301157a57ad6041b0b06296fdd7d63/src/libponyrt/sched/scheduler.c#L15)
Any analysis actor unlucky enough to send a message to the output TCP actor while it is under pressure is muted and no longer eligible to run.
Each muted & paralyzed analysis actor has lots of input messages being routed to it by upstream actors. The omniscient gods know that the analysis actor won't be sending any more messages downstream for another 60 seconds, but the scheduler pessimistically assumes that it will send more and thus prevents the actor from running any further.
Observations during runtime
During the normal workload, a tool like top or iostat will show that 48 CPU cores are busy running Pony code.
After the window interval elapses, the user% CPU reported drops to 1-2 CPU cores. This CPU utilization rate drop can continue (depending on input data) for 10+ seconds.
Discussion
Sean is interested in a discussion with the Pony community about how the Pony runtime scheduler might be altered to deal with this kind of fan-in messaging workload. The current scheduler behavior creates a ceiling on RAM used by the app; in practice I've seen process RSS size grow to about 20 GBytes. However, the r5.24xlarge VMs have over 768 GBytes of RAM available. It would be nice to be able to take advantage of more of that RAM. ^_^
Wallaroo Labs has a work-around: we've altered the application to avoid the thundering herd by sending individual messages to each analysis actor when it's time to send its summary stats downstream. This isn't a perfect solution, but it works well enough in practice so far ... until the input data changes shape enough to cause this slow-mo herd to thunder & stall also.
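For illustration, here is one hypothetical shape such a work-around could take; this is my sketch, not the actual Wallaroo change, and FlushCoordinator with its ack protocol is an invented name. The idea is to serialize the flushes: a coordinator prompts one analysis actor at a time and only moves on after that actor reports back, so summaries trickle into the shared sink instead of arriving all at once.

```pony
use "collections"

actor OutSink
  new create() =>
    None

  be write(summary: String) =>
    None // stands in for the single output TCP connection actor

actor AnalysisActor
  let _sink: OutSink

  new create(sink: OutSink) =>
    _sink = sink

  be flush(coordinator: FlushCoordinator) =>
    _sink.write("summary bytes")
    // Report back so the coordinator can prompt the next actor.
    coordinator.flushed()

actor FlushCoordinator
  let _analyzers: Array[AnalysisActor] val
  var _next: USize = 0

  new create(analyzers: Array[AnalysisActor] val) =>
    _analyzers = analyzers

  be start_flush() =>
    // Called when the processing interval elapses.
    _next = 0
    _prompt_next()

  be flushed() =>
    _prompt_next()

  fun ref _prompt_next() =>
    // Prompt at most one analyzer at a time; stop once all have flushed.
    try
      let a = _analyzers(_next)?
      _next = _next + 1
      a.flush(this)
    end

actor Main
  new create(env: Env) =>
    let sink = OutSink
    let analyzers = recover val
      let arr = Array[AnalysisActor]
      for i in Range(0, 4_000) do
        arr.push(AnalysisActor(sink))
      end
      arr
    end
    FlushCoordinator(analyzers).start_flush()
```

Whether the flushes are fully serialized (as above) or merely staggered in small groups is a tuning choice; either way the goal is to keep the output TCP actor's mailbox short enough that the scheduler never mutes thousands of upstream actors at once.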