ponylang / ponyc

Pony is an open-source, actor-model, capabilities-secure, high performance programming language
http://www.ponylang.io
BSD 2-Clause "Simplified" License

Thundering herd scheduler problem on fan-in messaging patterns #2980

Open slfritchie opened 5 years ago

slfritchie commented 5 years ago

@seantallen had asked me to create a ticket for discussing a "thundering herd" scheduling problem that I've seen at Wallaroo Labs. For the WL app I was working on, the Pony runtime system was AFAIK working correctly. But it would be nifty if the scheduler had some flexibility to adjust to actor communication patterns like this one.

Overview of the app

Wallaroo is a network stream processor: data comes in via TCP sockets, that data triggers computation, and then (usually) transformed data is sent out via TCP. The app I was working on looks something like this:

socket --> TCP actor -+-> analysis actor --+
socket --> TCP actor -+-> analysis actor --+
...                   ^                    |
100s of sockets       |   1000s of actors  +--> TCP actor -> socket
...                   v                    |
socket --> TCP actor -+-> analysis actor --+
  1. Input data arrives at multi-gigabit speeds via multiple sockets to multiple TCP actors. The parsed data contains several million events per second that are routed to analysis actors. There are a few hundred TCP actors.

  2. The TCP actors perform some routing to send data to the appropriate analysis actor(s) in a "fan-out" messaging pattern.

  3. The analysis actors receive input data regularly but only send downstream data occasionally, e.g., once per minute when a processing time interval has elapsed. There are a few thousand analysis actors.

  4. When the processing time interval has elapsed, all of the analysis actors need to send their summary stats downstream. Note that this is the only time they send anything downstream, and that these actors create the "thundering herd" at these regular intervals.

  5. All thundering herd members send their data out through a single TCP socket. This is a "fan-in" messaging pattern: each actor sends only one message downstream, containing 100s of KBytes of data. A minimal Pony sketch of this shape follows the list.
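To make the shape concrete, here is a minimal Pony sketch of the fan-in leg described in items 3-5. All of the names (`Sink`, `Analysis`, `interval_elapsed`) are hypothetical stand-ins for Wallaroo's real actors and the bodies are stubs; the only point is that thousands of `Analysis` actors each send one large message to the same `Sink` actor at roughly the same moment.

```pony
use "collections"

// Hypothetical stand-ins for the actors described above: many Analysis
// actors fan in to a single Sink (the outgoing TCPConnection in the real app).

actor Sink
  be write(summary: Array[U8] val) =>
    // In the real app the bytes would be written to the outgoing socket.
    None

actor Analysis
  let _sink: Sink

  new create(sink: Sink) =>
    _sink = sink

  be receive(event: String) =>
    // Accumulate per-event state here; nothing is sent downstream yet.
    None

  be interval_elapsed() =>
    // Every Analysis actor does this at roughly the same wall-clock moment,
    // which is what produces the thundering herd at the single Sink.
    let summary = recover val Array[U8] end // 100s of KBytes in practice
    _sink.write(summary)

actor Main
  new create(env: Env) =>
    let sink = Sink
    let analysts = Array[Analysis]
    for i in Range(0, 1000) do
      analysts.push(Analysis(sink))
    end
    // A shared per-window timer firing interval_elapsed() on every actor at
    // once would stand in for Wallaroo's reporting interval.
    for a in analysts.values() do
      a.interval_elapsed()
    end
```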

Behavior of the app on an AWS r5.24xlarge virtual machine

The AWS r5.24xlarge instance type has 96 virtual CPUs, 48 of which are "real" (i.e., not HyperThread logical CPUs). Most of the time, the Pony runtime is working very well: parsing, routing, and processing millions of events per second.

Then the time window elapses, and the runtime's behavior adapts to the temporary change in workload. Some of what follows is confirmed fact, some is speculation.

Observations during runtime

Discussion

Sean is interested in a discussion with the Pony community about how the Pony runtime scheduler might be altered to deal with this kind of fan-in messaging workload. The current scheduler behavior puts an effective ceiling on the RAM used by the app: in practice I've seen the process RSS grow to about 20 GBytes, while the r5.24xlarge VMs have over 768 GBytes of RAM available. It would be nice to be able to take advantage of more of that RAM. ^_^

Wallaroo Labs has a work-around: we've altered the application to avoid the thundering herd by sending an individual message to each analysis actor when it's time to send its summary stats downstream. This isn't a perfect solution, but it has worked well enough in practice so far ... until the input data changes shape enough to make even this slow-motion herd thunder and stall.
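For concreteness, here is a hedged sketch of one way such a staggered flush could be structured in Pony; the names (`Coordinator`, `flush`, `_continue`) are hypothetical and the exact mechanism Wallaroo uses may differ. A single coordinator walks the list of analysis actors and issues one flush request per behavior invocation, instead of having every analysis actor fire on a shared interval, so the herd is spread over the time the coordinator takes to work through the list.

```pony
use "collections"

actor Sink
  be write(summary: Array[U8] val) =>
    None // stands in for the outgoing TCPConnection

actor Analysis
  let _sink: Sink

  new create(sink: Sink) =>
    _sink = sink

  be flush() =>
    // Send this actor's summary downstream only when explicitly asked to.
    _sink.write(recover val Array[U8] end)

actor Coordinator
  let _analysts: Array[Analysis] val
  var _next: USize = 0

  new create(analysts: Array[Analysis] val) =>
    _analysts = analysts

  be interval_elapsed() =>
    // Fired once per reporting window (e.g. by a timer).
    _next = 0
    _flush_one()

  be _continue() =>
    _flush_one()

  fun ref _flush_one() =>
    try
      _analysts(_next)?.flush()
      _next = _next + 1
      // Re-schedule via a self-message so the scheduler can run other actors
      // (including the Sink draining its mailbox) between flush requests.
      _continue()
    end

actor Main
  new create(env: Env) =>
    let sink = Sink
    let analysts = recover val
      let a = Array[Analysis]
      for i in Range(0, 1000) do
        a.push(Analysis(sink))
      end
      a
    end
    Coordinator(analysts).interval_elapsed()
```

Whether a scheme like this actually bounds memory depends on how quickly the sink drains relative to how quickly the coordinator hands out requests, which matches the "slow-motion herd" caveat above.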

SeanTAllen commented 5 years ago

Want to note that, while there are workarounds for the specific case @slfritchie mentions, the general issue of fan-in with the backpressure system remains.

aturley commented 5 years ago

Leaving as "needs discussion during sync" because @SeanTAllen wants to discuss it with @sylvanc .

SeanTAllen commented 5 years ago

#3009 is a possible solution to this problem.