ponylang / ponyc

Pony is an open-source, actor-model, capabilities-secure, high performance programming language
http://www.ponylang.io
BSD 2-Clause "Simplified" License

premature termination of pony runtime under some situations #3551

Open sblessing opened 4 years ago

sblessing commented 4 years ago

EDIT: it has since been found that this isn't a backpressure situation as much as backpressure is making it more likely to occur. Investigation is ongoing.

Running a benchmark application on large machines (> 48 physical cores) shows that the pony runtime's backpressure/actor mute feature, in combination with ponyint_sched_mute, non-deterministically leads to premature termination (i.e. the pony process terminates before actual quiescence is reached).

With https://github.com/sblessing/ponyc/commit/789c2906b6e887d6998abd2a713bfad52d08d54b this behavior is not observable any more.

Also, pony's behavior is far "spikier" (and slower) with actor/sched mute than without.

Behavior with actor/sched mute (Pony going to 0 meant premature termination, no segfault):

[graph omitted]

Behavior without actor/sched mute:

[graph omitted]

This needs to be investigated more closely, and in a way that allows it to be debugged and fixed. Backpressure/thread muting/scaling does not work correctly at the moment.

SeanTAllen commented 4 years ago

Can you explain where this demonstrates premature termination? I'm not sure that what you mean by "premature termination" is the same thing that I take it to mean.

sblessing commented 4 years ago

The program terminates although there is still work left.

SeanTAllen commented 4 years ago

How is it possible for a program to terminate and for the graph to continue?

sblessing commented 4 years ago

The graph is generated with an external runner that spawns the pony process. The process is restarted for every core count and run using numactl and --ponymaxthreads.

SeanTAllen commented 4 years ago

Can you work on coming up with a specific application environment that is, ideally, reproducible to show the early termination problem?

sblessing commented 4 years ago

Indeed, that would be ideal. The problem is that I can only observe it on large core count environments and it's non-deterministic. The only thing I know for sure is that the problem goes away with the commit linked above.

I am trying to find a more minimal example that provokes this. Ideally it would also show an effect on performance (at least for the above application, sched mute appears to make things worse).

KittyMac commented 4 years ago

@sblessing: if you give me specific instructions on what to download/compile/run, I can attempt to replicate here (I have a 28-core machine).

sblessing commented 4 years ago

Thanks, I do have a 48-core box available, and I haven't observed it there. @KittyMac you can check out https://github.com/sblessing/chat-app and just run the pony implementation in pony/ with its default arguments.

SeanTAllen commented 4 years ago

Actor muting will have a performance impact on any application that triggers it. Testing has shown it has no real measurable impact on applications that do not trigger it.

Muting does prevent runaway memory growth that can otherwise result.

Something I've wanted to do with the muting is make it "more advanced" by allowing some actors to process more messages in a batch if they are regularly overloaded. However, I've never had a good example application that has shown value in any of the ways I've attempted it, so it was removed from the backpressure system before it was merged. Having a couple of example applications such as this would go a long way towards working on that. And for that specific bit, it doesn't even need to be many.
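
For illustration, here is a rough sketch of the kind of adaptive batching I have in mind. None of these names exist in ponyrt; it is only meant to show the idea of growing an actor's per-run batch while it keeps ending up overloaded:

/* Hypothetical sketch only: none of these names exist in ponyrt. The idea
 * is to grow an actor's per-run message batch while it keeps being
 * overloaded and drop back to the base size once it catches up. */
#include <stdbool.h>
#include <stdint.h>

#define BASE_BATCH 100  /* messages processed per scheduler run, normally */
#define MAX_BATCH  800  /* cap so one actor can't hog a scheduler thread */

typedef struct adaptive_state_t
{
  uint32_t overload_streak;  /* consecutive runs the actor ended still overloaded */
} adaptive_state_t;

/* Called once per scheduler run of the actor, after the previous batch. */
static uint32_t next_batch_size(adaptive_state_t* s, bool was_overloaded)
{
  if(was_overloaded)
    s->overload_streak++;
  else
    s->overload_streak = 0;

  /* Double the batch for each consecutive overloaded run, up to the cap. */
  uint32_t shift = s->overload_streak < 3 ? s->overload_streak : 3;
  uint32_t batch = (uint32_t)BASE_BATCH << shift;
  return batch < MAX_BATCH ? batch : MAX_BATCH;
}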

What is really needed:

Does the chat app in question meet those two criteria, @sblessing? It certainly seems like point 1 is there. What about point 2?

sblessing commented 4 years ago

@SeanTAllen I'll try to give a useful answer to your question ASAP. Is the runaway memory growth you are talking about the result of messages piling up in an overloaded actor's message queue without muting? In that case, I think this benchmark application can be of help.

SeanTAllen commented 4 years ago

@sblessing yes, where an actor is unable to keep up with the messages it is being sent by other actors and memory grows and grows. So if the application in question reaches a stable point memory-wise but performance is impacted, then woo-hoo! I finally have a good test case for "phase 2" of the backpressure work.
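
To make the failure mode concrete, here is a toy model (not ponyrt code; the rates and threshold are made up) of a consumer that can't keep up with its producers, with and without a muting threshold:

/* Toy model, not ponyrt code: a consumer that handles messages more slowly
 * than producers send them. Without backpressure the mailbox depth (and the
 * memory holding those messages) grows without bound; muting senders at a
 * threshold bounds it instead. */
#include <stdbool.h>
#include <stdio.h>

int main(void)
{
  const long send_rate = 1000;      /* messages per tick from producers */
  const long handle_rate = 400;     /* messages per tick the consumer can process */
  const long mute_threshold = 5000; /* hypothetical backpressure trigger */
  long depth_unbounded = 0;
  long depth_with_muting = 0;

  for(int tick = 1; tick <= 10; tick++)
  {
    depth_unbounded += send_rate - handle_rate;

    bool producers_muted = depth_with_muting >= mute_threshold;
    depth_with_muting += (producers_muted ? 0 : send_rate) - handle_rate;
    if(depth_with_muting < 0)
      depth_with_muting = 0;

    printf("tick %2d: no backpressure %6ld, with muting %5ld\n",
      tick, depth_unbounded, depth_with_muting);
  }

  return 0;
}

Without the threshold the queue depth grows every tick; with it, depth levels off around the trigger point.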

sblessing commented 4 years ago

I am currently running a memory inspection with muting turned on and with muting turned off. I should be able to answer this later today.

KittyMac commented 4 years ago

@sblessing what is the expected output? What am I supposed to see if it terminates prematurely?

sblessing commented 4 years ago

@KittyMac you would see a statistics output if everything worked correctly; if not, you would see only the first benchmark header line.

That is,

OK:

./pony
Benchmark                                  i-mean          i-median           i-error          i-stddev
Chat App                               200.321 ms        213.985 ms        ±8.2943 %           47.9541
                                           j-mean          j-median           j-error          j-stddev              quality of service
Turns                                  171.579 ms          198.5 ms       ±1.42148 %           39.8198                         39.2816

None                108770
Invite              281920
Post                 74025
Compute             576829
Leave                72569
PostDelivery       1195146

Broken:

./pony 
Benchmark                                  i-mean          i-median           i-error          i-stddev

KittyMac commented 4 years ago

@sblessing excellent. To confirm: in the error case, does the executable actually exit? Or does it just hang?

When I attempt this with stock pony, I get the correct output as you describe. But when I try on my fork of pony I consistently get the second one, with it hanging without exiting. If that's the expected error, then I will dig in and see if I can find the cause.

sblessing commented 4 years ago

@KittyMac No, in my case it always exited. It never hung.

KittyMac commented 4 years ago

Ok, I spent a bit of time looking; I will most probably spend a bit more time looking this weekend.

Here's what I've seen so far. These experiments are mostly on my fork, so take with a grain of salt.

  1. When running against a debug ponyrt, I get an assert here:
src/libponyrt/actor/actor.c:1225: ponyint_unmute_actor: Assertion `is_muted == 1` failed

In ponyint_unmute_actor(pony_actor_t* actor), right after an atomic fetch sub for &actor->is_muted.

Which implies to me that is_muted has an unexpected value going into that function. Perhaps it was asked to be muted twice? Anyway, I take it as cautionary confirmation that something screwy is indeed going on with actor muting.

  2. If I remove the calls to ponyint_maybe_mute() in pony_sendv() and pony_sendv_single() (the ones where the actor will mute itself if it sends to an actor with a full mailbox), the assert goes away and the program finishes every time.
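
For anyone following along, a heavily simplified sketch of the send path being discussed. This is not the real pony_sendv/pony_sendv_single from libponyrt; the types, field names, and threshold are made up, and it only shows where the self-muting check hangs off a send:

/* Heavily simplified paraphrase, NOT the real send path from libponyrt. */
#include <stdbool.h>
#include <stddef.h>

typedef struct toy_actor_t
{
  size_t mailbox_depth;  /* stand-in for the real message queue */
  bool muted;            /* stand-in for the runtime's mute state */
} toy_actor_t;

typedef struct toy_ctx_t
{
  toy_actor_t* current;  /* the sending actor, if any */
} toy_ctx_t;

#define OVERLOAD_THRESHOLD 10000  /* hypothetical; the real trigger is more involved */

static void toy_sendv(toy_ctx_t* ctx, toy_actor_t* to)
{
  to->mailbox_depth++;  /* stand-in for pushing the message */

  /* This is the step removed in the experiment above: a sender targeting an
   * overloaded receiver marks itself muted, so the scheduler stops running
   * it until the receiver drains its queue and unmutes it. */
  if((ctx->current != NULL) && (to->mailbox_depth > OVERLOAD_THRESHOLD))
    ctx->current->muted = true;
}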

The benchmark results running in stock pony:

Benchmark                                  i-mean          i-median           i-error          i-stddev
Chat App                               37.1577 ms        36.6524 ms       ±7.52943 %           8.07477
                                           j-mean          j-median           j-error          j-stddev              quality of service
Turns                                  33.6484 ms             26 ms       ±1.29468 %           7.11248                         7.21521

Post                227783
Ignore             1214301
PostDelivery       3275550
None                 67606
Invite               98959
Leave                82185
Compute             572043

The benchmark results running on my fork of pony:

Benchmark                                  i-mean          i-median           i-error          i-stddev
Chat App                               16.0832 ms        16.5816 ms        ±4.1761 %           1.93848
                                           j-mean          j-median           j-error          j-stddev              quality of service
Turns                                  11.2295 ms             11 ms       ±0.68923 %           1.26362                         1.28382

PostDelivery       3185236
None                 69046
Invite               98959
Post                226741
Leave                81787
Compute             572043
Ignore             1214301

@sblessing what exactly does "quality of service" entail? Are higher scores better or worse?

SeanTAllen commented 4 years ago

@KittyMac I'm going to ignore any debugging from your fork. It's against an unknown codebase and looking for assertion problems that might be the result of a change you've made isn't something I'm going to undertake.

I don't take any cautionary tale from it other than "something is wrong in your fork". Might it exist elsewhere? Yes, but there's been tons of production usage of the muting code in the regular runtime and a much more limited set with your fork. If you could confine debugging to the standard pony runtime, that would be appreciated.

sblessing commented 4 years ago

@KittyMac for quality of service, lower is better, but that's not really the topic here :). I am happy to discuss that offline. I agree with @SeanTAllen: we would need to find an issue within stock ponyrt, and there definitely is one.

SeanTAllen commented 4 years ago

I'm very dubious that the particular assertion that was triggered in @KittyMac's fork exists in the stock ponyrt. I can believe that @sblessing is hitting an issue, but that assertion seems really, really unlikely. Here's the code for mute and unmute:

void ponyint_mute_actor(pony_actor_t* actor)
{
  uint8_t is_muted = atomic_fetch_add_explicit(&actor->is_muted, 1, memory_order_relaxed);
  pony_assert(is_muted == 0);
  DTRACE1(ACTOR_MUTED, (uintptr_t)actor);
  (void)is_muted;
}

void ponyint_unmute_actor(pony_actor_t* actor)
{
  uint8_t is_muted = atomic_fetch_sub_explicit(&actor->is_muted, 1, memory_order_relaxed);
  pony_assert(is_muted == 1);
  DTRACE1(ACTOR_UNMUTED, (uintptr_t)actor);
  (void)is_muted;
}

To get that assertion in unmute without hitting the one in mute would be incredibly unlikely and would mean that everything is completely borked somewhere. You'd need to make changes to the scheduler guarantees about an actor being run for the unmute assert to be triggered without the mute one being triggered.

That guarantee is that an actor is only being run by 1 scheduler thread at a time. And if that is violated then everything else in Pony is violated as well. You might get some interesting performance improvements but you'd also have safety go out the window.
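
To spell out what those asserts require, here is a standalone toy (not runtime code) where is_muted can only move 0 -> 1 -> 0 as long as mute/unmute calls strictly alternate for a given actor:

/* Standalone toy spelling out the invariant behind the two asserts. */
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef struct toy_actor_t
{
  _Atomic uint8_t is_muted;
} toy_actor_t;

static void toy_mute(toy_actor_t* a)
{
  uint8_t prev = atomic_fetch_add_explicit(&a->is_muted, 1, memory_order_relaxed);
  assert(prev == 0);  /* a muted actor must never be muted a second time */
}

static void toy_unmute(toy_actor_t* a)
{
  uint8_t prev = atomic_fetch_sub_explicit(&a->is_muted, 1, memory_order_relaxed);
  assert(prev == 1);  /* only a currently muted actor may be unmuted */
}

int main(void)
{
  toy_actor_t a = { 0 };
  toy_mute(&a);    /* 0 -> 1: fine */
  toy_unmute(&a);  /* 1 -> 0: fine */
  /* a second toy_unmute(&a) here would trip the `prev == 1` assert, which is
     the failure reported on the fork above */
  return 0;
}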

KittyMac commented 4 years ago

My apologies, I usually work under the assumption that more information is better than less information. I will curb myself going forward.

SeanTAllen commented 4 years ago

There's no need to apologize @KittyMac.

SeanTAllen commented 4 years ago

@KittyMac are you on ARM or X86?

KittyMac commented 4 years ago

This is on a 28-core X86 machine.

I just tried reproduction using stock pony with mixed results:

  1. With the normal hw_cpu_count, his issue doesn't reproduce for me. But that's not surprising, as he said it needed a machine with > 48 cores.

  2. So I tried artificially raising the hw_cpu_count to 112. The program then behaves for me exactly how @sblessing describes, every time.

  3. I don't see the assertion I saw on my fork.

I have not had time to dig further.

sblessing commented 4 years ago

@KittyMac good idea to track this via hw_cpu_count. I can confirm that it appears to happen every time on really large core/thread counts.

Also, for completeness, I am running the following hardware - but this won't be the issue.

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   43 bits physical, 48 bits virtual
CPU(s):                          128
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           49
Model name:                      AMD EPYC 7702 64-Core Processor

SeanTAllen commented 4 years ago

@KittyMac as a bonus of your assertion though, I went back and looked at the muting code atomics with Sylvan and we think there is a possible bug on ARM, so your report probably ended up resulting in the fix of a bug on ARM.

sblessing commented 4 years ago

@SeanTAllen can you briefly explain the ARM bug?

SeanTAllen commented 4 years ago

@sblessing no, because all I got when Sylvan reviewed it was "hmmm, that is likely to be subtly wrong on ARM".
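
For general background only (the specific concern in the muting code wasn't spelled out to me): the classic way memory_order_relaxed goes subtly wrong on weakly ordered hardware like ARM is that it orders nothing around it, so a relaxed "publish" can be observed before the data it was meant to publish. A generic illustration, not the muting code:

/* Generic illustration; not taken from libponyrt. memory_order_relaxed makes
 * the single operation atomic but imposes no ordering on surrounding
 * accesses. x86's strong ordering tends to hide that; ARM's weaker model
 * does not, so thread_b may see `ready` set before it sees the write to
 * `data`. */
#include <stdatomic.h>

static int data;            /* plain, non-atomic payload */
static _Atomic int ready;   /* flag published with relaxed ordering */

void thread_a(void)
{
  data = 42;
  atomic_store_explicit(&ready, 1, memory_order_relaxed);  /* broken publish */
}

int thread_b(void)
{
  if(atomic_load_explicit(&ready, memory_order_relaxed) == 1)
    return data;  /* may legally observe a stale value on ARM */

  return -1;
}

/* The usual fix is a release store paired with an acquire load:
 *   atomic_store_explicit(&ready, 1, memory_order_release);
 *   atomic_load_explicit(&ready, memory_order_acquire);            */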

sblessing commented 4 years ago

I need to correct myself: with my commit above disabling the muting, we see the problem far less often, but it still occurs, and it seems to be related to the number of actors in the system.

SeanTAllen commented 4 years ago

@sblessing the plot thickens! Are you using the cycle detector? Is --ponynoblock in use?

sblessing commented 4 years ago

I do, and hence I am not using --ponynoblock. In general, using or not using --ponynoblock is quite problematic for the benchmarking process. Using --ponynoblock causes actors not to be collected (at least for this application), or at least their finalizers not to be called. Not using --ponynoblock slows down the process considerably, because when lots of actors are in place, ponyrt takes a long time to terminate, even though all actors are blocked.

But this is a totally different topic.

SeanTAllen commented 4 years ago

@sblessing --ponynoblock does not cause actors to not be collected in general; it would, however, if you have cycles.

@jemc is slowly working on porting to use the Verona runtime, which doesn't need the cycle detector, which is... very nice for performance. I will be happy when the pony runtime no longer has a cycle detector.

Anyway, trying to recap where we stand...

1) happens with cycle detector
2) more likely with backpressure
3) more likely with higher number of actors
4) seems to be deterministic for some combination of variables involving large number of cpu cores

Am I leaving anything out at this point?

sblessing commented 4 years ago

Yes, your list is complete. I do know about the actor GC in correlation with --ponynoblock (as this goes back a long way), but for the application in question, GC'ing with --ponynoblock caused actors which are clearly not referenced anymore (and not involved in any cycle) to be collected right after all actors in the system are blocked (rather than interleaved with normal execution). We should discuss this maybe on a sync call or in a longer 1:1 discussion, then I can elaborate a bit more on that.

Also, I think it would be possible to have multiple cycle detectors :). This would improve the current behavior a lot (I think).

SeanTAllen commented 4 years ago

Please open another issue for ", but for this application in question, GC'ing with --ponynoblock caused actors which are clearly not referenced anymore (and not involved in any cycle), to be collected right after all actors in the system are blocked (rather than interleaved with normal execution)." so we can discuss. Based on the logic for gc'ing actors when noblock is used, I find this hard to understand.

SeanTAllen commented 4 years ago

@sblessing can you try using --ponynoscale and let me know if you see the same results? Or rather, it will change things; I'm wondering whether the problem goes away without scaling (this would require many runs, as it will change the behavior of the runtime substantially in some cases).