sblessing opened this issue 4 years ago
Can you explain where this demonstrates premature termination? I'm not sure that what you mean by "premature termination" is the same thing I take it to mean.
The program terminates although there is still work left.
How is it possible for the program to terminate and for the graph to continue?
The graph is generated with an external runner that spawns the pony process. The process is restarted for every core count and run using numactl and --ponymaxthreads.
Can you work on coming up with a specific application environment that is, ideally, reproducible to show the early termination problem?
Indeed, that would be ideal. The problem is that I can only observe it on large core count environments, and it's non-deterministic. The only thing I know for sure is that the problem goes away with the above linked commit.
I am trying to find a more minimal example to provoke this. Ideally this would also show an effect on performance (at least for above application, sched mute appears to make it worse).
@sblessing: if you give me specific instructions on what to download/compile/run, I can attempt to replicate here (I have a 28-core machine).
Thanks, I do have a 48-core box available, and I haven't observed it there. @KittyMac you can check out https://github.com/sblessing/chat-app and just run the pony implementation at pony/ with its default arguments.
Actor muting will have a performance impact on any application that triggers it. Testing has shown it has no real measurable impact on applications that do not trigger it.
Muting does prevent runaway memory growth that can otherwise result.
Something I've wanted to do with the muting is make it "more advanced" by allowing some actors to process more messages in a batch if they are regularly overloaded. However, I've never had a good example application that has shown value in any of the ways I've attempted it, so it was removed from the backpressure system before it was merged. Having a couple of example applications such as this would go a long way toward making progress on that. And for that specific bit, it doesn't even need to be many.
What is really needed:
Does that chat app in question meet those two criteria @sblessing, it certainly seems like point 1 is there. What about point 2?
@SeanTAllen i'll try to give a useful answer to your question ASAP. Is the runaway memory growth you are talking about, resulting from messages piling up at the overloaded actors message queue without muting? In that case, I think this benchmark application can be of help.
@sblessing yes, where an actor is unable to keep up with the messages it is being sent by other actors and memory grows and grows. So if the application in question reaches a stable point memory-wise but performance is impacted, then woo-hoo! I finally have a good test case for "phase 2" of backpressure work.
I am currently running a memory inspection with muting turned on and with muting turned off. I should be able to answer this later today.
@sblessing what is the expected output? What am I supposed to see if it terminates prematurely?
@KittyMac you would see a statistics output if everything worked correctly, and if not, you would see only the first benchmark header line.
That is,

OK:

```
./pony
Benchmark      i-mean       i-median     i-error      i-stddev
Chat App       200.321 ms   213.985 ms   ±8.2943 %    47.9541
               j-mean       j-median     j-error      j-stddev   quality of service
Turns          171.579 ms   198.5 ms     ±1.42148 %   39.8198    39.2816
None            108770
Invite          281920
Post             74025
Compute         576829
Leave            72569
PostDelivery   1195146
```

Broken:

```
./pony
Benchmark      i-mean       i-median     i-error      i-stddev
```
@sblessing excellent. To confirm: in the error case, does the executable actually exit? Or does it just hang?
When I attempt with stock pony I get the correct output as you describe. But when I try on my fork of pony I get the second, with it hanging without exit, consistently. If that's the expected error, then I will dig in and see if I can find the cause.
@KittyMac No in my case, it always exited. It never did hang.
Ok, I spent a bit looking; will most probably spend a bit more time looking this weekend.
Here's what I've seen so far. These experiments are mostly on my fork, so take with a grain of salt.
```
src/libponyrt/actor/actor.c:1225: ponyint_unmute_actor: Assertion `is_muted == 1` failed
```
In ponyint_unmute_actor(pony_actor_t* actor), right after an atomic fetch sub for &actor->is_muted.
Which implies to me that is_muted has an unexpected value going into that function. Perhaps it was asked to be muted twice? Anyway, I take it as cautionary confirmation that something screwy is indeed going on with actor muting.
The benchmark results running in stock pony:
```
Benchmark      i-mean       i-median     i-error      i-stddev
Chat App       37.1577 ms   36.6524 ms   ±7.52943 %   8.07477
               j-mean       j-median     j-error      j-stddev   quality of service
Turns          33.6484 ms   26 ms        ±1.29468 %   7.11248    7.21521
Post            227783
Ignore         1214301
PostDelivery   3275550
None             67606
Invite           98959
Leave            82185
Compute         572043
```
The benchmark results running on my fork of pony:
```
Benchmark      i-mean       i-median     i-error      i-stddev
Chat App       16.0832 ms   16.5816 ms   ±4.1761 %    1.93848
               j-mean       j-median     j-error      j-stddev   quality of service
Turns          11.2295 ms   11 ms        ±0.68923 %   1.26362    1.28382
PostDelivery   3185236
None             69046
Invite           98959
Post            226741
Leave            81787
Compute         572043
Ignore         1214301
```
@sblessing what exactly does "quality of service" entail? Are higher scores better or worse?
@KittyMac I'm going to ignore any debugging from your fork. It's against an unknown codebase and looking for assertion problems that might be the result of a change you've made isn't something I'm going to undertake.
I don't take any cautionary tale from it other than "something is wrong in your fork". Might it exist elsewhere? Yes, but there's been tons of production usage of the muting code in the regular runtime and a much more limited set with your fork. If you could confine debugging to the standard pony runtime, that would be appreciated.
@KittyMac quality of service: lower is better, but that's not really the topic here :). I am happy to discuss that offline. I agree with @SeanTAllen, we would need to find an issue within stock ponyrt, and there definitely is one.
I'm very dubious that the particular assertion that was triggered in @KittyMac's fork exists in the stock ponyrt. I can believe that @sblessing is hitting an issue, but that assertion seems really, really unlikely. Here's the code for mute and unmute:
```c
void ponyint_mute_actor(pony_actor_t* actor)
{
  uint8_t is_muted = atomic_fetch_add_explicit(&actor->is_muted, 1, memory_order_relaxed);
  pony_assert(is_muted == 0);
  DTRACE1(ACTOR_MUTED, (uintptr_t)actor);
  (void)is_muted;
}

void ponyint_unmute_actor(pony_actor_t* actor)
{
  uint8_t is_muted = atomic_fetch_sub_explicit(&actor->is_muted, 1, memory_order_relaxed);
  pony_assert(is_muted == 1);
  DTRACE1(ACTOR_UNMUTED, (uintptr_t)actor);
  (void)is_muted;
}
```
To get that assertion in unmute without hitting the one in mute would be incredibly unlikely and would mean that everything is completely borked somewhere. You'd need to change the scheduler guarantees about an actor being run for the unmute assert to be triggered without the mute one being triggered.
That guarantee is that an actor is only being run by 1 scheduler thread at a time. And if that is violated then everything else in Pony is violated as well. You might get some interesting performance improvements but you'd also have safety go out the window.
My apologies, I usually work under the assumption that more information is better than less information. I will curb myself going forward.
There's no need to apologize @KittyMac.
@KittyMac are you on ARM or X86?
This is on a 28-core X86 machine.
I just tried reproduction using stock pony with mixed results:
With the normal hw_cpu_count, this issue doesn't reproduce for me. But that's not surprising, as he said it needed a machine with > 48 cores.
So I tried artificially raising the hw_cpu_count to 112. The program then behaves for me exactly how @sblessing describes, every time.
I don't see the assertion I saw on my fork.
I have not had time to dig further.
@KittyMac good idea for tracking this with hw_cpu_count. I can confirm that it appears to happen every time on really large core/thread counts.
Also, for completeness, I am running the following hardware - but this won't be the issue.
```
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       43 bits physical, 48 bits virtual
CPU(s):              128
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               49
Model name:          AMD EPYC 7702 64-Core Processor
```
@KittyMac as a bonus outcome of your assertion, I went back and looked at the muting code atomics with Sylvan and we think there is a possible bug on ARM, so your report probably ended up resulting in the fix of an ARM bug.
@SeanTAllen can you briefly explain the ARM bug?
@sblessing no, because all I got when Sylvan reviewed it was "hmmm, that is likely to be subtly wrong on ARM".
I need to correct myself: with my commit above disabling the muting, we see the problem far less often, but it still occurs, and it seems to be related to the number of actors in the system.
@sblessing the plot thickens! Are you using the cycle detector? Is --ponynoblock in use?
I do, and hence am not using --ponynoblock. In general, using and not using --ponynoblock is quite problematic for the process of benchmarking. Using --ponynoblock causes actors not to be collected (at least for this application), or at least their finalizers not to be called. Not using --ponynoblock slows down the process considerably, because when lots of actors are in place, ponyrt takes a long time to terminate, even though all actors are blocked. But this is a totally different topic.
@sblessing --ponynoblock does not cause actors to not be collected in general; it would, however, if you have cycles.
@jemc is slowly working on porting Pony to use the verona runtime, which doesn't need the cycle detector; that is very nice for performance. I will be happy when the pony runtime no longer has a cycle detector.
Anyway, trying to recap where we stand...
1. happens with cycle detector
2. more likely with backpressure
3. more likely with higher number of actors
4. seems to be deterministic for some combination of variables involving large number of cpu cores
Am I leaving anything out at this point?
Yes, your list is complete. I do know about the actor GC in correlation with --ponynoblock (as this goes back a long way), but for this application in question, GC'ing with --ponynoblock caused actors which are clearly not referenced anymore (and not involved in any cycle) to be collected right after all actors in the system are blocked (rather than interleaved with normal execution). We should discuss this on maybe a sync call or in a longer 1:1 discussion; then I can elaborate a bit more on that.
Also, I think it would be possible to have multiple cycle detectors :). This would improve the current behavior a lot (I think).
Please open another issue for ", but for this application in question, GC'ing with --ponynoblock caused actors which are clearly not referenced anymore (and not involved in any cycle), to be collected right after all actors in the system are blocked (rather than interleaved with normal execution)." so we can discuss. Based on the logic for gc'ing actors when noblock is used, I find this hard to understand.
@sblessing can you try using --ponynoscale and let me know if you see the same results? Or rather, it will change things; I'm wondering whether the problem goes away without scaling (would require many runs as it will change the behavior of the runtime substantially in some cases).
EDIT: it has since been found that this isn't a backpressure situation so much as backpressure making it more likely to occur. Investigation is ongoing.
Running a benchmark application on large machines (> 48 physical cores) shows that the pony runtime backpressure/actor mute feature, in combination with ponyint_sched_mute, leads to premature termination (i.e. terminating the pony process before actual quiescence is reached) non-deterministically. With https://github.com/sblessing/ponyc/commit/789c2906b6e887d6998abd2a713bfad52d08d54b this behavior is not observable any more.
Also, pony's behavior is far "spikier" (and slower) with actor/sched mute than without.
Behavior with actor/sched mute (Pony going to 0 meant premature termination, no segfault):
Behavior without actor/sched mute:
This needs to be investigated more closely and in a way where it can be debugged/fixed. Backpressure/thread muting/scaling does not work correctly at the moment.