Closed: @SeanTAllen closed this issue 6 years ago.
I unfortunately can't replicate this.
@sylvanc you are on sierra, can you try to replicate?
anyone else using sierra who can give this a try?
@jonke which version of sierra do you have specifically?
@SeanTAllen After changing `let timers = Timers` to `let timers = Timers(1)`, the test program took 300% CPU.
```
$ ponyc --version
0.12.3 [release]
compiled with: llvm 3.9.1 -- Apple LLVM version 8.0.0 (clang-800.0.42.1)
$ uname -a
Darwin Andrews-MacBook-Pro.local 16.5.0 Darwin Kernel Version 16.5.0: Fri Mar 3 16:52:33 PST 2017; root:xnu-3789.51.2~3/RELEASE_X86_64 x86_64
```
And before the change @agarman ?
25% with default slop. 300% with slop of 1.
@SeanTAllen At the time of the report macOS Sierra 10.12.3
However, on 10.12.4 with Pony built from source:

```
$ ../../ponyc/build/release/ponyc --version
0.12.3-4eb7bb1 [release]
compiled with: llvm 3.9.1 -- Apple LLVM version 8.0.0 (clang-800.0.42.1)
$ uname -a
Darwin Jonas-MBP 16.5.0 Darwin Kernel Version 16.5.0: Fri Mar 3 16:52:33 PST 2017; root:xnu-3789.51.2~3/RELEASE_X86_64 x86_64
```
```
         PID   COMMAND   %CPU   TIME      #TH  #WQ  #PORT  MEM    PURG  CMPRS  PGRP
slop 0   6322  ponyplay  199.4  00:44.20  4/2  0    16     748K+  0B    0B     632
slop 1   6324  ponyplay  197.6  01:03.65  4/2  0    16     720K+  0B    0B     6324
slop 20  32    ponyplay  37.6   00:04.86  4    0    16     648K+  0B    0B     6332 6
```
Lowering slop values should greatly increase CPU usage.
How are you getting CPU usage? Is that a single moment in time? Is that consistently using that much CPU?
When I tested it was non-stop 300% CPU utilization.
With a standard slop value @agarman?
@SeanTAllen I reproduce the same results as @Jonke. At the default slop (20): 15-25% CPU; at slop (1): 300% CPU.
Suspicion here is that kqueue on Sierra (and maybe El Capitan) is firing spurious events. We have no proof of that, but think we should look into it.
I've had a look again and kqueue isn't firing any spurious events. The `kevent` call here only returns once or twice every 2 seconds, with a single event each time.
Back to `ponyint_cpu_core_pause`: for the first billion cycles after the thread has woken up, it busy-waits for a message before going back to longer sleeps. That's about half a second of busy waiting after each timer trigger (at a 2 GHz clock speed). With a 2-second timer interval, that averages out to 25% CPU usage, which matches what we see.

Using higher-frequency CPU usage tracking confirms this (see screenshot). The process is at 100% for just under half a second, then shows some intermittent 5-10% usage for a second, then drops to zero. This is exactly what I'd expect from looking at the function.
Like I mentioned on the PR, `nanosleep(0)` on Linux actually takes a lot of time, which explains why this isn't an issue there. I still don't understand why pre-Sierra doesn't have this problem, since it seems like `nanosleep(0)` was already fast there. (I only have Sierra, so I had to ask someone else to measure it. If anyone with pre-Sierra wants to confirm this: https://gist.github.com/plietar/d0de6de87a7b709aa214788cfa5adc5f)
I see similar high cpu usage on pre-Sierra:
```
$ uname -a
Darwin Dipins-MBP.home 14.5.0 Darwin Kernel Version 14.5.0: Wed Jul 29 02:26:53 PDT 2015; root:xnu-2782.40.9~1/RELEASE_X86_64 x86_64 i386 MacBookPro11,2 Darwin
```
With the following ponyc:
```
$ ponyc --version
0.13.0 [release]
compiled with: llvm 3.8.1 -- clang version 3.8.1 (tags/RELEASE_381/final)
```
I compiled and ran the following modified program:
```pony
use "time"

actor Main
  new create(env: Env) =>
    let timers = Timers
    let timer = Timer(Notify(env), 5_000_000_000, 2_000_000_000)
    timers(consume timer)

class Notify is TimerNotify
  let _env: Env
  var _counter: U32 = 0

  new iso create(env: Env) =>
    _env = env

  fun ref apply(timer: Timer, count: U64): Bool =>
    _env.out.print(_counter.string())
    _counter = _counter + 1
    if _counter == 10 then
      false
    else
      true
    end
```
Run output on OS X:
```
$ time ./timer-issue
0
1
2
3
4
5
6
7
8
9

real 0m23.079s
user 0m3.294s
sys  0m4.541s
```
Same program run in an Ubuntu 16.04 VM on the same OS X host results in the following:
```
$ time ./timer-issue
0
1
2
3
4
5
6
7
8
9

real 0m23.036s
user 0m0.572s
sys  0m0.088s
```
One similar timing-related issue from the Apple Darwin mailing lists is: https://lists.apple.com/archives/darwin-kernel/2007/Feb/msg00031.html. That thread seems to come to the conclusion that the extra busy-waiting-related CPU usage is due to the `nanosleep` call returning much more quickly on OS X than on Linux.
Assuming others agree that this is the root cause, the solution would be to sleep for a specific amount of time in `nanosleep` (and similarly on the Windows side), instead of sleeping for 0 or 1 nanoseconds and relying on the OS taking significantly longer anyway without burning extra CPU, since that behavior doesn't always hold true.
Can the nanosleep value used be calibrated at runtime? Otherwise hardcoded values will become bugs in later OS updates.
@agarman I'm not sure I understand your comment/concern. The call to `nanosleep` is in the C code for the runtime scheduler logic, for when there are no actors with any work available for the scheduler thread to run.

So far, I think we've been relying on the implicit behavior where a `nanosleep` of 0 nanoseconds actually takes significantly longer than that. I think we should figure out the "appropriate" amount of time to sleep a scheduler thread when there is no work before checking for work again, instead of relying on unpredictable OS behavior. Granted, this wouldn't preclude the OS from sleeping longer than we requested, but it would prevent unnecessary busy waiting and CPU burning in situations where the OS actually does a good job of trying to meet our requested sleep time.

I don't see how this would become a bug in a later OS update, since we're just explicitly saying that we want to wait 10 microseconds (for example) before we check for work again, instead of saying we want to wait 0 nanoseconds while assuming the OS will always take longer than that. I don't think later OS updates would invalidate our decision to wait for 10 microseconds.
Well, the above is my understanding but maybe I'm missing something?
@dipinhora -- thanks for the clarification. If I understand correctly, the solution would no longer use a `nanosleep` of 0, nor any OS-specific durations?
@agarman I'm not sure what you mean by "any OS specific durations", but my suggestion is that we move away from using 0 nanoseconds to a "more realistic" duration that explicitly balances CPU usage against time between checks (instead of the current reliance on the implicit behavior of the `nanosleep` call taking significantly longer than requested). The decision could be that the "more realistic" duration is custom/specific to each OS, but that seems unlikely to be necessary.
Got it. Sounds reasonable. FWIW erts looks like it uses sleep 0, but I'm just a tourist wrt ponyc or erts c internals.
Would we take any significant performance hit if we made the sleep time configurable via a `--ponywhatever=$ns` option at runtime? That is, having the value read from a static variable at runtime rather than being a compile-time constant.
Trying to explain how that interacts with things could be seriously difficult and black magic.
I'm always wary of options that are basically magic.
Is it black magic? If I'm understanding it correctly, the concept seems fairly simple.
It seems analogous to a global "slop" option, where you can increase the number for less CPU usage, and decrease it for more precise timings, with diminishing returns in either direction if you stray too far from the default.
@jemc you are referring to the values in `ponyint_cpu_core_pause`, correct?

If yes, then I think it's black magic because it's called as part of quiescence, and changing that value can have a large impact on performance and work stealing.
I'm somewhere in the middle. The following are my -$0.02 8*)
I agree with @SeanTAllen that we have to be careful since this is directly related to work stealing and other scheduler internals and not very easy to understand or explain the impact.
On the other hand, I think having it be a configurable option for advanced users is acceptable and no more dangerous than some of the thread related options. This assumes that we'll choose a sensible default that we would otherwise be hardcoding into a compile time constant.
I also observe this behaviour on Ubuntu 16.04.
@adam-antonik Can you please confirm some details? It seems surprising to me that Ubuntu 16.04 has the same behavior, and I'd like to investigate a bit if you're able to provide the following information:

`uname -a`
`ponyc --version`
`time ./timer-issue` (after compiling the program, without the `-d` flag, in https://github.com/ponylang/ponyc/issues/1787#issuecomment-295013712)

@dipinhora Requested details below. The difference from your output on OS X is that we don't use as much user time, but sys is still too high.
```
Linux adam-750-279na 4.4.0-81-generic #104-Ubuntu SMP Wed Jun 14 08:17:06 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

0.14.0-f70dba0 [release] compiled with: llvm 3.9.1 -- cc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609

0 1 2 3 4 5 6 7 8 9
real 0m23.083s
user 0m0.388s
sys  0m2.136s
```
@adam-antonik Thanks for the quick response. Based on my limited knowledge of Linux and its internal clocksources and timing subsystem, I don't think the behavior you're seeing is an issue with Pony itself, but a side effect of how Pony relies on the `rdtsc` instruction (see https://en.wikipedia.org/wiki/Time_Stamp_Counter) to read the TSC value for internal scheduler thread timing. I would be very surprised if other programs relying on the `rdtsc` instruction don't have similar overhead to what you're seeing in Pony. It is possible that, if you're on a VM, the VM is emulating the `rdtsc` instruction for reading the TSC, and that is why system overhead is so high.

Additionally, my understanding is that all Linux timing-related system overhead depends on the current clocksource the kernel is using, and not all clocksources are created equal. Some are very low overhead (tsc) while others have much higher overhead (acpi_pm, I think). There's some information about clocksources in the accepted answer at https://unix.stackexchange.com/questions/164512/what-does-the-change-of-the-clocksource-influence and the links it provides.
Can you confirm the output of:

`cat /sys/devices/system/clocksource/clocksource0/available_clocksource`

and:

`cat /sys/devices/system/clocksource/clocksource0/current_clocksource`

Are you running on bare metal or in a VM of some kind? If a VM, it may not be relying on a hardware clocksource but instead on a software-emulated one, and that could have a large system overhead (I've personally seen this sort of overhead impact a latency-sensitive application when the `current_clocksource` wasn't TSC, but I couldn't explain all of the details as to why/how it all interacts). If bare metal, it is possible the kernel detected an unstable clocksource (for TSC) and is using a less efficient one instead. This is usually possible to resolve by either overriding the current clocksource or rebooting.
This is caused by work stealing. The hard part would be making work stealing nicer in this scenario without impacting performance during high loads. Sylvan and I are poking at it, but we've poked at it before.
@SeanTAllen I believe you're referring to this issue/ticket in general (i.e. the busy waiting related to the internal scheduler thread timing for work stealing), and not specifically to the unusually high system time that @adam-antonik is encountering. Please correct me if I'm wrong.

I don't believe @adam-antonik's issue is related to the work-stealing busy wait, but rather to something with the timing subsystem in his environment, as I tried to explain in my previous comment.
I'm referring to the issue in general. There may be other factors as well.
I'm a Pony newbie, but I have a long and usually-fruitful habit of using `strace`, `ktrace`, and `truss` on various UNIX flavors to observe their internals. OS X's closest equivalent is `dtruss`, which is implemented in DTrace. Here's what I found with some experimenting on my Sierra/10.12.6 machine this afternoon.
I've had to make a copy of that script and then add the following pragma at the top of the embedded D script. The event rate is so extremely high that the DTrace internal buffers overflow.

```
#pragma D option dynvarsize=1024m
```

It still occasionally loses events ... check its output for errors such as `346307 dynamic variable drops with non-empty dirty list`.
For example, running the Pony refdoc's timer example of a single timer firing every 8 seconds ... nearly every syscall found by `dtruss` is `__semwait_signal()`. And it is called at an extremely high rate, in cycles. It goes roughly like this:
That is an extraordinary number of system calls for a tiny app with a timer wheel with one timer in it.
A very naive experiment suggests that each `nanosleep()` call is translated into a single `__semwait_signal()` system call. That holds for extremely large intervals such as 1 second or more; I haven't explored smaller intervals.
I've put a raw DTrace output file at http://www.snookles.com/scotttmp/1787-timer-exploration.txt for those who haven't twiddled their OS X settings to permit DTrace full access to the OS. The really ugly 1-liner script that I used to classify small/medium/large sleep times is:
```
% egrep 'semwait|write' /tmp/yoo6de | awk '/write/ { print "WRITE" } { if ($2 < 900) {if (l > 0) { printf("large %d usec\n", l); } print "small"; s++; m = 0; l = 0;} else if ($2 < 9999) { if (s > 0) { printf "%d small\n", s } print "medium"; s = 0; m += $2; l = 0; } else { if (m > 0) { printf "medium %d usec\n", m } print "large" ; s = 0; m = 0; l += $2} }' | uniq | egrep 'usec|count|WRITE|small'
```
... which prints the following. The "small" category events are merely counted, since most of them report elapsed time as 0 or 1 microseconds (too much rounding error lurking there, I think).
```
large 824577 usec
small
87275 small
medium 887975 usec
large 1799915 usec
small
WRITE
small
127259 small
medium 888596 usec
large 3206314 usec
small
87696 small
medium 885989 usec
large 1861788 usec
small
WRITE
small
115488 small
medium 890829 usec
large 3213300 usec
small
87633 small
medium 886477 usec
large 1902987 usec
small
WRITE
small
107248 small
medium 885446 usec
large 3207897 usec
small
85618 small
medium 888300 usec
large 1965298 usec
small
WRITE
small
97915 small
medium 886948 usec
```
Update: @SeanTAllen found a lovely little bug in `ponyint_cpu_core_pause()`. I'll play with it a bit, but it seems very likely to be an effective fix for this problem.
Brought over from #1625, reported by @jonke. I was trying out the Timer class, running an exact copy of the version in the doc: http://www.ponylang.org/ponyc/time-Timer/#timer
Very high proportional CPU usage on macOS Sierra.
```
$ ponyc --version
0.11.0 [release]
compiled with: llvm 3.9.1 -- Apple LLVM version 8.0.0 (clang-800.0.42.1)
$ bash --version
GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin16)
Copyright (C) 2007 Free Software Foundation, Inc.
```

Installed via homebrew.
Built with every tag from 0.4 up to HEAD, with both llvm 3.8 and 3.9, and could not see any difference with the timer example from the time package.

However, if one changes the slop value, it has a significant effect on the CPU usage: slop = 1 gave me a CPU load of 200%, and slop = 40 gave 0.2% (but then of course the timer doesn't work as expected). The default slop = 20 gives ~25% CPU.
2,7 GHz Intel Core i5