Closed: rgrinberg closed this issue 5 years ago
Definitely would be useful for me!
I'm for this as well!
A low-effort version of that work can be done non-intrusively using perf:

```shell
perf record -g --call-graph=dwarf -F 99 dune build
perf script | /path/to/stackcollapse-perf.pl | ~/.yarn/bin/speedscope -
```

(not sure what's going on with these `sh` processes...)
Wow, the amount of time spent in `__libc_fork` is a bit scary. I wonder if we should try to vendor `spawn`.
@emillon it's because the compiler uses `Sys.system` to call out to external commands such as gcc, as, ar, ...
I'm not sure how precise this graph is; below the first row it might be collapsing things that it shouldn't (i.e., possibly all subprocesses will contribute to that `__libc_fork` part).
> it's because the compiler uses `Sys.system`

Right, I only grepped for `sh` uses from within dune, but `system` does use it. Once again, there might be double counting with the other tools. I'll try to collapse the stack traces by hand to keep only the processes.
I see. What would be interesting is to see the accumulated amount of time spent in the call to `Unix.create_process` in process.ml.
I wrote a quick implementation hooking into process.ml.
(you can see the command names more clearly in the linked images, but roughly purple = gcc, blue = ocamlopt, pink = ocamlc; this run used `-j 1`).
The expected file format is an array of JSON values. This is a bit unfortunate, since it's not directly streamable (we need to emit a final "]" at the end). For this prototype, I just post-processed a stream of JSON values using `jq -s ''`. Buffering might cause some issues, because different processes need to write to the same trace. In this PoC I just flushed the buffer after each event; in practice a mutex might be enough.
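The flush-per-event idea can be sketched as follows (a minimal prototype, not dune's actual code; the function names are invented): each event is written as one JSON object per line, and the channel is flushed immediately so lines from different writers are never torn. The newline-delimited stream can then be wrapped into the JSON array the viewer expects with `jq -s ''`.

```ocaml
(* Hypothetical sketch: one JSON trace event per line, flushed eagerly.
   Wrap the stream into an array afterwards with: jq -s '' *)
let event_line ~name ~ph ~ts ~pid ~tid =
  Printf.sprintf
    "{\"name\": \"%s\", \"ph\": \"%s\", \"ts\": %d, \"pid\": %d, \"tid\": %d}"
    name ph ts pid tid

let emit oc line =
  output_string oc line;
  output_char oc '\n';
  flush oc  (* flush per event: crude but safe for a prototype *)

let () =
  (* a begin/end pair for one external command *)
  emit stdout (event_line ~name:"ocamlopt" ~ph:"B" ~ts:1000 ~pid:1 ~tid:1);
  emit stdout (event_line ~name:"ocamlopt" ~ph:"E" ~ts:5000 ~pid:1 ~tid:1)
```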
There are several kinds of events we can report; the ones I tried are Duration Events and Complete Events. For Duration Events, a begin message and an end message are logged, each with just a timestamp; for Complete Events, a single message is logged with a timestamp and a duration. A challenge in the case of Duration Events is how to attach the end to the beginning. The format uses the PID & TID fields (process & thread numbers): an end event is always attached to the most recent begin event with the same PID & TID. In other words, events on the same PID & TID can nest but cannot overlap.
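For reference, the two shapes look roughly like this per the Trace Event Format documentation (the names and values here are made up; `ts` and `dur` are in microseconds). The first two objects are a Duration Event begin/end pair; the third is a single Complete Event:

```json
[
  {"name": "ocamlopt", "cat": "compiler", "ph": "B", "ts": 1000, "pid": 1, "tid": 1},
  {"name": "ocamlopt", "cat": "compiler", "ph": "E", "ts": 5000, "pid": 1, "tid": 1},

  {"name": "gcc", "cat": "compiler", "ph": "X", "ts": 6000, "dur": 2500, "pid": 1, "tid": 1}
]
```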
For forked processes, a first idea is to emit events with the PID of the new process. In practice that does not work too well: PIDs are not reused, and a timeline is generated per PID, so this produces an unreadable report. Instead, for the above graphs I tracked the duration by hand and emitted a Complete Event.
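That workaround can be sketched like this (a hypothetical illustration, not dune's actual code): the parent times the child by hand around `Unix.create_process`/`Unix.waitpid` and emits a single Complete Event attributed to the parent's own pid, so every external command lands on one readable timeline.

```ocaml
(* Hypothetical sketch: measure a child's wall-clock duration in the
   parent and build a Complete Event ("ph": "X") carrying it. *)
let complete_event ~name ~pid ~ts_us ~dur_us =
  Printf.sprintf
    "{\"name\": \"%s\", \"ph\": \"X\", \"ts\": %d, \"dur\": %d, \"pid\": %d, \"tid\": 1}"
    name ts_us dur_us pid

let timed_run prog args =
  let t0 = Unix.gettimeofday () in
  let child =
    Unix.create_process prog (Array.of_list (prog :: args))
      Unix.stdin Unix.stdout Unix.stderr
  in
  let _, _status = Unix.waitpid [] child in
  let dur = Unix.gettimeofday () -. t0 in
  let us t = int_of_float (t *. 1e6) in
  (* attribute the event to the parent's pid, not the (never reused) child pid *)
  complete_event ~name:(Filename.basename prog) ~pid:(Unix.getpid ())
    ~ts_us:(us t0) ~dur_us:(us dur)

let () = print_endline (timed_run "/bin/echo" ["hello"])
```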
There are other events that might be relevant. For example, Async Events take an extra id managed by the application, which makes it possible to nest events; we could use a finer granularity and distinguish setting up the process from performing the fork, etc. Another interesting type is Instant Events: they are just a point in time with associated data. Say, we could use this when the rules are done computing, or when a GC cycle is triggered.
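Roughly, per the Trace Event Format documentation, an Async pair is correlated by its `id` rather than by PID & TID (so such events may overlap), and an Instant Event is a single point with a scope. The names and values below are invented:

```json
[
  {"name": "run-gcc", "cat": "process", "ph": "b", "id": 42, "ts": 1000, "pid": 1, "tid": 1},
  {"name": "run-gcc", "cat": "process", "ph": "e", "id": 42, "ts": 9000, "pid": 1, "tid": 1},

  {"name": "rules computed", "ph": "i", "s": "g", "ts": 500, "pid": 1, "tid": 1}
]
```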
It is possible to attach a name to an event (here, I used the basename of the program being executed), or categories (e.g. "compiler"). This makes it possible to filter and aggregate events of a given type. As an example, I graphed the distribution of `ocamlc` runs, which can be seen in the previous figure. The last quantile is always pretty large; I'm not sure how to explain that. It does not always seem to be related to the size of the compiled code, since it's not reproducible. Maybe it's related to GC pauses while waiting for the process?
The hooks I wrote are pretty deep down in `Process.run_internal`. It seems unlikely that we can do better than adding a new global to `Clflags`. We'll need to find a place to close and flush the trace fd, since `at_exit` would probably run several times (not sure about that). Also, it might be nice to expose other parts of the code base to this logging system.
Very nice!
Would it be possible to adapt this to benchmark other parts of the build? I am thinking in particular of the time spent reading `dune` files and constructing the rules, and traversing the file system (which "explains" the time taken by a zero build).
Two extra things I forgot:
@nojb yes I think it would be one of the most interesting mysteries to solve!
IMO, it's actually not so important to focus on the performance of generating the rules right now. Meanwhile, @diml is working on making it a lot faster, and hopefully that work will be ready in the medium term. But feel free to optimize away 👍
Once #1280 is done, this will indeed give us the right granularity to measure the cost of internal dune computations. Regarding `at_exit`: in fact, since we control all the code in dune, we shouldn't have to rely on it.
@emillon what came of this? I know you have an external package for doing this stuff, but that doesn't really work for dune, as we can't have external packages.
Nothing much. The part I was not too sure about is how to manage the stats channel, but you're right, it can be dealt with directly in main.ml. I'll try to rebase and publish my branch.
Problem
It's tough to understand where a dune build spends most of its time. In particular, it's not clear which external commands are the most time-consuming.
Solution
Dune could have a profiling mode where it would emit events for the start and termination of external commands. Eventually, dune could even emit events for other actions/events, but I think starting with external commands is fine.
Dune should emit these events according to Chrome's tracing view [1]. If we do that, then we'll be able to explore this information using Chrome's neat visualizations [2]. This practice is quite common and is available in build systems such as Buck and Shake.
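To make the proposal concrete, a minimal trace file in that format, with one begin/end pair per external command, might look like this (values invented; timestamps in microseconds). Loading such a file in chrome://tracing yields the timeline view:

```json
[
  {"name": "ocamlc", "ph": "B", "ts": 0, "pid": 1, "tid": 1},
  {"name": "ocamlc", "ph": "E", "ts": 12000, "pid": 1, "tid": 1},
  {"name": "ocamlopt", "ph": "B", "ts": 12500, "pid": 1, "tid": 1},
  {"name": "ocamlopt", "ph": "E", "ts": 30000, "pid": 1, "tid": 1}
]
```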
[1] https://docs.google.com/document/d/1CvAClvFfyA5R-PhYUmn5OOQtYMH4h6I0nSsKchNAySU/edit
[2] https://www.gamasutra.com/view/news/176420/Indepth_Using_Chrometracing_to_view_your_inline_profiling_data.php