Closed: rgrinberg closed this issue 5 years ago
Definitely would be useful for me!
I'm for this as well!
A low-effort version of that work can be done non-intrusively using perf:

```shell
perf record -g --call-graph=dwarf -F 99 dune build
perf script | /path/to/stackcollapse-perf.pl | ~/.yarn/bin/speedscope -
```

(not sure what's going on with these `sh` processes...)
Wow, the amount of time spent in `__libc_fork` is a bit scary. I wonder if we should try to vendor `spawn`.
@emillon it's because the compiler uses `Sys.system` to call out to external commands such as gcc, as, ar, ...
I'm not sure how precise this graph is; below the first row it might be collapsing things that it shouldn't (i.e., possibly all subprocesses will contribute to that `__libc_fork` part).
> it's because the compiler uses `Sys.system`

Right, I only grepped for `sh` uses from within dune, but `system` does use it. Once again, there might be double counting with the other tools. I'll try to collapse the stack traces by hand to keep only the processes.
I see. What would be interesting is to see the accumulated amount of time spent in the call to `Unix.create_process` in process.ml.
I wrote a quick implementation hooking into process.ml.
(you can see the command names more clearly in the linked images, but roughly purple = gcc, blue = ocamlopt, pink = ocamlc; this run used `-j 1`).
The expected file format is an array of JSON values. This is a bit unfortunate, since it's not directly streamable (we need to emit a final "]" at the end). For this prototype, I just post-processed a stream of JSON values using `jq -s ''`. Buffering might cause some issues, because different processes need to write to the same trace. In this PoC I just flushed the buffer after each event; in practice a mutex might be enough.
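The flush-per-event idea can be sketched as follows (a minimal prototype, not dune's actual code; the function names are invented): each event is written as one JSON object per line, and the channel is flushed immediately so lines from different writers are never torn. The newline-delimited stream can then be wrapped into the JSON array the viewer expects with `jq -s ''`.

```ocaml
(* Hypothetical sketch: one JSON trace event per line, flushed eagerly.
   Wrap the stream into an array afterwards with: jq -s '' *)
let event_line ~name ~ph ~ts ~pid ~tid =
  Printf.sprintf
    "{\"name\": \"%s\", \"ph\": \"%s\", \"ts\": %d, \"pid\": %d, \"tid\": %d}"
    name ph ts pid tid

let emit oc line =
  output_string oc line;
  output_char oc '\n';
  flush oc  (* flush per event: crude but safe for a prototype *)

let () =
  (* a begin/end pair for one external command *)
  emit stdout (event_line ~name:"ocamlopt" ~ph:"B" ~ts:1000 ~pid:1 ~tid:1);
  emit stdout (event_line ~name:"ocamlopt" ~ph:"E" ~ts:5000 ~pid:1 ~tid:1)
```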
There are several kinds of events we can report; the ones I tried are Duration Events and Complete Events. For Duration Events, a begin message and an end message are logged, each with just a timestamp; for Complete Events, a single message is logged with a timestamp and a duration. A challenge in the case of Duration Events is how to attach the end to the beginning. The format uses the PID & TID fields (process & thread numbers): an end event is always attached to the most recent begin event with the same PID & TID. In other words, events on the same PID & TID can nest but cannot overlap.
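For reference, the two shapes look roughly like this per the Trace Event Format documentation (the names and values here are made up; `ts` and `dur` are in microseconds). The first two objects are a Duration Event begin/end pair; the third is a single Complete Event:

```json
[
  {"name": "ocamlopt", "cat": "compiler", "ph": "B", "ts": 1000, "pid": 1, "tid": 1},
  {"name": "ocamlopt", "cat": "compiler", "ph": "E", "ts": 5000, "pid": 1, "tid": 1},

  {"name": "gcc", "cat": "compiler", "ph": "X", "ts": 6000, "dur": 2500, "pid": 1, "tid": 1}
]
```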
For forked processes, a first idea is to emit events with the PID of the new process. In practice that does not work too well: PIDs are not reused, and a timeline is generated per PID, so this produces an unreadable report. Instead, for the above graphs I tracked the duration by hand and emitted a Complete Event.
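That workaround can be sketched like this (a hypothetical illustration, not dune's actual code): the parent times the child by hand around `Unix.create_process`/`Unix.waitpid` and emits a single Complete Event attributed to the parent's own pid, so every external command lands on one readable timeline.

```ocaml
(* Hypothetical sketch: measure a child's wall-clock duration in the
   parent and build a Complete Event ("ph": "X") carrying it. *)
let complete_event ~name ~pid ~ts_us ~dur_us =
  Printf.sprintf
    "{\"name\": \"%s\", \"ph\": \"X\", \"ts\": %d, \"dur\": %d, \"pid\": %d, \"tid\": 1}"
    name ts_us dur_us pid

let timed_run prog args =
  let t0 = Unix.gettimeofday () in
  let child =
    Unix.create_process prog (Array.of_list (prog :: args))
      Unix.stdin Unix.stdout Unix.stderr
  in
  let _, _status = Unix.waitpid [] child in
  let dur = Unix.gettimeofday () -. t0 in
  let us t = int_of_float (t *. 1e6) in
  (* attribute the event to the parent's pid, not the (never reused) child pid *)
  complete_event ~name:(Filename.basename prog) ~pid:(Unix.getpid ())
    ~ts_us:(us t0) ~dur_us:(us dur)

let () = print_endline (timed_run "/bin/echo" ["hello"])
```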
There are other events that might be relevant. For example, Async Events take an extra id managed by the application, which makes it possible to nest events; we could use a finer granularity and distinguish setting up the process from performing the fork, etc. Another interesting type is Instant Events: they are just a point in time with associated data. Say, we could use this when the rules are done computing, or when a GC cycle is triggered.
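Roughly, per the Trace Event Format documentation, an Async pair is correlated by its `id` rather than by PID & TID (so such events may overlap), and an Instant Event is a single point with a scope. The names and values below are invented:

```json
[
  {"name": "run-gcc", "cat": "process", "ph": "b", "id": 42, "ts": 1000, "pid": 1, "tid": 1},
  {"name": "run-gcc", "cat": "process", "ph": "e", "id": 42, "ts": 9000, "pid": 1, "tid": 1},

  {"name": "rules computed", "ph": "i", "s": "g", "ts": 500, "pid": 1, "tid": 1}
]
```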
It is possible to attach a name to an event (here, I used the basename of the program being executed), or categories (e.g. "compiler"). This makes it possible to filter and aggregate events of a given type. As an example, I graphed the distribution of `ocamlc` runs, which can be seen in the previous figure. The last quantile is always pretty large; I'm not sure how to explain that. It does not always seem to be related to the size of the compiled code, since it's not reproducible. Maybe it's related to GC pauses while waiting for the process?
The hooks I wrote are pretty deep down in `Process.run_internal`. It seems unlikely that we can do better than adding a new global to `Clflags`. We'll need to find a place to close and flush the trace fd, since `at_exit` would probably run several times (not sure about that). Also, it might be nice to expose other parts of the code base to this logging system.
Very nice!
Would it be possible to adapt this to benchmark other parts of the build? I am thinking in particular of the time spent reading `dune` files and constructing the rules, and traversing the file system (which "explains" the time taken by a zero build).
Two extra things I forgot:
@nojb yes I think it would be one of the most interesting mysteries to solve!
IMO, it's actually not so important to focus on the performance of generating the rules right now. Meanwhile, @diml is working on making it a lot faster, and hopefully that work will be ready in the medium term. But feel free to optimize away 👍
Once #1280 is done, this will indeed give us the right granularity to measure the cost of internal dune computations. Regarding `at_exit`: in fact, since we control all the code in dune, we shouldn't have to rely on it.
@emillon what came of this? I know you have an external package for doing this stuff, but that doesn't really work for dune, as we can't have external packages.
Nothing much. The part I was not too sure about is how to manage the stats channel, but you're right, it can be dealt with directly in main.ml. I'll try to rebase and publish my branch.
Problem
It's tough to understand where a dune build spends most of its time. In particular, it's not clear which external commands are the most time-consuming.
Solution
Dune could have a profiling mode where it would emit events for the start and termination of external commands. Eventually, dune could even emit events for other actions/events, but I think starting with external commands is fine.
Dune should emit these events according to Chrome's tracing view [1]. If we do that, then we'll be able to explore this information using Chrome's neat visualizations [2]. This practice is quite common and is available in build systems such as Buck and Shake.
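To make the proposal concrete, a minimal trace file in that format, with one begin/end pair per external command, might look like this (values invented; timestamps in microseconds). Loading such a file in chrome://tracing yields the timeline view:

```json
[
  {"name": "ocamlc", "ph": "B", "ts": 0, "pid": 1, "tid": 1},
  {"name": "ocamlc", "ph": "E", "ts": 12000, "pid": 1, "tid": 1},
  {"name": "ocamlopt", "ph": "B", "ts": 12500, "pid": 1, "tid": 1},
  {"name": "ocamlopt", "ph": "E", "ts": 30000, "pid": 1, "tid": 1}
]
```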
[1] https://docs.google.com/document/d/1CvAClvFfyA5R-PhYUmn5OOQtYMH4h6I0nSsKchNAySU/edit
[2] https://www.gamasutra.com/view/news/176420/Indepth_Using_Chrometracing_to_view_your_inline_profiling_data.php