dhil opened this issue 3 years ago
I like this idea. My only concern is how noisy we expect the data to be:
Just to get an idea of what kind of noise we may be seeing if we used Travis bots to run the benchmarks, I looked at the overall time spent running the entire CI as well as the time spent running the Links tests (`make tests`).
The Travis build times on master over the last couple of months (nicely shown at https://travis-ci.org/github/links-lang/links/builds) seem to range from 17 to 22 minutes, with the very latest one (https://travis-ci.org/github/links-lang/links/builds/768117701) taking 24 minutes.
There seems to be a lot of variance in the time spent just running `make tests` (which should not be affected by things like network and disk IO; those mainly affect the overall CI time, given that we spend most of it building ocaml + opam packages): we are seeing durations ranging at least from 217s (https://travis-ci.org/github/links-lang/links/builds/756232009) to 304s (https://travis-ci.org/github/links-lang/links/builds/768117701), to pick two data points at random.
I guess that still allows us to catch the worst regressions, but for something like a 25% regression we may not be able to trigger an automated alert from a single run; instead, we would have to look at graphs showing a longer-term trend (or trigger alerts based on trends across multiple commits). But of course, being able to catch some regressions is better than catching none of them!
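To make the noise concern a bit more concrete, here is a tiny back-of-the-envelope sketch (plain Python, standard library only). Only the 217s and 304s figures come from the builds linked above; the other durations are made-up placeholders, purely to illustrate how one could estimate the smallest regression that a single-run threshold alert can reliably catch.

```python
import statistics

# Observed `make tests` durations in seconds. Only 217 and 304 are real data
# points (the two Travis builds linked above); the others are made-up
# placeholders for illustration.
durations = [217, 304, 250, 265, 240]

mean = statistics.mean(durations)
stdev = statistics.stdev(durations)
cv = stdev / mean  # coefficient of variation (relative run-to-run noise)

# Rule of thumb: a single-run threshold alert is only trustworthy for
# regressions well above the run-to-run noise, say three standard deviations.
print(f"mean={mean:.0f}s stdev={stdev:.0f}s cv={cv:.0%}")
print(f"single-run alerts are only reliable above roughly {3 * cv:.0%}")
```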
I think as a first step, I would just like to see some graphs. I think we can fairly easily configure GHA to run a performance analysis job. It would be good to think about some metrics that we would be interested in, e.g. interpreter runtime performance, type inference performance, IR type checking performance, size of the Links binary, etc. Then we can pick one or two metrics to configure an initial prototype.
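To make this concrete, below is a minimal sketch of what such a GHA job could invoke after building Links. Everything specific in it is an assumption: the binary path `./links`, the benchmark program `benchmarks/nqueens.links`, and the output file name are placeholders to be replaced by whatever the repository actually provides (`GITHUB_SHA` is the commit hash that GHA exposes to jobs).

```python
#!/usr/bin/env python3
"""Sketch of a benchmark runner a GHA performance job could call after building Links.

All paths and names below are placeholders; adjust LINKS_BINARY and the
benchmark list to match the actual repository layout.
"""
import json
import os
import subprocess
import time

LINKS_BINARY = "./links"                   # hypothetical path to the built binary
BENCHMARKS = ["benchmarks/nqueens.links"]  # hypothetical benchmark programs

def time_run(program: str) -> float:
    """Wall-clock time of one interpreter run, in seconds."""
    start = time.perf_counter()
    subprocess.run([LINKS_BINARY, program], check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start

record = {
    "commit": os.environ.get("GITHUB_SHA", "unknown"),   # provided by GHA
    "binary_size_bytes": os.path.getsize(LINKS_BINARY),  # binary-size metric
    # Take the best of three runs per benchmark to dampen noise a little.
    "runtimes": {b: min(time_run(b) for _ in range(3)) for b in BENCHMARKS},
}

# Append one JSON object per commit; a later job can plot or compare these.
with open("perf-results.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```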
Needless to say, performance has not been our primary concern in the past. Nevertheless, I think it is good to at least adopt some performance awareness so that we do not inadvertently regress the performance of Links. There is at least evidence of performance regressions and hypothetical improvements in the past, e.g. #35, #228, #359, #650. Alas, we have never really followed up on them in a systematic way. It would be good to track the performance progression of Links on a regular basis.
What I have in mind is to set up a bot to build and benchmark Links whenever a change is committed to the `master` branch. We can also consider benchmarking changes in pull requests. If we store the benchmark data in some database, then we can track performance progression over time. We can set up a small webpage to present the collected data in various ways, e.g. sequence charts such as the one below (courtesy of the Computer Language Benchmarks Game). We can also consider having the bot alert us about "severe" performance regressions by automatically opening an issue pointing to the empirical evidence.
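As a sketch of the alerting part: if every run appends a record like the one above to a small "database" (a JSON-lines file or SQLite would do for a start), a follow-up step could compare the latest run against a rolling baseline and flag severe regressions. The 25% threshold, window size, and file name below are arbitrary placeholders; actually opening an issue could be done via the GitHub REST API (POST /repos/{owner}/{repo}/issues), which is omitted here.

```python
import json
import statistics

THRESHOLD = 0.25    # arbitrary: flag regressions worse than 25%
BASELINE_RUNS = 10  # arbitrary: compare against the median of the last 10 runs

with open("perf-results.jsonl") as f:
    records = [json.loads(line) for line in f]

if len(records) < 2:
    raise SystemExit("not enough history to compare against")

latest, history = records[-1], records[-BASELINE_RUNS - 1:-1]

alerts = []
for bench, runtime in latest["runtimes"].items():
    baseline = statistics.median(r["runtimes"][bench] for r in history)
    if runtime > baseline * (1 + THRESHOLD):
        alerts.append(f"{bench}: {runtime:.1f}s vs baseline {baseline:.1f}s")

if alerts:
    # A real bot would open a GitHub issue with the evidence attached;
    # here we just fail the job so the regression shows up in CI.
    raise SystemExit("Possible performance regression:\n" + "\n".join(alerts))
```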
There are multiple dimensions along which we can benchmark Links; at the very least, I think we ought to test the time and space performance of the server-side interpreter and the JavaScript compiler. Benchmarking interactions between the two would be interesting, though.
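For the space dimension, one possible approach (on Unix) is to read the child's resource usage via `os.wait4`, which reports the peak resident set size for that particular run alongside the wall-clock time we already measure. The sketch below reuses the same placeholder binary and benchmark names as above.

```python
import os
import subprocess
import time

def run_and_measure(cmd: list[str]) -> tuple[float, int]:
    """Run cmd once; return (wall-clock seconds, peak RSS reported by wait4).

    Note: ru_maxrss is in KiB on Linux but in bytes on macOS.
    """
    start = time.perf_counter()
    proc = subprocess.Popen(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    _, status, rusage = os.wait4(proc.pid, 0)             # rusage is for this child only
    proc.returncode = os.waitstatus_to_exitcode(status)   # keep Popen's bookkeeping consistent
    elapsed = time.perf_counter() - start
    if proc.returncode != 0:
        raise RuntimeError(f"{cmd} failed with exit code {proc.returncode}")
    return elapsed, rusage.ru_maxrss

# Hypothetical invocation; replace with the actual Links binary and a real benchmark.
elapsed, peak_rss = run_and_measure(["./links", "benchmarks/nqueens.links"])
print(f"time={elapsed:.2f}s peak_rss={peak_rss} (KiB on Linux)")
```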
Automatic performance progression tracking could also be a valuable tool for when we start thinking more seriously about optimisations.
I reckon we could get something going very easily with some Python hackery. It may make a good student or intern project.