sagemath / sage

Main repository of SageMath
https://www.sagemath.org

Track performance regressions in CI #25262

Open saraedum opened 6 years ago

saraedum commented 6 years ago

I am currently playing with airspeed velocity to track speed regressions in Sage. I would like to benchmark every doctest that has a "long time" or "benchmark" marker in it and also benchmark every method that has a time_ prefix (probably only in some benchmark module).

We have something similar set up for https://github.com/MCLF/mclf/tree/master/mclf/benchmarks now. There are only two benchmarks but it works nicely.

I ran the above proposal for all the tags from 8.3.beta0 to 8.3. There's a lot of noise (because there was other activity on the machine) but you get the idea: https://saraedum.github.io/sage/

Another interesting demo of airspeedvelocity that is not related to Sage is here: https://pv.github.io/numpy-bench/#/regressions

Depends on #24655

CC: @roed314 @embray @nthiery @koffie @videlec

Component: doctest framework

Keywords: ContinuousIntegration

Work Issues: documentation, doctests, CI

Author: Julian Rüth

Branch/Commit: public/airspeed_velo @ 68869ae

Issue created by migration from https://trac.sagemath.org/ticket/25262

saraedum commented 6 years ago
comment:1

I think we have to work with the Advanced API (https://docs.python.org/2/library/doctest.html#advanced-api) and hook into DocTestRunner.run() to track timings and export them into an artificial benchmark/ directory that just prints these timings for asv.
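
For illustration, a minimal sketch of such a hook using the stdlib doctest module (the class name and the benchmark/doctest_timings.json path are made up; Sage's actual runner lives in sage/doctest/forker.py):

import doctest
import json
import time

class TimingDocTestRunner(doctest.DocTestRunner):
    """Record the wall-clock time of every doctest that is run."""
    timings = {}

    def run(self, test, compileflags=None, out=None, clear_globs=True):
        start = time.time()
        result = doctest.DocTestRunner.run(
            self, test, compileflags=compileflags, out=out, clear_globs=clear_globs)
        self.timings[test.name] = time.time() - start
        return result

def export_timings(path="benchmark/doctest_timings.json"):
    # a generated benchmark module could read this file and simply report
    # the recorded numbers to asv as track_* values
    with open(path, "w") as f:
        json.dump(TimingDocTestRunner.timings, f, indent=2)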

roed314 commented 6 years ago
comment:2

This is great, and I'm happy to help!

We're already using the advanced API. See sage/doctest/forker.py, lines 425 to 786 (maybe it would make sense to do the exporting in summarize).

jdemeyer commented 6 years ago
comment:3

Just to say something which I have always said before: measuring timings is the easy part. The hard part is doing something useful with those timings.

jdemeyer commented 6 years ago
comment:4

Duplicate of #12720.

saraedum commented 6 years ago
comment:5

I don't think this is a duplicate. This is about integrating speed regression checks into CI (GitLab CI, CircleCI). Please reopen.

saraedum commented 6 years ago
comment:7

Replying to @jdemeyer:

Just to say something which I have always said before: measuring timings is the easy part. The hard part is doing something useful with those timings.

That's what airspeed velocity is good for.

embray commented 6 years ago
comment:8

I am currently playing with airspeed velocity to track speed regressions in Sage

Great! It's an excellent tool and I've wanted to see it used for Sage for a long time, but wasn't sure where to begin. In case it helps I know and have worked with its creator personally.

embray commented 6 years ago
comment:9

Even if #12720 was addressing a similar problem, it's an orthogonal approach, and if Julian can get ASV working this could supersede #12720.

jdemeyer commented 6 years ago
comment:10

Replying to @saraedum:

Replying to @jdemeyer:

Just to say something which I have always said before: measuring timings is the easy part. The hard part is doing something useful with those timings.

That's what airspeed velocity is good for.

Well, I'd love to be proven wrong. I thought it was just a tool to benchmark a given set of commands across versions and display fancy graphs.

embray commented 6 years ago
comment:12

Not just across versions but across commits, even (though I think you can change the granularity). Here are Astropy's ASV benchmarks: http://www.astropy.org/astropy-benchmarks/

There are numerous benchmark tests for various common and/or time-critical operations. For example, we can track how coordinate transformations perform over time (which is one example of complex code that can fairly easily be thrown into bad performance by just a few small changes somewhere).

videlec commented 6 years ago
comment:14

update milestone 8.3 -> 8.4

saraedum commented 6 years ago

Author: Julian Rüth

saraedum commented 6 years ago
comment:15

Adding this to all doctests is probably hard and would require too much hacking on asv. It's probably best to use the tool as it was intended to be used. Once #24655 is in, I would like to set up a prototype within Sage. Any area that you would like to have benchmarked from the start?

saraedum commented 6 years ago

Description changed:

--- 
+++ 
@@ -1,3 +1,5 @@
-I am currently playing with airspeed velocity to track speed regressions in Sage. For a start I would like to track the runtime of our doctests per method and see if that is already useful information.
+I am currently playing with airspeed velocity to track speed regressions in Sage. ~~For a start I would like to track the runtime of our doctests per method and see if that is already useful information.~~ We could have specially named methods, say starting in `_benchmark_time_…` that are automatically tracked by CI on the develop branch.

-A demo is here: https://pv.github.io/numpy-bench/#/regressions
+We have this set up for https://github.com/MCLF/mclf/tree/master/mclf/benchmarks now. There are only two benchmarks but it works nicely.
+
+A more interesting demo that is not related to Sage is here: https://pv.github.io/numpy-bench/#/regressions
jdemeyer commented 6 years ago
comment:16

Replying to @saraedum:

Any area that you would like to have benchmarked from the start?

This is the "hard part" that I mentioned in [comment:3]. Ideally, we shouldn't have to guess where regressions might occur, the tool would do that for us. I believe that the intention of #12720 was to integrate this in the doctest framework such that all(?) doctests would also be regression tests.

But that's probably not feasible, so here is a more productive answer:

  1. All # long time tests should definitely be regression tests.

  2. For each Parent (more precisely: every time that a TestSuite appears in a doctest): test creating a parent, test creating elements, test some basic arithmetic (also with elements of different parents so that we check the coercion model too); see the sketch after this list.
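
For illustration, a rough sketch of what a hand-written benchmark module along these lines could look like (the file name, class name and concrete rings are made up; the time_ prefix and the setup hook follow asv's standard conventions):

# benchmarks/parents.py (hypothetical module picked up by asv)
from sage.all import QQ, PolynomialRing

class PolynomialArithmetic:
    def setup(self):
        # asv calls this before each benchmark
        self.R = PolynomialRing(QQ, 'x')
        self.f = self.R(list(range(1, 51)))
        self.g = self.R([1] * 50)

    def time_parent_creation(self):
        PolynomialRing(QQ, 'y')

    def time_element_creation(self):
        self.R(list(range(1, 51)))

    def time_multiplication(self):
        self.f * self.g

    def time_coercion_addition(self):
        # mixes elements of different parents to exercise the coercion model
        self.f + QQ(1) / 2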

jdemeyer commented 6 years ago

Replying to @saraedum:

We could have specially named methods, say starting in _benchmark_time_…

Adding a new method for each regression test sounds quite heavy. Could it be possible to integrate this in doctests instead? I would love to do

EXAMPLES::

    sage: some_sage_code()  # airspeed
embray commented 6 years ago
comment:18

Replying to @saraedum:

Adding this to all doctests is probably hard and would require too much hacking on asv. It's probably best to use the tool as it was intended to be used. Once #24655 is in, I would like to setup a prototype within Sage. Any area that you would like to have benchmarked from the start?

I didn't realize you were trying to do that. And yeah, I think benchmarking every test would be overkill and would produce too much noise to be useful. Better to write specific benchmark tests, and also add new ones as regression tests whenever some major performance regression is noticed.

koffie commented 6 years ago
comment:19

Replying to @saraedum:

Adding this to all doctests is probably hard and would require too much hacking on asv. It's probably best to use the tool as it was intended to be used. Once #24655 is in, I would like to setup a prototype within Sage. Any area that you would like to have benchmarked from the start?

I can't imagine why you would like to start in one place, but if it really makes your life easier I would start with linear algebra. This is so abundant in other parts of Sage that any regression there will very likely show up in other places.

saraedum commented 6 years ago

Branch: u/saraedum/25262

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 6 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

f7f3847 Faster benchmark discovery during run
7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 6 years ago

Commit: f7f3847

nthiery commented 6 years ago
comment:22

And yeah, I think benchmarking every test would be overkill and would produce too much noise to be useful.

I could imagine situations where I would be curious to know how the speed of a given doctest (granularity to be discussed) has evolved over time. Or where I would like to investigate how this or that (collection of) doctest was impacted by this or that ticket.

So even though having info about all doctests would indeed pollute the main "speed regression" report, it could still be interesting to harvest it and make it available with some search mechanism.

Of course this is just "good to have", if not too costly to implement/produce/store/serve.

saraedum commented 6 years ago
comment:23

So, I now ran benchmarks for all doctests that contain a "long time" marker. I tried to run it for all the tags between 8.2 and 8.3, which took about 48h on my laptop. Somehow it failed for 8.2 itself, which makes the results not terribly useful, and there's also a lot of noise, likely because I was using my computer for other things at the same time.

Anyway, you can see the result here: https://saraedum.github.io/sage/

So, what do you think? Should we try to run time_* methods (the default behaviour of airspeed velocity) and also all doctests that say "long time"?

saraedum commented 6 years ago
comment:24

Btw., the naming of the benchmarks is a bit unfortunate currently as you can only see the module and the method name but not the class which makes it a bit hard to track down which __init__ exactly saw a regression.

saraedum commented 6 years ago

Description changed:

--- 
+++ 
@@ -1,5 +1,7 @@
-I am currently playing with airspeed velocity to track speed regressions in Sage. ~~For a start I would like to track the runtime of our doctests per method and see if that is already useful information.~~ We could have specially named methods, say starting in `_benchmark_time_…` that are automatically tracked by CI on the develop branch.
+I am currently playing with airspeed velocity to track speed regressions in Sage. I would like to benchmark every doctest that has a "long time" marker in it and also benchmark every method that has a `time_` prefix (probably only in some benchmark module.)

-We have this set up for https://github.com/MCLF/mclf/tree/master/mclf/benchmarks now. There are only two benchmarks but it works nicely.
+We have something similar set up for https://github.com/MCLF/mclf/tree/master/mclf/benchmarks now. There are only two benchmarks but it works nicely.

-A more interesting demo that is not related to Sage is here: https://pv.github.io/numpy-bench/#/regressions
+I ran the above proposal for all the tags from 8.3.beta0 to 8.3. There's a lot of noise (because there was other activity on the machine) but you get the idea: https://saraedum.github.io/sage/
+
+Another interesting demo of airspeedvelocity that is not related to Sage is here: https://pv.github.io/numpy-bench/#/regressions
embray commented 6 years ago
comment:25

Why did it take 48 hours do you think? That seems a bit excessive.

embray commented 6 years ago
comment:26

Replying to @saraedum:

Btw., the naming of the benchmarks is a bit unfortunate currently as you can only see the module and the method name but not the class which makes it a bit hard to track down which __init__ exactly saw a regression.

That seems fixable. Is that an ASV bug or something on our end? Also, I see a few tests with ? in the name for which there's no graph shown. Could that be from nested classes or something?

embray commented 6 years ago
comment:27

Replying to @saraedum:

also there's a lot of noise which is likely because well I was using my computer to do other stuff as well.

If we can get several machines (even just 2) providing benchmark results, this sort of problem will be mitigated. For example, here's a benchmark from Astropy for which we have results from 2 machines: http://www.astropy.org/astropy-benchmarks/#coordinates.FrameBenchmarks.time_init_array. You can clearly see when major deviations are correlated.

saraedum commented 6 years ago
comment:28

Replying to @embray:

Why did it take 48 hours do you think? That seems a bit excessive.

I did a make build && sage -ba for, I think, 10 tags (as sage -b sometimes missed some files). Then asv runs the doctests single-threaded, and there are a few seconds of overhead for every benchmark.

Actually "about 48h" is incorrect. It's probably more than 24h and less than 48h. But I did not really check the times.

saraedum commented 6 years ago
comment:29

Replying to @embray:

Replying to @saraedum:

Btw., the naming of the benchmarks is a bit unfortunate currently as you can only see the module and the method name but not the class which makes it a bit hard to track down which __init__ exactly saw a regression.

That seems fixable. Is that an ASV bug or something on our end?

My fault.

Also, I see a few tests with ? in the name for which there's no graph shown. Could that be from nested classes or something?

I have not looked into these. Some of the empty graphs are because the benchmark timed out after 60s. That could be changed of course.

saraedum commented 6 years ago
comment:30

Replying to @saraedum:

Replying to @embray:

Replying to @saraedum: Also, I see a few tests with ? in the name for which there's no graph shown. Could that be from nested classes or something?

I have not looked into these. Some of the empty graphs are because the benchmark timed out after 60s. That could be changed of course.

No, actually the ? failed because that's not a valid method name anymore. So, also my fault ;)

embray commented 6 years ago
comment:31

Replying to @embray:

Why did it take 48 hours do you think? That seems a bit excessive.

I guess, now that I think about it, if you were doing each release between 8.2 and 8.3 you also had to do incremental builds of each of those, which could take some time as well, especially if it was just on your laptop, which you were also doing other work on.

Normally this wouldn't be a problem for machines generating new benchmark results between each release.

embray commented 6 years ago
comment:32

My other comments aside, this looks great!

I don't think it's too much noise. In general, most people will only be looking at benchmarks for areas of the code that they're particularly concerned about, though it would also be good to occasionally scan for any major regressions. It will also be good if we can get better looking display names on each of the benchmarks. If there's something we need to patch upstream in ASV to improve that I'm sure we could. It will also obviously be more useful if there are multiple machines providing benchmarks. Perhaps we could integrate this into the buildbot builds.

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 6 years ago

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

d7ff532 A proof of concept of airspeed velocity integration
7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 6 years ago

Changed commit from f7f3847 to d7ff532

embray commented 6 years ago
comment:34

Also, we could group benchmarks by package like this: http://www.astropy.org/astropy-benchmarks/#/

I bet with a bit of hacking we could even get the mouseover that shows the test to actually display the relevant doctest instead. If nothing else, this could be done by generating functions that have the relevant doctest in its docstring :)

saraedum commented 6 years ago
comment:35

Replying to @embray:

It will also be good if we can get better looking display names on each of the benchmarks. If there's something we need to patch upstream in ASV to improve that I'm sure we could.

We could improve it somewhat, but I would like to have a hash in there somewhere and it also has to be a valid Python 2 method name.

This is really some black metaclass magic to have asv detect one benchmark method for every doctest we have. Let me try to explain how this goes roughly.

ASV first runs a "discovery" where it collects all the benchmarks (by looking for methods that start with certain prefixes such as time_.) Then it invokes itself once for each such method to run the actual benchmark.

To trick the discovery into finding a method for every doctest, I implement __dir__ on the BenchmarkMetaclass to run all doctests in the Sage library but without actually running the code. Instead I just print the expected output to have them all pass as quickly as possible and track their existence. Here, I create a hash of the doctest so I can find it again in the second pass of ASV.

Then when ASV actually tries to run, say __init__.Benchmarks.track__families__RingedTree__62b2284259a21f85bd8db00b8522ad3b, I inject that method in __getattr__ and have it run all the doctests in **/families*.pyx? but again I actually skip all the doctests except for the first one that produces 62b2284259a21f85bd8db00b8522ad3b as its hash. I time every line of that doctest and return these timings as the result to ASV.
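
For illustration, the two passes can be mimicked in a few lines of plain Python (Python 3 syntax; the discover_doctests helper and its snippets are made up, and the real code returns actual timings instead of the placeholder below):

import hashlib

def discover_doctests():
    # stand-in for walking the Sage library without executing any code:
    # map a stable hash to the source of each doctest
    snippets = ["factor(2**64 + 1)", "matrix(3, range(9)).det()"]
    return {hashlib.md5(s.encode()).hexdigest(): s for s in snippets}

class BenchmarkMetaclass(type):
    def __dir__(cls):
        # first pass: asv's discovery sees one track_* entry per doctest
        return ["track__" + h for h in discover_doctests()]

    def __getattr__(cls, name):
        # second pass: asv asks for one of those names; synthesize it on demand
        if name.startswith("track__"):
            doctests = discover_doctests()
            digest = name[len("track__"):]
            if digest in doctests:
                def benchmark(self, _source=doctests[digest]):
                    # the real method re-runs exactly that doctest and returns its timing
                    return 0.0
                return benchmark
        raise AttributeError(name)

class Benchmarks(metaclass=BenchmarkMetaclass):
    pass

# dir(Benchmarks) now advertises one benchmark per doctest, and accessing
# Benchmarks.track__<hash> materializes the corresponding method on the fly.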

embray commented 6 years ago
comment:36

Replying to @embray:

Also, we could group benchmarks by package like this: http://www.astropy.org/astropy-benchmarks/#/

I bet with a bit of hacking we could even get the mouseover that shows the test to actually display the relevant doctest instead. If nothing else, this could be done by generating functions that have the relevant doctest in its docstring :)

Maybe nothing even that complicated. ISTM we can subclass the Benchmark base class (or more specifically ones like TimeBenchmark) and assign the code attribute to whatever we want, which could include the relevant doctest snippet.

saraedum commented 6 years ago
comment:37

Replying to @embray:

Also, we could group benchmarks by package like this: http://www.astropy.org/astropy-benchmarks/#/

That one is a bit tricky to do dynamically. ASV detects packages by looking at the filesystem so you would actually need .py files there. And I would like to have something that works not only on the benchmark machines but also if you invoke asv yourself.

I bet with a bit of hacking we could even get the mouseover that shows the test to actually display the relevant doctest instead. If nothing else, this could be done by generating functions that have the relevant doctest in its docstring :)

Sure, you would just have to set the docstring of the generated methods.

embray commented 6 years ago
comment:38

Replying to @saraedum:

Replying to @embray:

Also, we could group benchmarks by package like this: http://www.astropy.org/astropy-benchmarks/#/

That one is a bit tricky to do dynamically. ASV detects packages by looking at the filesystem so you would actually need .py files there. And I would like to have something that works not only on the benchmark machines but also if you invoke asv yourself.

I think all of that can, and should, be customizable. I also don't believe we should jump through hoops just to generate benchmark instances.

ASV does have the underlying infrastructure for a plugin interface, though unfortunately not much of it is actually implemented yet so as to be useful (much less documented). But I think if there's anything we need to customize in ASV we should do that. There are some things we can do through the API, and we can already do some things through the plugin system via monkey-patching but that's obviously not ideal. Anything else we might need, I think I can get upstream easily enough.

TL;DR this hack is great for demonstration purposes. But let's think about how we need/want to generate benchmarks directly from Sage's doctest collector and go from there, rather than assume we need to be constrained by ASV's existing design.

koffie commented 6 years ago
comment:39

By the way, I just looked around a bit and it doesn't always render as nicely on my laptop. See the partially hidden version numbers at https://www.dropbox.com/s/fio5b4ndj0jh2oz/view.jpeg?dl=0 . It happens both in Chrome and in Safari on OS X.

koffie commented 6 years ago
comment:40

Ok, I just found out it can be solved by making my browser window wide enough that the text __init__.Benchmarks.track__abvar____add____a96f4ce23e82a785e765cee87e42e623 fits on the same line as "Benchmark grid", "Benchmark list" and "Regressions". So I guess it is not that important; sorry for the noise.

embray commented 6 years ago
comment:41

I've been poking around a bit more in the ASV source, and am going to see what I can come up with.

  1. There is a base Benchmark class that I believe we can customize just a little bit for our purposes, and the rest of the code is flexible enough that we should be able to add our custom Benchmark class to the list of known benchmark types (asv.benchmark.benchmark_types). Ideally a plugin would be able to do this without modifying any internal data structures. (In fact, now I'm thinking we may not even need any Benchmark subclasses if our custom discovery code is clever enough...)

  2. The main thing, then, that we need to customize is benchmark discovery. Currently there's no great way to do this and this is where another plugin interface is needed. The current plugin discovery process ultimately yields Benchmark instances which contain all the information needed for a single benchmark test (it also wraps the benchmark function itself--in this case we can either generate a function from the doctest, or use a standard function for running a single doctest). What we need then is a way to extend the benchmark discovery process to allow discovery from arbitrary sources (rather than just searching the file system for .py files and importing them).

  3. Relatedly, there is a function asv.benchmark.get_benchmark_from_name which resolves a unique benchmark name to the relevant test. A plugin needs to be able to extend how benchmarks are searched for by name.

saraedum commented 6 years ago
comment:42

Replying to @embray:

I've been poking around a bit more in the ASV source, and am going to see what I can come up with.

  1. There is a base Benchmark class that I believe we can customize just a little bit for our purposes, and the rest of the code is flexible enough that we should be able to add our custom Benchmark class to the list of known benchmark types (asv.benchmark.benchmark_types). Ideally a plugin would be able to do this without modifying any internal data structures. (In fact, now I'm thinking we may not even need any Benchmark subclasses if our custom discovery code is clever enough...)

  2. The main thing, then, that we need to customize is benchmark discovery. Currently there's no great way to do this and this is where another plugin interface is needed. The current plugin discovery process ultimately yields Benchmark instances which contain all the information needed for a single benchmark test (it also wraps the benchmark function itself--in this case we can either generate a function from the doctest, or use a standard function for running a single doctest). What we need then is a way to extend the benchmark discovery process to allow discovery from arbitrary sources (rather than just searching the file system for .py files and importing them).

  3. Relatedly, there is a function asv.benchmark.get_benchmark_from_name which resolves a unique benchmark name to the relevant test. A plugin needs to be able to extend how benchmarks are searched for by name.

Sure, if you want to change that in ASV that would be quite nice. I don't want to hack on ASV itself, so I'd rather try to get the current version into a slightly more reasonable state and start benchmarking automatically through one of the CIs. This doesn't need to be merged into Sage for that necessarily. Once the modified ASV is ready, we can change the benchmarks to be less of a hack.

jdemeyer commented 6 years ago
comment:43

Replying to @saraedum:

So, what do you think? Should we try to run time_* methods

I don't like this part because it doesn't mix well with doctests. I would really want to write a doctest like

sage: a = something
sage: b = otherthing
sage: c = computation(a, b)  # benchmark this

and being forced to wrap this in a time_ method is just ugly.

saraedum commented 6 years ago
comment:44

embray: https://github.com/airspeed-velocity/asv/issues/481 might be related (though more limited in scope) which talks about customizing benchmark discovery.

saraedum commented 6 years ago
comment:45

Replying to @jdemeyer:

Replying to @saraedum:

So, what do you think? Should we try to run time_* methods

I don't like this part because it doesn't mix well with doctests. I would really want to write a doctest like

sage: a = something
sage: b = otherthing
sage: c = computation(a, b)  # benchmark this

and being forced to wrap this in a time_ method is just ugly.

I see. I think it would be easy to track lines that say, e.g., # benchmark time separately. I am not sure if it's a good idea to add more magic comments to our doctesting. I've nothing against them in general, I am just worried that these features are relatively obscure so not many people are going to use them?

Let me try to start with the benchmarking of blocks that say # long time and add more features later.
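
For illustration, a minimal sketch of picking those lines out with the stdlib doctest parser (quick_computation and slow_computation are placeholders; the real integration would use Sage's own doctest parser, which understands sage: prompts and these markers natively):

import doctest

def long_time_sources(docstring):
    # collect the source of every doctest statement carrying a "# long time" marker
    parser = doctest.DocTestParser()
    return [example.source
            for example in parser.get_examples(docstring)
            if "# long time" in example.source]

sample = '''
EXAMPLES::

    >>> quick_computation()
    1
    >>> slow_computation()   # long time
    42
'''
print(long_time_sources(sample))   # ['slow_computation()   # long time\n']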

nthiery commented 6 years ago
comment:46

Just two cents without having thought too much about it.

I like the # benchmark approach too. It mixes well with how we write doctests and makes it trivial to create new benchmarks / annotate things as useful to benchmark.

I'd rather have a different annotation than # long time; otherwise devs will have to choose between benchmarking a test and having it run always, not just with --long.

Of course, at this stage using # long time is a good way for experimenting. And it may be reasonable to keep benchmarking # long time lines later on.

Thanks!

embray commented 6 years ago
comment:47

Replying to @jdemeyer:

Replying to @saraedum:

So, what do you think? Should we try to run time_* methods

I don't like this part because it doesn't mix well with doctests. I would really want to write a doctest like

sage: a = something
sage: b = otherthing
sage: c = computation(a, b)  # benchmark this

and being forced to wrap this in a time_ method is just ugly.

Yes, something like that could be done. Again, it all comes down to providing a different benchmark discovery plugin for ASV. For discovering benchmarks in our doctests, all lines leading up to a # benchmark line could be considered setup code, with the # benchmark line being the one actually benchmarked (obviously).

Multiple # benchmark tests in the same file would work fine too, with every line prior to it (including other previously benchmarked lines) considered the setup for it.

It might be trickier to do this in such a way that avoids duplication but I'll think about that. I think it could still be done.
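
For illustration, a tiny sketch of that splitting rule (the statement names just mirror the quoted doctest and are purely hypothetical):

def split_benchmark_cases(statements):
    # every "# benchmark" line becomes its own case, with all preceding
    # statements (including earlier benchmarked ones) used as its setup
    cases = []
    for i, line in enumerate(statements):
        if "# benchmark" in line:
            cases.append(("\n".join(statements[:i]), line))
    return cases

statements = [
    "a = something()",
    "b = otherthing()",
    "c = computation(a, b)  # benchmark",
    "d = postprocess(c)  # benchmark",
]
for setup, line in split_benchmark_cases(statements):
    print(line, "uses", len(setup.splitlines()), "setup lines")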

mantepse commented 6 years ago
comment:48

I think that this is wonderful.

Since I tried to improve the performance of certain things recently, and will likely continue to do so, I would like to add doctests for speed regressions right away. Should I use long or benchmark or something else?