pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

TST: Run ASV on Travis? #15035

Closed · max-sixty closed this issue 6 years ago

max-sixty commented 7 years ago

Running ASV locally leaves it up to the pull requester, which means it only gets run occasionally (and is a bit of a burden).

Is there a way to run it on Travis without significantly slowing down the builds? I know CircleCI can skip tests depending on the commit message; is there something similar we could do for Travis, so the benchmarks only run when the commit message contains a string like #run_asv?
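
A minimal sketch of what such an opt-in trigger could look like: a small helper script that a Travis build step could call on every push, which only runs the benchmarks when the commit message asks for it. The #run_asv marker, the script name, and the -f 1.1 regression threshold are illustrative assumptions, not an agreed convention; TRAVIS_COMMIT_MESSAGE is a standard Travis environment variable and asv continuous is a standard asv subcommand.

```python
# run_asv_if_requested.py -- hypothetical helper a .travis.yml step could invoke.
import os
import subprocess
import sys

TRIGGER = "#run_asv"  # commit-message marker (assumption for this sketch)

def main() -> int:
    message = os.environ.get("TRAVIS_COMMIT_MESSAGE", "")
    if TRIGGER not in message:
        print("No benchmark trigger in the commit message; skipping ASV.")
        return 0
    # Compare the PR head against upstream/master within the same Travis run;
    # -f 1.1 reports only benchmarks that changed by more than 10%.
    return subprocess.call(
        ["asv", "continuous", "-f", "1.1", "upstream/master", "HEAD"]
    )

if __name__ == "__main__":
    sys.exit(main())
```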

jreback commented 7 years ago

I think this would be possible, though the running time might be too long (it has to create two envs and run the full suite), and I don't run the full suite very often myself. But yes, this would be nice (and ideally we could have multiple benchmark runs, say versus 0.19.0, 0.18.0, etc.).
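
A rough sketch of that multi-baseline idea, assuming asv is installed and the release tags below exist in the clone; the tag names and the -f 1.1 threshold are illustrative, not pandas' actual configuration.

```python
# Benchmark HEAD against several past releases, one `asv continuous` run per tag.
import subprocess

BASELINE_TAGS = ["v0.18.0", "v0.19.0"]  # baselines to compare against (assumption)

for tag in BASELINE_TAGS:
    # `asv continuous` builds both revisions in fresh environments and reports
    # benchmarks whose timing changed by more than the given factor.
    subprocess.run(["asv", "continuous", "-f", "1.1", tag, "HEAD"], check=False)
```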

We could easily just set up another repo, like dask did recently, to make this pretty automated. (We might actually want to set up a new org, e.g. pandas-dev-benchmarks, because then the Travis runs wouldn't compete with the main pandas builds, but that is a separate issue.)

Note that the actual running of the scripts is here: a set of automated scripts creates an env and runs the suite (this is the part that would go on Travis).

Someone could also scour Travis for tools / examples that do this kind of benchmarking.

Anyone want to give it a whirl?

jorisvandenbossche commented 7 years ago

I also think this should be possible, but indeed computing time may be the biggest problem. Do you know how long the full benchmark suite takes you to run?

I don't think a separate repo is needed for this. I thought the main reason dask put it in a separate repo was to also include distributed benchmarks (i.e. benchmarks not tied to a single package): https://github.com/dask/dask/pull/1738. The advantage of easily including benchmarks with PRs is something we want to keep, IMO. They also have a PR for setting up a cron job: https://github.com/dask/dask-benchmarks/pull/8

If we had an external machine to run perf tests, https://github.com/anderspitman/autobencher could also be interesting (it is used by scikit-bio).

tacaswell commented 7 years ago

From observation, Travis runtimes can be very flaky, which might greatly reduce the value of ASV results.

jreback commented 7 years ago

Are there other services (e.g. CircleCI, maybe) that are 'meant' for benchmarking, as opposed to 'making' Travis work for us?

jorisvandenbossche commented 7 years ago

> From observation, Travis runtimes can be very flaky, which might greatly reduce the value of ASV results.

The question then is whether this variability is mainly between runs, or also within a single run. Differences between runs are not necessarily a problem for this use case, since the benchmark would always compare against master within the same Travis run. But I can certainly imagine that this can be flaky as well.

For tracking full benchmark results over time, this will indeed be a problem. But for that, another option would be a separate machine (spend some money on it, or share infrastructure with other projects: https://github.com/dask/dask-benchmarks/issues/3#issuecomment-258282051).

pv commented 7 years ago

Re: continuous benchmarking

In my experience, you get good enough benchmark stability even from the cheapest dedicated server (~100€/year). One caveat is that these can have low-end CPUs, which behave differently from higher-end models on some performance benchmarks (e.g. memory bandwidth issues). It's also fairly straightforward to set up a cron job (e.g. inside a VM or other sandboxing) on your own desktop machine. The results can easily be hosted on GitHub etc., so the machine does not need to be publicly visible.
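
A minimal sketch of that cron-driven setup, assuming an asv-configured checkout on the benchmarking machine; the script name and schedule are assumptions, while `asv run NEW`, `asv publish`, and `asv gh-pages` are standard asv subcommands.

```python
# nightly_asv.py -- what a cron job on a dedicated box or desktop VM might run.
import subprocess

def nightly_benchmark() -> None:
    # Benchmark only commits that have no results yet ("NEW" is an asv range alias).
    subprocess.run(["asv", "run", "NEW"], check=True)
    # Rebuild the static HTML report and push it to the repository's gh-pages
    # branch, so the machine itself never needs to be publicly reachable.
    subprocess.run(["asv", "publish"], check=True)
    subprocess.run(["asv", "gh-pages"], check=True)

if __name__ == "__main__":
    nightly_benchmark()
```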

In practice, stability is also less important for continuous benchmarking over time, and more important for asv continuous. The reason is that CPU performance stepping and system load contribute low-frequency noise (variation on long time scales). This averages towards zero for continuous benchmarking, where benchmark runs are separated by long time intervals; in contrast, the rapid measurement in asv continuous takes samples over a short time interval and cannot average away the slow noise.

I don't know a good solution for benchmarking PRs, however. The benchmark suites often take too long to run for Travis, and the results are too unreliable.

TomAugspurger commented 6 years ago

Closing this since we have a dedicated machine for this.