WillAyd opened this issue 4 years ago
The timeit documentation does not apply as-is; see here for details: https://asv.readthedocs.io/en/stable/benchmarks.html#timing-benchmarks
On October 22, 2019 5:59:32 PM UTC, William Ayd notifications@github.com wrote:
TLDR - I think we need to cap our benchmarks at a maximum of 0.2 seconds. That's a long way off though, so I think we should start with a cap of 1 second per benchmark.
Right now we have some very long running benchmarks:
https://pandas.pydata.org/speed/pandas/#summarylist?sort=1&dir=desc
I haven't seen a definitive answer, but I think ASV leverages the built-in timeit functionality to figure out how long a given benchmark should run.
https://docs.python.org/3.7/library/timeit.html#command-line-interface
Quoting what I think is important:
If -n is not given, a suitable number of loops is calculated by trying successive powers of 10 until the total time is at least 0.2 seconds.
So IIUC a particular statement is executed n times (where n is a power of 10) to the point where it reaches 0.2 seconds to run, and then is repeated `repeat` times to get a reading. `asv continuous` would do this 4 times (2 runs for each commit being compared). In Python 3.6 `repeat` is 3 (we currently pin ASVs to 3.6), but in future versions that gets bumped to 5.

We have a handful of benchmarks that take 20s apiece to run, so if we stick to the 3.6 timing these statements would run n=1 times, repeated 3 times per benchmark session, 4 sessions per continuous run: 20s * 3 repeats * 4 sessions = 4 minutes for one benchmark alone.
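The loop-count selection described above can be sketched with the stdlib directly. `timeit.Timer.autorange` (Python 3.6+) implements roughly this "grow until at least 0.2 seconds" logic (its exact step sequence differs slightly from pure powers of 10), and `repeat` then re-runs that measurement:

```python
import timeit

# Sketch of the timeit behavior quoted above: when no loop count is given,
# autorange() grows `number` until one measurement totals >= 0.2 seconds.
timer = timeit.Timer("sum(range(100))")
number, total = timer.autorange()
print(f"number={number}, per-loop={total / number:.2e}s")

# Each of `repeat` passes then re-runs the statement `number` times;
# Python 3.6's default repeat was 3, later bumped to 5.
samples = timer.repeat(repeat=3, number=number)
print(len(samples), min(samples))
```

For a statement that already takes 20 seconds, `number` stays at 1 and the repeats alone dominate the runtime, which is where the 4-minutes-per-benchmark figure comes from.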
rolling.Apply.time_rolling is a serious offender here, so I think we can start with that. We would take community PRs to improve performance of any of these, though maybe we should prioritize anything currently taking over 1 second.
cc @qwhelan and @pv who may have additional insights
Thanks for the link - reading through it definitely gives more guidance.
So if we track something that itself takes more than 10 milliseconds to run, do you know the number of times it is run within a sample? The documentation mentions that asv selects a `number` by approximating how many runs it will take to reach the `sample_time`, but it's not clear what happens if one run exceeds `sample_time` altogether.

Alternatively, do you have thoughts here on general best practices? Right now our benchmarks are pretty slow (e.g. running the groupby module alone takes over an hour).
If it takes longer than `sample_time`, `number = 1`. You probably want to adjust `repeat`, as the default `(2, 10, 20.0)` runs until 10 samples are collected or 20 seconds elapse --- you can e.g. make the max time shorter.
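For reference, a minimal sketch of how these knobs are set on an asv timing benchmark; the class name, values, and timed workload here are illustrative, not pandas' actual benchmark code:

```python
import random
import statistics


# Hypothetical asv benchmark class showing the tuning attributes discussed:
# asv reads `repeat`, `number`, and `sample_time` as class-level settings.
class TimeRollingMean:
    # (min_repeat, max_repeat, max_time_seconds): stop after 5 samples
    # or 10 seconds total, whichever comes first.
    repeat = (1, 5, 10.0)
    # One invocation per sample, as also happens automatically whenever a
    # single run already exceeds sample_time.
    number = 1
    # Target duration of one sample, in seconds.
    sample_time = 0.25

    def setup(self):
        self.data = [random.random() for _ in range(10_000)]

    def time_rolling_mean(self):
        # The timed statement: a toy moving average over fixed windows.
        window = 100
        [statistics.fmean(self.data[i:i + window])
         for i in range(0, len(self.data) - window, window)]
```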
@WillAyd It appears there's a few issues:

- The benchmark takes 20s per run; there's not much asv can do as long as that's the case.
- It's being run over 48 parameter combinations (half fast/half slow).
- These two factors mean 8 minutes for a n=1 run (24 * 20s), so it's slow and noisy.
- The pydata speed site is using an older version of asv that includes memory addresses in run names: https://pandas.pydata.org/speed/pandas/#rolling.Apply.time_rolling?p-function=%3Cbuilt-in%20function%20sum%3E&p-function=%3Cfunction%20sum%20at%200x7f39b3ee6bf8%3E&p-function=%3Cfunction%20Apply.%3Clambda%3E%20at%200x7f399f5f0510%3E&p-window=1000&p-contructor='DataFrame'&p-raw=True&p-dtype='float'
- This means history is being lost, as the names probably don't match across runs.
- You can tell because the number of plotted lines is far less than the number in the legend; those probably were all produced in a single run.
- Upgrade asv to a version that includes https://github.com/airspeed-velocity/asv/pull/771

I'll submit a PR shortly that pares down the test size so each iteration runs in under a second.
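The 48-combination figure comes from the cross product of the benchmark's parameter lists. A quick sketch with a hypothetical grid shaped like `rolling.Apply`'s (the exact axes and values here are assumptions) shows how the count multiplies:

```python
from itertools import product

# Hypothetical parameter grid shaped like rolling.Apply's; asv runs the
# benchmark once per combination, so trimming any one axis multiplies
# into large time savings across the whole matrix.
params = {
    "constructor": ["DataFrame", "Series"],   # 2
    "window": [10, 1000],                     # 2
    "dtype": ["int", "float"],                # 2
    "function": ["sum", "mean", "median"],    # 3
    "raw": [True, False],                     # 2
}
combos = list(product(*params.values()))
print(len(combos))  # 2 * 2 * 2 * 3 * 2 = 48
```

Dropping a single axis (say, `raw`) would halve the whole matrix, which is why paring down parameters is as effective as speeding up any one case.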
I'll get the asv updated in the env running these.
@TomAugspurger lmk if you need help with that; might not be a bad idea to refresh knowledge on that env