WillAyd opened this issue 4 years ago
The timeit documentation does not apply as-is; see here for details: https://asv.readthedocs.io/en/stable/benchmarks.html#timing-benchmarks
On October 22, 2019 5:59:32 PM UTC, William Ayd notifications@github.com wrote:
TLDR - I think we need to cap our benchmarks at a maximum of 0.2 seconds. That's a long way off though, so I think we should start with a cap of 1 second per benchmark.
Right now we have some very long running benchmarks:
https://pandas.pydata.org/speed/pandas/#summarylist?sort=1&dir=desc
I haven't seen a definitive answer, but I think ASV leverages the built-in timeit functionality to figure out how long a given benchmark should run.
https://docs.python.org/3.7/library/timeit.html#command-line-interface
Quoting what I think is important:
If -n is not given, a suitable number of loops is calculated by trying successive powers of 10 until the total time is at least 0.2 seconds.
So IIUC a particular statement is executed n times (where n is a power of 10) to the point where it reaches 0.2 seconds to run, and then is repeated `repeat` times to get a reading. `asv continuous` would do this 4 times (2 runs for each commit being compared). In Python 3.6 `repeat` is 3 (we currently pin ASVs to 3.6), but in future versions that gets bumped to 5.

We have a handful of benchmarks that take 20s apiece to run, so if we stick to the 3.6 timing these statements would run n=1 times, repeated 3 times per benchmark session, 4 sessions per continuous run: 20s * 3 repeats * 4 sessions = 4 minutes for one benchmark alone.
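The loop-count selection described above can be sketched with the stdlib directly. `timeit.Timer.autorange` (Python 3.6+) implements roughly this "grow until at least 0.2 seconds" logic (its exact step sequence differs slightly from pure powers of 10), and `repeat` then re-runs that measurement:

```python
import timeit

# Sketch of the timeit behavior quoted above: when no loop count is given,
# autorange() grows `number` until one measurement totals >= 0.2 seconds.
timer = timeit.Timer("sum(range(100))")
number, total = timer.autorange()
print(f"number={number}, per-loop={total / number:.2e}s")

# Each of `repeat` passes then re-runs the statement `number` times;
# Python 3.6's default repeat was 3, later bumped to 5.
samples = timer.repeat(repeat=3, number=number)
print(len(samples), min(samples))
```

For a statement that already takes 20 seconds, `number` stays at 1 and the repeats alone dominate the runtime, which is where the 4-minutes-per-benchmark figure comes from.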
rolling.Apply.time_rolling is a serious offender here, so I think we can start with that. We would take community PRs to improve performance of any of these, though maybe we should prioritize anything currently taking over 1 second.
cc @qwhelan and @pv who may have additional insights
Thanks for the link - reading through it definitely gives more guidance.
So if we track something that itself takes more than 10 milliseconds to run, do you know the number of times it is run within a sample? The documentation mentions that asv selects a `number` by approximating how many runs it will take to reach the `sample_time`, but it's not clear what happens if one run exceeds `sample_time` altogether.

Alternatively, do you have thoughts here on general best practices? Right now our benchmarks are pretty slow (e.g. running the groupby module alone takes over an hour).
If it takes longer than `sample_time`, `number = 1`. You probably want to adjust `repeat`, as the default `(2, 10, 20.0)` runs until 10 samples are collected or 20 seconds elapse --- you can e.g. make the max time shorter.
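For reference, a minimal sketch of how these knobs are set on an asv timing benchmark; the class name, values, and timed workload here are illustrative, not pandas' actual benchmark code:

```python
import random
import statistics


# Hypothetical asv benchmark class showing the tuning attributes discussed:
# asv reads `repeat`, `number`, and `sample_time` as class-level settings.
class TimeRollingMean:
    # (min_repeat, max_repeat, max_time_seconds): stop after 5 samples
    # or 10 seconds total, whichever comes first.
    repeat = (1, 5, 10.0)
    # One invocation per sample, as also happens automatically whenever a
    # single run already exceeds sample_time.
    number = 1
    # Target duration of one sample, in seconds.
    sample_time = 0.25

    def setup(self):
        self.data = [random.random() for _ in range(10_000)]

    def time_rolling_mean(self):
        # The timed statement: a toy moving average over fixed windows.
        window = 100
        [statistics.fmean(self.data[i:i + window])
         for i in range(0, len(self.data) - window, window)]
```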
@WillAyd It appears there's a few issues:

- The benchmark takes 20s per run; there's not much asv can do as long as that's the case.
- It's being run over 48 parameter combinations (half fast/half slow).
- These two factors mean 8 minutes for a n=1 run (24 * 20s), so it's slow and noisy.
- The pydata speed site is using an older version of asv that includes memory addresses in run names: https://pandas.pydata.org/speed/pandas/#rolling.Apply.time_rolling?p-function=%3Cbuilt-in%20function%20sum%3E&p-function=%3Cfunction%20sum%20at%200x7f39b3ee6bf8%3E&p-function=%3Cfunction%20Apply.%3Clambda%3E%20at%200x7f399f5f0510%3E&p-window=1000&p-contructor='DataFrame'&p-raw=True&p-dtype='float'
- This means history is being lost, as the names probably don't match across runs.
- You can tell because the number of plotted lines is far less than the number in the legend; those probably were all produced in a single run.
- Upgrade asv to a version that includes https://github.com/airspeed-velocity/asv/pull/771

I'll submit a PR shortly that pares down the test size so each iteration runs in under a second.
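The 48-combination figure comes from the cross product of the benchmark's parameter lists. A quick sketch with a hypothetical grid shaped like `rolling.Apply`'s (the exact axes and values here are assumptions) shows how the count multiplies:

```python
from itertools import product

# Hypothetical parameter grid shaped like rolling.Apply's; asv runs the
# benchmark once per combination, so trimming any one axis multiplies
# into large time savings across the whole matrix.
params = {
    "constructor": ["DataFrame", "Series"],   # 2
    "window": [10, 1000],                     # 2
    "dtype": ["int", "float"],                # 2
    "function": ["sum", "mean", "median"],    # 3
    "raw": [True, False],                     # 2
}
combos = list(product(*params.values()))
print(len(combos))  # 2 * 2 * 2 * 3 * 2 = 48
```

Dropping a single axis (say, `raw`) would halve the whole matrix, which is why paring down parameters is as effective as speeding up any one case.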
I'll get the asv updated in the env running these.
@TomAugspurger lmk if you need help with that; might not be a bad idea to refresh knowledge on that env