Closed EwoutH closed 1 year ago
Should be easy to replace the data with a smaller dataset.
Performance profile:
Looks like matplotlib's autoscale_view
is called a lot of times, which seems to be the most expensive operation.
For the test itself, smaller dataset would indeed work.
No idea why autoscale_view is called so often. There are 63 calls to plot_individual_cdf, so that is close 400 calls to autoscale per cdf (this is not the only place where autoscale might be called, but likely the dominant one). It's also strange to autoscale because at least the x-axis has a clearly defined hard-coded limit. It seems this arises from somewhere deep within matplotlib.
Here the tree:
It seems that from plot_discrete_cdf
(called 6 times) inner
is the first function that's called that 24024 times, all the way to autoscale_view
.
The calls to scatter in plot_discrete_cdf
might be redundant and could be replaced by using marker in the plot command. Would need some checking, however.
Just ran a quick test. With scatter enabled timeit gives
52 s ± 732 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
With scatter disabled, and replaced with the marker kwarg in plot we get
621 ms ± 3.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So, disabling scatter gives us orders of magnitude improvement, and no discernable difference in the visual.
That's 2 orders of magnitude, or a 99% reduction in runtime. Seems worth it.
Are there any figures where the scatter is likely required to provide a correct/useful visual? If so we should test those.
I did a visual comparison. It is really marginal.
Then the speedup sounds worth it!
This test takes extremely long, in the order of minutes. I wouldn't be surprised if it's over half of all testing time.
https://github.com/quaquel/EMAworkbench/blob/b76b487afc5fbaaa34e5a055269657ff15fa908e/test/test_analysis/test_regional_sa.py#L14-L23