cmbant closed this 2 months ago
All modified and coverable lines are covered by tests :white_check_mark:
Please upload report for BASE (master@660e0fa). Learn more about missing BASE report.
Really looks good to me. I did the time benchmark as suggested and I'm getting the same time factors.

One remark, a bit unrelated to this PR: I had a look into the _fast_chi_square function of cobaya, which most of the time ends up calling camb.mathutils.chi_squared (if camb is installed), but if camb is not installed the default way to compute $\chi^2$ is https://github.com/CobayaSampler/cobaya/blob/master/cobaya/likelihoods/base_classes/DataSetLikelihood.py#L27

Regarding the time benchmark, we noticed some time ago that the dot product was slightly slower than the @ product introduced in Python 3.5. So we might save a few microseconds in case someone uses mflike (or any likelihood relying on _fast_chi_square) without camb installed.
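For context, a minimal sketch of the fallback pattern being described (not the exact cobaya code; see the linked line for the real implementation), assuming the usual convention of passing the inverse covariance and the data-minus-theory residual:

```python
import numpy as np

# Sketch only: prefer camb's optimized routine when available,
# otherwise fall back to a plain numpy quadratic form.
try:
    from camb.mathutils import chi_squared  # exploits symmetry of covinv
except ImportError:
    def chi_squared(covinv, delta):
        # delta^T covinv delta
        return covinv.dot(delta).dot(delta)

# usage (hypothetical names): chi2 = chi_squared(inv_cov, data - theory)
```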
Interesting, I had assumed they mapped to identical code. If I do this in Jupyter (for n=9000), if anything the .dot form seems to be faster for me:
%timeit v @ cov @ v
%timeit cov.dot(v).dot(v)
%timeit v @ cov @ v
%timeit cov.dot(v).dot(v)
11.4 ms ± 222 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
10.4 ms ± 63.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
11.2 ms ± 154 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
10.6 ms ± 196 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
CAMB's version gets the extra factor of 2 from symmetry.
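For reference, a rough sketch of how the above comparison could be reproduced outside Jupyter, assuming a symmetric n=9000 covariance; the camb call is only there if camb happens to be installed:

```python
import numpy as np
from timeit import timeit

n = 9000
rng = np.random.default_rng(0)
a = rng.standard_normal((n, n))
cov = a @ a.T                      # symmetric test "covariance"
v = rng.standard_normal(n)

# the two quadratic-form variants timed above
print("@   :", timeit(lambda: v @ cov @ v, number=100))
print("dot :", timeit(lambda: cov.dot(v).dot(v), number=100))

# camb's routine, which exploits the symmetry of cov
try:
    from camb.mathutils import chi_squared
    print("camb:", timeit(lambda: chi_squared(cov, v), number=100))
except ImportError:
    pass
```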
I also thought they mapped to the same piece of code. In your test the cov matrix is very sparse (hence the purpose of this PR). If you do the test on the same 9000x9000 matrix with random numbers, then @ is much faster than dot. I'm not sure I understand the difference between dot and matmul, which corresponds to the @ implementation: https://numpy.org/doc/stable/reference/generated/numpy.matmul.html#numpy.matmul
Odd, I see the same with random numbers and matmul or @. (Maybe it changed in numpy 2, or it depends on threading?)
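To help pin down where the difference comes from, a sketch of the dense random-number check together with the environment details that could matter (numpy version and BLAS/threading configuration); the specific diagnostics are only my suggestion, not something from this PR:

```python
import numpy as np
from timeit import timeit

print(np.__version__)
np.show_config()                    # which BLAS backend (and threading) numpy uses

n = 9000
rng = np.random.default_rng(1)
cov = rng.standard_normal((n, n))   # dense random matrix, no sparsity
v = rng.standard_normal(n)

print("@     :", timeit(lambda: v @ cov @ v, number=100))
print("dot   :", timeit(lambda: cov.dot(v).dot(v), number=100))
print("matmul:", timeit(lambda: np.matmul(np.matmul(v, cov), v), number=100))
```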
At least for the moment, the weights are mostly zero since the bandpowers are localized, so all those elements of the matrix multiplication are not needed. This refactor speeds it up by a factor of about 8 on my laptop, hopefully enough to make dragging sampling useful. It could of course be further improved by putting the whole operation into numba with openmp.
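As an illustration of the kind of saving involved (a hypothetical sketch, not the actual mflike or refactored code): each bandpower window is nonzero only over a short ell range, so the product can be restricted to that slice instead of running over the full matrix row.

```python
import numpy as np

def band_average(windows, spectrum):
    """Hypothetical illustration: apply bandpower windows to a spectrum,
    skipping the zero part of each (mostly-zero, localized) window row
    instead of doing the full dense matrix-vector product."""
    out = np.empty(windows.shape[0])
    for i, w in enumerate(windows):
        nz = np.flatnonzero(w)
        if nz.size == 0:
            out[i] = 0.0
            continue
        lo, hi = nz[0], nz[-1] + 1   # nonzero support of this window
        out[i] = w[lo:hi] @ spectrum[lo:hi]
    return out
```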
Now the remaining computing time is split roughly equally between _get_foreground_model and the rest. To see where time was spent, I did