Open Serge3leo opened 2 months ago
Hi @Serge3leo,
thanks for the bug report! If I understand your proposed changes correctly, then you have `os=0`, where you first bin the results and then reduce them in a second step.

The speed improvements are certainly impressive! I would conjecture that this is less because of the "unrolling" and instead because you are reducing the data dependency in the sum (so it can parallelize the whole thing). We could certainly add that special case!
Are you compiling this with some extra flags (say, `-march=native`)?
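To make the "bin first, then reduce" idea above concrete, here is a minimal pure-Python sketch. It is not the patch under discussion: the function names and the `bins` parameter are hypothetical, and Python itself gains no speed from this layout. It only shows the dependency structure that a C compiler or CPU could exploit, since each bin depends only on its own previous value.

```python
def sum_single_chain(values):
    """Plain loop: every addition depends on the result of the previous one."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_binned(values, bins=4):
    """Accumulate into `bins` independent partial sums, then reduce them."""
    partial = [0.0] * bins
    n_main = len(values) - len(values) % bins
    # Step 1: each bin only depends on its own previous value,
    # so the `bins` addition chains are independent of each other.
    for i in range(0, n_main, bins):
        for b in range(bins):
            partial[b] += values[i + b]
    # Leftover elements that do not fill a whole group of bins.
    for v in values[n_main:]:
        partial[0] += v
    # Step 2: reduce the bins to a single result.
    total = 0.0
    for p in partial:
        total += p
    return total
```

Because the additions happen in a different order, the two functions can differ in the last bits for general floating-point input; they agree exactly when every intermediate sum is representable.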
Configuration: xprec-1.4.2 on Intel macOS Sonoma

Benchmark: `ndarray[ddouble] + ndarray[float64]`; `ndarray[ddouble] += ndarray[float64]`; `ndarray[float64]` asddouble
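For context on what the benchmarked addition involves per element: adding a `float64` to a double-double value is commonly built on error-free transformations (TwoSum and FastTwoSum). The sketch below is a pure-Python illustration of that general technique. The function names are hypothetical, and it is not claimed to match how xprec implements `u_added()`.

```python
def two_sum(a, b):
    """Knuth's TwoSum: return (s, e) with s = fl(a + b) and s + e = a + b exactly."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e


def fast_two_sum(a, b):
    """Dekker's FastTwoSum: same result as two_sum, but assumes |a| >= |b|."""
    s = a + b
    e = b - (s - a)
    return s, e


def dd_add_double(hi, lo, y):
    """Add the double y to the double-double value (hi, lo)."""
    s, e = two_sum(hi, y)      # exact sum of the leading parts
    e += lo                    # fold in the low part of the double-double
    return fast_two_sum(s, e)  # renormalize so the low part stays small
```

For example, `dd_add_double(1.0, 0.0, 1e-20)` keeps the small term in the low part instead of rounding it away, which a plain `1.0 + 1e-20` in `float64` would do.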
The profiler shows `u_added()` as a problem.

As an experiment to evaluate possible performance improvements, I made the following changes:
Preliminary assessment of the change
What do you think about this problem and possible changes?