Open gha3mi opened 9 months ago
Brilliant! Your project looks awesome, I'll take a look at that!
Thanks for sharing it!
Thanks! I've been thinking about a tool and a place to test Fortran fpm packages, not with the intention of competing, but with the aim of improving the packages.
That's a very good initiative! How are you thinking about proceeding? Would you like PRs to centralize the benchmarks and try to have them published with a GitHub Action?
I forked your project to try it out. I managed to get results with gfortran, but I'm hitting a few dependency issues with ifort and ifx.
I thought that having a companion sphinx-gallery would be a good way of having the plots neatly organized.
> That's a very good initiative, how are you thinking about proceeding? would you like PRs to centralize the benchmarks and try to have them published with a GitHub Action?
Yes, exactly. I think this may be the easiest way to get the results.
> I forked your project to try it out, I managed to get results with gfortran but I'm hitting a few dependency issues with ifort and ifx.
Each benchmark has an index. In your test, I noticed that the last one, for `fprod_kahan`, has the same number, 6, as `fprod`:

```fortran
call bench%start_benchmark(7, 'kahan', "a = fprod_kahan(u, v)", [p]) ! here 6 -> 7
```
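As background on what the `kahan` benchmark measures: Kahan-compensated summation carries a correction term that recovers the low-order bits lost in each addition. A minimal Python sketch (the function names are illustrative, not the ForBenchmark API):

```python
def kahan_sum(values):
    """Kahan-compensated summation: recovers low-order bits lost per addition."""
    total = 0.0
    c = 0.0  # running compensation for lost low-order bits
    for x in values:
        y = x - c            # apply the compensation to the next term
        t = total + y        # low-order bits of y may be lost here...
        c = (t - total) - y  # ...recover them algebraically
        total = t
    return total

def dot_kahan(u, v):
    """Dot product accumulated with Kahan summation."""
    return kahan_sum(a * b for a, b in zip(u, v))

# Ten copies of 0.1 do not sum to exactly 1.0 with naive accumulation:
print(sum([0.1] * 10))        # 0.9999999999999999
print(kahan_sum([0.1] * 10))  # 1.0
```

The compensated version costs a few extra flops per element, which is exactly why it is interesting to benchmark against the plain `dot_product`.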
Here are the flags I used for each compiler: fpm.rsp. I used LAPACK and BLAS. I also ran a dot benchmarking test using GitHub Actions here. The last step, creating a pull request, fails; it needs some work. However, the benchmarks with gfortran, ifort, ifx, and nvfortran work. Could it be an issue with LAPACK and BLAS in your case?
> I thought that having a companion sphinx-gallery would be a good way of having the plots neatly organized.
It looks great; however, I am not familiar with it yet. I will take a look. If you could provide it, that would be great.
> `! here 6 -> 7`
Oops, that was a typo; fixed.
Yes, I saw the dependencies and managed to install BLAS/LAPACK for running with gfortran. But for the Intel compilers I had to add a bunch of stuff (https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl-link-line-advisor.html), and I still have other compile errors and no time to actually solve them :s ... I'll continue later and take another look... Also, I was thinking that the benchmarks would be more interesting with -O3 instead of -Ofast, which even compiler developers do not recommend: https://fortran-lang.discourse.group/t/is-ofast-in-gfortran-or-fast-flag-in-intel-fortran-safe-to-use/2755/4
Regarding setting up a Sphinx-Gallery connected to a project: here they have an example and describe how to connect it with a source project.

For inspiration, I always look at PyVista; they have a GitHub repo with all the sources and a secondary repo that is automatically fed (https://github.com/pyvista/pyvista/tree/main/doc) ... something like this should work with a `ForBenchmark-doc` repo :)
> Yes, I saw the dependencies and managed to install BLAS/LAPACK for running with gfortran. But for the Intel compilers I had to add a bunch of stuff (https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl-link-line-advisor.html), and I still have other compile errors and no time to actually solve them :s ... I'll continue later and take another look... Also, I was thinking that the benchmarks would be more interesting with -O3 instead of -Ofast, which even compiler developers do not recommend: https://fortran-lang.discourse.group/t/is-ofast-in-gfortran-or-fast-flag-in-intel-fortran-safe-to-use/2755/4
You can also use `-qmkl` instead of `-llapack` and `-lblas`. By the way, you are right; I will replace `-Ofast` with `-O3`.
> Regarding setting up a sphinx gallery connected to a project, here they have an example and describe how to connect it with a source project. For inspiration, I always look at PyVista; they have a GitHub repo with all the sources and a secondary repo that is automatically fed (https://github.com/pyvista/pyvista/tree/main/doc) ... something like this should work with a `ForBenchmark-doc` repo :)
Alright, I will take a look at it.
Thank you! If you find the time, you can send a pull request for the `dot_product` or any other implementations.
Perfect! If you get started with that, here are a few dependencies that I use for Sphinx projects:

```shell
pip install numpydoc pydata-sphinx-theme sphinxcontrib-bibtex jupyter_sphinx sphinx_panels pythreejs
```

- `numpydoc` for the documentation style within the Python scripts
- `pydata-sphinx-theme` gives the white theme used by PyVista
- `sphinxcontrib-bibtex` enables adding a .bib file that can be used to cite stuff within the project à la LaTeX
- `jupyter_sphinx`, `sphinx_panels`, `pythreejs` for integrating Jupyter-notebook-like stuff
> You can also use `-qmkl` instead of `-llapack` and `-lblas`
So this worked; I had to comment out `link = ["lapack", "blas"]` in the fpm.toml.

A couple of questions: How do you measure the speedup? It seems like the ratio is inverted when I look at the plots and the values in the data. I didn't check where you compute it, but I would have expected something like `speed-up = time_reference / time_new_method`, such that a speed-up > 1 implies faster. But this is not what I saw with the dot products.

Is the reference value systematically the benchmark placed in first place?
> So this worked; I had to comment out `link = ["lapack", "blas"]` in the fpm.toml.

Perfect!
> You can also use `-qmkl` instead of `-llapack` and `-lblas`
>
> So this worked; I had to comment out `link = ["lapack", "blas"]` in the fpm.toml.
>
> A couple of questions: How do you measure the speedup? It seems like the ratio is inverted when I look at the plots and values in the data. I didn't check where you compute it, but I would have expected something like `speed-up = time_reference / time_new_method`, such that a speed-up > 1 implies faster. But this is not what I saw with the dot products. Is the reference value systematically the benchmark placed in first place?
You can find it here: link to the code. Yes, you are right; it is currently inverted. Thank you for pointing it out. Feel free to send a pull request (PR) if you have time; otherwise I will change it as soon as possible.
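For clarity, the convention requested above, where a speed-up greater than 1 means the method is faster than the reference, is simply:

```python
def speedup(time_reference, time_new_method):
    """Speed-up of a method relative to the reference timing.

    With this convention a value > 1 means the new method is faster
    (it took less time than the reference).
    """
    return time_reference / time_new_method

# Example: reference takes 2.0 s, the new method takes 0.5 s.
print(speedup(2.0, 0.5))  # 4.0 -> four times faster
# The inverted ratio would misleadingly report 0.25 for the same case.
print(0.5 / 2.0)          # 0.25
```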
> Is the reference value systematically the benchmark placed in first place?
Yes, exactly. I tried to provide an example demo with some comments here: link to the demo.
The results change quite a bit from one run to another, for instance here with ifort and the following flags: `-O3 -mtune=native -xHost -qmkl -qopenmp -ipo -DINT64`. Just two subsequent runs of the bench:
This is very nice and interesting; I think that from a statistical point of view it is more than acceptable. I was just wondering, then, how the actual time of the function could be extracted from the intermediate operations included to avoid excessive optimization. This time also changes the ratio, as `ratio = (time_ref + C)/(time_i + C)` is then influenced by that constant. Maybe an internal measurement in the loop should be done to capture these lines and subtract their time from the time captured by the `bench` object?
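To see how a constant per-iteration overhead C masks the real difference between two kernels, here is a small numeric illustration (the timings are invented for the example):

```python
def measured_ratio(time_ref, time_i, overhead):
    """Ratio actually observed when both timings include a constant overhead C."""
    return (time_ref + overhead) / (time_i + overhead)

# Hypothetical pure kernel times: the true ratio is 2.0.
time_ref, time_i = 10e-6, 5e-6
for overhead in (0.0, 5e-6, 50e-6):
    print(overhead, measured_ratio(time_ref, time_i, overhead))
# As the overhead grows, the observed ratio collapses from 2.0 toward 1.0,
# hiding the real difference between the two kernels.
```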
Oh, just saw: the label of the abscissa should be updated to the method name?
I am working on speeding up plots. I will write to you again here.
I'm wondering if something like this could help to get a clearer view?

```fortran
call bench%start_benchmark(1, 'dot_product', 'a = dot_product(u,v)', [p])
time = 0._rk                     !> a variable defined as time(0:1)
do nl = 1, bench%nloops
   time(0) = time(0) + timer()   !> timer(): a function pointer using the selected method
   u = u + real(nl,rk)           ! to prevent the compiler from optimizing (loop-invariant)
   v = v + real(nl,rk)           ! to prevent the compiler from optimizing (loop-invariant)
   time(1) = time(1) + timer()
   a = dot_product(u,v)
end do
call bench%stop_benchmark(cmp_gflops, extract_time=time(1)-time(0)) !> proposed optional argument to subtract
                                                                    !! time not associated with the benchmarked function
```
I changed the speed-up plot to a plot over all problem sizes:

I also tried to plot the average weighted speed-up; however, I'm not sure whether this provides valuable insight:
> The results change quite a bit from one run to another.
I think there are many factors, such as CPU temperature, other processes running during benchmarking, different random numbers, ... However, I updated the code to use the same random numbers consistently.
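The reproducible-input idea can be sketched in Python (illustrative only; the Fortran code would do the equivalent with its own RNG and a fixed seed):

```python
import random

def make_inputs(n, seed=12345):
    """Generate the same pseudo-random benchmark inputs on every run."""
    rng = random.Random(seed)  # local generator: no global state involved
    return [rng.random() for _ in range(n)]

a = make_inputs(5)
b = make_inputs(5)
print(a == b)  # True: identical inputs across runs, so timing differences
               # cannot be blamed on different data
```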
> This time also changes the ratio, as `ratio = (time_ref + C)/(time_i + C)` is then influenced by that constant.
I noticed this before. But if you measure this time, you have to account for the cost of the timer function itself! In my opinion, for large problem sizes this can simply be neglected. Or maybe calculate it once outside the benchmarking object and then subtract it.
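Measuring the timer's own cost once, outside the benchmark loop, might look like this sketch (Python's `time.perf_counter` stands in for the Fortran timer; the numbers depend on the machine):

```python
import time

def timer_overhead(samples=100_000):
    """Estimate the average cost of one timer call by timing back-to-back calls."""
    t0 = time.perf_counter()
    for _ in range(samples):
        time.perf_counter()
    t1 = time.perf_counter()
    return (t1 - t0) / samples

overhead = timer_overhead()
print(f"one perf_counter() call costs ~{overhead:.3e} s")
# Subtract `overhead * timer_calls_per_loop` from each measured loop time once,
# instead of re-timing the timer inside every benchmark.
```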
[edited:] Please check the latest results for the dot product generated by the GitHub Actions workflow: https://github.com/gha3mi/forbenchmark/tree/main/benchmarks/dot
Excellent! These results are very interesting! I'll push a version as is, though locally I had to:

- remove `link = ["lapack", "blas"]` from the fpm.toml (when running with ifort and ifx)
- swap `-llapack -lblas` for `-qmkl` for both ifort and ifx in the fpm.rsp
- use `-flto=full` instead of `-ipo`

(I haven't tested nvfortran, as I have to clean up my install.)

I tried something:
```fortran
time = 0._rk
call bench%start_benchmark(7, 'kahan', "a = fprod_kahan(u,v)", [p])
do nl = 1, bench%nloops
   time(0) = timer()
   u = u + real(nl,rk) ! to prevent the compiler from optimizing (loop-invariant)
   v = v + real(nl,rk) ! to prevent the compiler from optimizing (loop-invariant)
   time(1) = time(1) + timer() - time(0)
   a = fprod_kahan(u,v)
end do
call bench%stop_benchmark(cmp_gflops)
print *, 'inner time: ', time(1)/bench%nloops
...
real(8) function timer() result(y)
   call cpu_time(y)
end function
```

And got results along the lines of:

```text
Meth.: kahan; Des.: a = fprod_kahan(u,v) ; Argi.:100000
Elapsed time : 0.000060600 [s]
Speedup      : 0.987 [-]
Performance  : 1.650 [GFLOPS]
inner time:    5.958200000000069E-005
```
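For reference, a GFLOPS figure is just a flop count divided by elapsed time; the convention matters, since a length-n dot product can be counted as n or 2n flops (n multiplies plus n adds). This sketch is not the actual `cmp_gflops` implementation:

```python
def gflops(flop_count, elapsed_seconds):
    """Throughput in GFLOPS given a flop count and an elapsed time."""
    return flop_count / elapsed_seconds / 1e9

# For a length-n dot product, 2*n is a common flop count; whether a harness
# counts n or 2*n changes the reported number by a factor of two.
n = 100_000
print(gflops(2 * n, 6.06e-5))  # using the elapsed time from the run above
```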
So basically most of the time is actually spent in the two lines added to defeat the optimizer, and the dot product itself is almost transparent! Maybe it would be better to test with larger arrays or to split the loop differently.
This might be more representative:

```fortran
a = 0._rk
do nl = 1, bench%nloops
   a = a + dot_product(u,v)
end do
call bench%stop_benchmark(cmp_gflops)
print *, a
```
The cumulative variable plus the `print` forces the compiler to actually compute the correct value rather than optimize the loop away. Removed m2 as it was stagnating:
[plots: ifort, ifx, gfortran]
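Python won't dead-code-eliminate a loop the way an optimizing Fortran compiler can, but the accumulate-then-print pattern translates directly (a sketch, not the ForBenchmark API; `bench_dot` is a hypothetical helper):

```python
import time

def bench_dot(u, v, nloops):
    """Time nloops dot products.

    The accumulated result is returned (and should be printed by the caller)
    so an optimizing implementation cannot discard the work -- the same trick
    as the cumulative variable plus print in the Fortran snippet.
    """
    acc = 0.0
    t0 = time.perf_counter()
    for _ in range(nloops):
        acc += sum(a * b for a, b in zip(u, v))
    elapsed = time.perf_counter() - t0
    return acc, elapsed / nloops

u = [0.5] * 1000
v = [2.0] * 1000
acc, per_loop = bench_dot(u, v, 100)
print(acc, per_loop)  # printing acc keeps the computation observable
```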
Thanks! I merged your PR. Today was busy; I'll take a look at the last messages later.
Hi,

I'm currently working on the ForBenchmark project and have generated some results for the `dot_product` here. If you are interested, you can add your `dot_product` implementation to this benchmark. Fpm makes it easy to include it as a dependency, and a Python script will generate the results.

Best, Ali