jalvesz / fast_math

A fortran library of fast functions

Benchmark #8

Open gha3mi opened 9 months ago

gha3mi commented 9 months ago

Hi,

I'm currently working on the ForBenchmark project and have generated some results for the dot_product here. If you are interested, you can add your dot_product implementation to this benchmark. Fpm makes it easy to include it as a dependency, and a Python script will generate the results.
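
For example, something along these lines in the benchmark's fpm.toml should be enough to pull fast_math in (just a sketch; the exact URL or tag may need adjusting):

[dependencies]
fast_math = { git = "https://github.com/jalvesz/fast_math.git" }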

Best, Ali

jalvesz commented 9 months ago

Brilliant! Your project looks awesome, I'll take a look at that!

Thanks for sharing it!

gha3mi commented 9 months ago

Thanks! I've been thinking about a tool and a place to test Fortran fpm packages, not with the intention of competing, but with the aim of improving the packages.

jalvesz commented 9 months ago

That's a very good initiative, how are you thinking about proceeding? Would you like PRs to centralize the benchmarks and try to have them published with a GitHub Action?

I forked your project to try it out; I managed to get results with gfortran, but I'm hitting a few dependency issues with ifort and ifx.

I thought that having a companion sphinx-gallery would be a good way of having the plots neatly organized.

gha3mi commented 9 months ago

That's a very good initiative, how are you thinking about proceeding? Would you like PRs to centralize the benchmarks and try to have them published with a GitHub Action?

Yes, exactly. I think this may be the easiest way to get the results.

I forked your project to try it out; I managed to get results with gfortran, but I'm hitting a few dependency issues with ifort and ifx.

Each benchmark has an index. In your test, I noticed that the last one, for fprod_kahan, has the same index, 6, as fprod. It should be:

call bench%start_benchmark(7, 'kahan', "a = fprod_kahan(u, v)", [p]) ! here 6 -> 7

Here are the flags I used for each compiler: fpm.rsp. I used LAPACK and BLAS. I also ran a dot benchmarking test using GitHub Actions here. The last step, creating a pull request, still fails and needs some work, but the benchmarks with gfortran, ifort, ifx, and nvfortran work. Could it be an issue with LAPACK and BLAS in your case?

I thought that having a companion sphinx-gallery would be a good way of having the plots neatly organized.

It looks great, although I am not familiar with it yet; I will take a look. If you could provide it, that would be great.

jalvesz commented 9 months ago

! here 6 -> 7

Oops, that was a typo; fixed.

Yes, I saw the dependencies and managed to install blas/lapack for running with gfortran. But for the Intel compilers I had to add a bunch of options from https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl-link-line-advisor.html and I still have other compile errors and no time to actually solve them :s ... I'll continue later on and take another look... Also, I was thinking that the benchmarks would be more interesting with -O3 instead of -Ofast, which even compiler developers do not recommend: https://fortran-lang.discourse.group/t/is-ofast-in-gfortran-or-fast-flag-in-intel-fortran-safe-to-use/2755/4

Regarding setting up a sphinx gallery connected to a project, here they have an example and describe how to connect it with a source project.

For inspiration, I always look at PyVista: they have a GitHub repo with all the sources and a secondary repo that is automatically fed, https://github.com/pyvista/pyvista/tree/main/doc ... something like this should work with a ForBenchmark-doc repo :)

gha3mi commented 9 months ago

Yes, I saw the dependencies and managed to install blas/lapack for running with gfortran. But for the Intel compilers I had to add a bunch of options from https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl-link-line-advisor.html and I still have other compile errors and no time to actually solve them :s ... I'll continue later on and take another look... Also, I was thinking that the benchmarks would be more interesting with -O3 instead of -Ofast, which even compiler developers do not recommend: https://fortran-lang.discourse.group/t/is-ofast-in-gfortran-or-fast-flag-in-intel-fortran-safe-to-use/2755/4

You can also use -qmkl instead of -llapack and -lblas. By the way, you are right; I will replace -Ofast with -O3.
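
For example, a run with MKL could then look something like this (just a sketch; the exact fpm options and flags depend on the setup):

fpm run --compiler ifort --flag "-O3 -qmkl"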

Regarding setting up a sphinx gallery connected to a project, here they have an example and describe how to connect it with a source project.

For inspiration, I always look at PyVista: they have a GitHub repo with all the sources and a secondary repo that is automatically fed, https://github.com/pyvista/pyvista/tree/main/doc ... something like this should work with a ForBenchmark-doc repo :)

Alright, I will take a look at it.

Thank you! If you find the time, you can send a pull request for the dot_product or any other implementations.

jalvesz commented 9 months ago

Perfect, if you get started with that, here are a few dependencies that I use for sphinx projects:

pip install numpydoc pydata-sphinx-theme sphinxcontrib-bibtex jupyter_sphinx sphinx_panels pythreejs

numpydoc: for the documentation style within the Python scripts
pydata-sphinx-theme: gives the white theme used by PyVista
sphinxcontrib-bibtex: enables adding a .bib file that can be used to cite references within the project à la LaTeX
jupyter_sphinx, sphinx_panels, pythreejs: for integrating jupyter-notebook-like content

jalvesz commented 9 months ago

You can also use -qmkl instead of -llapack and -lblas

So this worked; I had to comment out this line in the fpm.toml: link = ["lapack", "blas"]

A couple of questions: How do you measure the speedup? It seems like the ratio is inverted when I look at the plots and the values in the data. I didn't check where you compute it, but I would have expected something like speed-up = time_reference / time_new_method, such that a speed-up > 1 implies faster. But this is not what I saw with the dot products.

Is the reference value systematically the benchmark placed in first position?

gha3mi commented 9 months ago

So this worked; I had to comment out this line in the fpm.toml: link = ["lapack", "blas"]

perfect!

A couple of questions: How do you measure the speedup? It seems like the ratio is inverted when I look at the plots and the values in the data. I didn't check where you compute it, but I would have expected something like speed-up = time_reference / time_new_method, such that a speed-up > 1 implies faster. But this is not what I saw with the dot products.

Is the reference value systematically the benchmark placed in first position?

You can find it here: link to the code. Yes, you are right; it is currently inverted. Thank you for pointing it out. Feel free to send a pull request (PR) if you have time, or I will change it as soon as possible.
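
In other words, the fix is to use the convention you describe, something like this (a rough sketch with made-up numbers, not the actual ForBenchmark code):

real(8) :: elapsed(4) = [2.0d-6, 1.5d-6, 1.0d-6, 4.0d-6] ! hypothetical elapsed times, method 1 = reference
real(8) :: speedup(4)
speedup = elapsed(1) / elapsed ! speed-up > 1 means the method is faster than the reference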

Is the reference value systematically the benchmark placed in first position?

Yes, exactly. I tried to provide an example demo with some comments here: link to the demo.
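
Roughly, a demo like the following (using the bench interfaces shown elsewhere in this thread; names and arguments are illustrative) makes the method registered first the reference for all the speed-ups:

call bench%start_benchmark(1, 'dot_product', 'a = dot_product(u,v)', [p]) ! index 1 -> reference
do nl = 1, bench%nloops
   a = dot_product(u, v)
end do
call bench%stop_benchmark(cmp_gflops)

call bench%start_benchmark(2, 'fprod', 'a = fprod(u,v)', [p]) ! compared against index 1
do nl = 1, bench%nloops
   a = fprod(u, v)
end do
call bench%stop_benchmark(cmp_gflops)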

jalvesz commented 9 months ago

The results change quite a bit from one run to another, for instance here with ifort and the following flags -O3 -mtune=native -xHost -qmkl -qopenmp -ipo -DINT64, just two subsequent runs of the bench:

[plots: dot_ifort_speedup, two subsequent runs]

This is very nice and interesting; I think that from a statistical point of view it is more than acceptable. I was just wondering, then, how the actual time of the function could be separated from the intermediate operations included to avoid excessive optimization. This extra time also changes the ratio, since ratio = (time_ref + C)/(time_i + C) is influenced by that constant C. Maybe an internal measurement in the loop should be done to capture these lines and subtract it from the time captured by the bench object?
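
To put made-up numbers on it: if the reference takes 2 µs, the other method 1 µs, and the anti-optimization lines add C = 4 µs per iteration, the true speed-up of 2/1 = 2 is measured as (2 + 4)/(1 + 4) = 1.2, so a fixed overhead pushes every ratio towards 1.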

Oh, I just saw that the label of the abscissa should be updated to the method name?

gha3mi commented 9 months ago

I am working on speeding up plots. I will write to you again here.

jalvesz commented 9 months ago

I'm wondering if something like this could help to have a clearer view:

call bench%start_benchmark(1,'dot_product','a = dot_product(u,v)',[p])
time = 0._rk !> a variable defined as time(0:1)
do nl = 1,bench%nloops
  time(0) = time(0) + timer() !> a function pointer using the selected method
  u = u + real(nl,rk) ! to prevent compiler from optimizing (loop-invariant)
  v = v + real(nl,rk) ! to prevent compiler from optimizing (loop-invariant)
  time(1) = time(1) + timer() 
  a = dot_product(u,v)
end do
call bench%stop_benchmark(cmp_gflops , extract_time = time(1)-time(0) ) !> an optional variable to extract time 
!! from the analysis that is not associated with the function that is being benchmarked

?

gha3mi commented 9 months ago

I changed the speed-up plot to include all problem sizes: [plot: dot_ifort_speedup]

I also tried to plot the average weighted speed-up; however, I'm not sure if this provides valuable insights: [plot: dot_ifort_speedup_avg]

The results change quite a bit from one run to another.

I think there are many factors, such as the temperature of the CPU, other processes running during benchmarking, different random numbers, ... However, I updated the code to always use the same random numbers.
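
Something along these lines (a generic sketch, not necessarily how ForBenchmark does it) fixes the seed so that every run works on the same random data:

integer :: n
integer, allocatable :: seed(:)
call random_seed(size=n)
allocate(seed(n)); seed = 123456789
call random_seed(put=seed)
call random_number(u)
call random_number(v)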

This extra time also changes the ratio, since ratio = (time_ref + C)/(time_i + C) is influenced by that constant C.

I noticed this before. But if you measure this time, you also need to account for the cost of the timer function itself! In my opinion, for large problem sizes this can simply be neglected. Or maybe measure it once outside the benchmarking object and then subtract it.
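
For instance, the timer overhead could be estimated once, outside the bench object, with something like this (just a sketch based on cpu_time):

real(8) :: t0, t1, overhead
integer :: i
integer, parameter :: nrep = 10000
call cpu_time(t0)
do i = 1, nrep
   call cpu_time(t1)
end do
overhead = (t1 - t0) / real(nrep, 8) ! rough average cost of a single cpu_time call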

[edited:] Please check the latest results for the dot product generated by the GitHub Actions workflow: https://github.com/gha3mi/forbenchmark/tree/main/benchmarks/dot

jalvesz commented 9 months ago

[edited:] Please check the latest results for the dot product generated by the GitHub Actions workflow: https://github.com/gha3mi/forbenchmark/tree/main/benchmarks/dot

Excellent! These results are very interesting! I'll push a version as is, though locally I had to:

jalvesz commented 9 months ago

I tried something:

time = 0._rk ! time(0:1): start time / accumulated time of the anti-optimization lines
call bench%start_benchmark(7,'kahan', "a = fprod_kahan(u,v)",[p])
do nl = 1,bench%nloops
   time(0) = timer()
   u = u + real(nl,rk) ! to prevent compiler from optimizing (loop-invariant)
   v = v + real(nl,rk) ! to prevent compiler from optimizing (loop-invariant)
   time(1) = time(1) + timer() - time(0)
   a = fprod_kahan(u,v)
end do
call bench%stop_benchmark(cmp_gflops)
print *, 'inner time: ', time(1)/bench%nloops
...
real(8) function timer() result(y)
   call cpu_time(y)
end function

And got results in the lines of

Meth.: kahan; Des.: a = fprod_kahan(u,v) ; Argi.:100000
 Elapsed time :     0.000060600 [s]
 Speedup      :  0.987 [-]
 Performance  :  1.650 [GFLOPS]

 inner time:  5.958200000000069E-005

So basically most of the time is actually spent in the two lines avoiding the optimization, and the dot product itself is almost transparent!

Maybe it would be better to test with larger arrays or to split the loop in a different way.

jalvesz commented 9 months ago

This might be more representative:

a = 0._rk
call bench%start_benchmark(1,'dot_product','a = a + dot_product(u,v)',[p]) ! start as in the other benchmarks
do nl = 1,bench%nloops
   a = a + dot_product(u,v)
end do
call bench%stop_benchmark(cmp_gflops)
print *, a

The cumulative variable plus the print forces the compiler not to optimize the computation away, since the correct value has to be printed. Removed m2 as it was stagnating:

ifort: [plot: dot_ifort_speedup_avg]

ifx: [plot: dot_ifx_speedup_avg]

gfortran: [plot: dot_gfortran_speedup_avg]

gha3mi commented 9 months ago

Thanks! I merged your PR. Today was busy; I'll take a look at the last messages later.