npadmana / DistributedFFT


Run UPC NPB-FT benchmark and compare to our results #33

npadmana closed this issue 4 years ago

npadmana commented 5 years ago

It appears that Cori used the UPC version of the NPB-FT benchmark as part of its RFP. The page is no longer linked from the NERSC website, but a little digging finds it here:

https://web.archive.org/web/20190130104503/http://www.nersc.gov/users/computational-systems/cori/nersc-8-procurement/trinity-nersc-8-rfp/nersc-8-trinity-benchmarks/npb-upc-ft

@ronawho -- is it worth trying to run this version? According to some of the papers, it should be a more optimized version of the reference benchmark.

ronawho commented 5 years ago

Yeah, makes sense to me. For our core benchmarks, we try to compare against the best reference benchmark.

npadmana commented 5 years ago

On swan, case D:

| Nodes (x32 cores) | UPC time (s) | UPC MFlops |
|---|---|---|
| 32 | 16.33 | 548778 |
| 64 | 8.67 | 1.03372e+06 |

~Looks about 2-3x faster than the Chapel version. Interestingly, the MFlops reported is lower -- so we should check that we are comparing apples to apples.~
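A quick way to sanity-check the two UPC rows is to compute the strong-scaling speedup and parallel efficiency from 32 to 64 nodes. If both runs solve the same fixed-size problem, the MFlops ratio should track the inverse of the time ratio. This is a minimal sketch using only the numbers in the table; the names `runs` and `scaling` are illustrative.

```python
# Strong-scaling check for the UPC case D runs on swan (values from the table).
runs = {32: (16.33, 548778.0), 64: (8.67, 1.03372e06)}  # nodes -> (seconds, MFlops)


def scaling(base_nodes, nodes):
    """Return (speedup, parallel efficiency) going from base_nodes to nodes."""
    t_base, _ = runs[base_nodes]
    t, _ = runs[nodes]
    speedup = t_base / t
    efficiency = speedup / (nodes / base_nodes)
    return speedup, efficiency


speedup, eff = scaling(32, 64)
# For a fixed amount of work, MFlops = work / time, so this ratio should match the speedup.
mflops_ratio = runs[64][1] / runs[32][1]
print(f"speedup {speedup:.3f}, efficiency {eff:.1%}, MFlops ratio {mflops_ratio:.3f}")
```

With these numbers the speedup is about 1.88 (roughly 94% parallel efficiency), and the MFlops ratio matches it, which is what you would expect if both rows do the same total work.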

npadmana commented 5 years ago

Hold on... class D in the UPC benchmark does not appear to be the same as class D in the Chapel benchmark. Hmm...

ronawho commented 5 years ago

You're just trying to give me a new performance target to obsess over :)

npadmana commented 5 years ago

Just ran it again (CLASS=DD in the UPC benchmark is equivalent to case D). The timings are above.

This ran with the default Cray CC optimizations (I think -O2).
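One way to check that CLASS=DD really matches case D is to plug the class D parameters (a 2048 x 1024 x 1024 grid, 25 iterations) into the flop-count formula used to report MFlops and compare the result with the UPC output. The formula below is our recollection of the one in the reference NPB FT source (`ft.f`); treat it as an assumption, and `npb_ft_mflops` is a hypothetical helper name.

```python
import math


def npb_ft_mflops(nx, ny, nz, niter, seconds):
    """MFlop/s for NPB FT; the formula is ASSUMED to match the reference ft.f."""
    ntotal = float(nx * ny * nz)
    total_mflop = 1.0e-6 * ntotal * (
        14.8157 + 7.19641 * math.log(ntotal)
        + (5.23518 + 7.21113 * math.log(ntotal)) * niter
    )
    return total_mflop / seconds


# NPB class D: 2048 x 1024 x 1024 grid, 25 iterations.
CLASS_D = (2048, 1024, 1024, 25)

for seconds, reported in [(16.33, 548778.0), (8.67, 1.03372e06)]:
    predicted = npb_ft_mflops(*CLASS_D, seconds)
    print(f"{seconds:6.2f} s: predicted {predicted:,.0f} vs reported {reported:,.0f} MFlops")
```

With these parameters the predicted rates land within about 0.03% of the reported values, a difference explained by the times being rounded to two decimals. A mismatched problem size would instead show up as a large, constant ratio between predicted and reported MFlops.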

ronawho commented 5 years ago

Going from CCE 8 to CCE 9 switched from the Cray proprietary compiler (now known as cce-classic) to a clang-based compiler. I'm not very familiar with the new clang-based version, so I'm not sure what its default optimization level is.

npadmana commented 5 years ago

Explicitly compiling with -O3 dropped the times to ~15.3 s (32 nodes) and ~8.3 s (64 nodes). It would be nice to run these sometime before PAW. If they look very different from the reference benchmark, we could add another curve. But I think we're good to go.
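To put the -O3 change in perspective, here is the relative reduction in runtime at each node count, using the default-optimization times from the table and the approximate -O3 times quoted here. A minimal sketch; the dictionary names are illustrative.

```python
# Runtimes in seconds, keyed by node count (32 cores per node).
default_opt = {32: 16.33, 64: 8.67}  # default Cray optimization level
explicit_o3 = {32: 15.3, 64: 8.3}    # explicit -O3; quoted as approximate ("~")

gains = {}
for nodes in sorted(default_opt):
    # Fraction of runtime saved by compiling with -O3.
    gains[nodes] = 1.0 - explicit_o3[nodes] / default_opt[nodes]
    print(f"{nodes} nodes: -O3 takes {gains[nodes]:.1%} less time")
```

This gives roughly a 6% reduction at 32 nodes and about 4% at 64 nodes, so the compiler flags matter much less than getting the problem class right.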

ronawho commented 4 years ago

https://www.nersc.gov/assets/pubs_presos/SCTutorialPGAS2012.pdf has some useful slides on the UPC optimizations that enable non-blocking (NB) communication. Several communication strategies are also listed (slab, packed slab).

npadmana commented 4 years ago

Another reference -- https://people.eecs.berkeley.edu/~demmel/cs267_Spr12/Lectures/lecture25_FFT_jwd12.ppt

ronawho commented 4 years ago

I think this is done with https://github.com/npadmana/DistributedFFT/pull/57

(I added our recent references to https://github.com/npadmana/DistributedFFT/issues/32)