lanl / benchmarks

Benchmarks
BSD 3-Clause "New" or "Revised" License

Rebaseline UMT on Roci-HBM #46

Closed gshipman closed 7 months ago

gshipman commented 1 year ago

After Anthony completes rebaseline of Sparta, rebaseline using same methodology on Roci-HBM

gshipman commented 7 months ago

@aaroncblack just checking in, are you getting close to rebaselining with the new Mesh generator on RZWhippet? If so, we can get runs going on Roci.

gshipman commented 7 months ago

@aaroncblack , for conduit it looks like I need a newer CMake than what is specified here: https://lanl.github.io/benchmarks/06_umt/umt.html

+ cmake /usr/projects/eap/users/gshipman/benchmarks/umt/umt_workspace/build_conduit/../conduit/src -DCMAKE_INSTALL_PREFIX=/usr/projects/eap/users/gshipman/benchmarks/umt/umt_workspace/install -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCMAKE_Fortran_COMPILER=ifx -DMPI_CXX_COMPILER=mpicxx -DMPI_Fortran_COMPILER=mpifort -DBUILD_SHARED_LIBS=OFF -DENABLE_TESTS=OFF -DENABLE_EXAMPLES=OFF -DENABLE_DOCS=OFF -DENABLE_FORTRAN=ON -DENABLE_MPI=ON -DENABLE_PYTHON=OFF
CMake Error at CMakeLists.txt:6 (cmake_minimum_required):
  CMake 3.21 or higher is required.  You are running version 3.20.4
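A pre-flight check like the following could catch this before configuring conduit (a hedged sketch; the `version_ge` helper is hypothetical, and 3.21.0 is the minimum from the error above):

```shell
# version_ge A B — succeeds when version A >= version B (uses sort -V).
version_ge() {
    [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# conduit now requires CMake >= 3.21; check the CMake on PATH before configuring.
have="$(cmake --version 2>/dev/null | head -n1 | awk '{print $3}')"
if version_ge "${have:-0}" "3.21.0"; then
    echo "CMake ${have} is new enough"
else
    echo "CMake ${have:-(not found)} is too old; need >= 3.21.0" >&2
fi
```

On systems using environment modules, loading a newer CMake module before running the build script would satisfy the check.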
gshipman commented 7 months ago

And I'm getting a failure during benchmark execution; it appears to be a Teton driver / Conduit driver version mismatch.

aaroncblack commented 7 months ago

@gshipman thanks. Looks like they just released conduit 0.9.0 yesterday (skipped a 0.8.9 version) and bumped their minimum requirements to C++14 and CMake 3.21. I will update the UMT docs.

gshipman commented 7 months ago

@aaroncblack , cool. Could you check the version mismatch issue? Maybe the build script needs to check out a specific tag?

aaroncblack commented 7 months ago

@gshipman Yes, I'll check that. We were depending on their 'develop' branch, but with 0.9.0 now I can switch to a released version which is much better.
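Pinning the dependency to a release tag rather than a moving branch might look like this (a sketch; the `clone_pinned` helper is hypothetical, and the `v0.9.0` tag reflects the conduit release mentioned above):

```shell
# Hypothetical helper: clone a dependency at a pinned release tag instead of
# tracking a moving branch like 'develop'.
clone_pinned() {
    repo="$1"; tag="$2"; dest="$3"
    git clone --branch "$tag" "$repo" "$dest" &&
    ( cd "$dest" && git submodule update --init --recursive )
}

# Usage (network required; conduit vendors BLT as a submodule):
# clone_pinned https://github.com/LLNL/conduit.git v0.9.0 conduit
```

A pinned tag also makes the baseline reproducible: everyone building the benchmark gets the same conduit sources.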

aaroncblack commented 7 months ago

@gshipman I pushed up a couple fixes to UMT, would you mind trying the 'develop' branch?

gshipman commented 7 months ago

@aaroncblack Got it, ran with:

for n in 1 8 32 56 88 112; do srun -n $n /usr/projects/eap/users/gshipman/benchmarks/umt/umt_workspace/install/bin/test_driver -c 10 -B local -d 8,8,0 --benchmark_problem 2 |& tee umt.$n.out; done

See attached. umt.tgz

aaroncblack commented 7 months ago

@gshipman That was a quick smoke test, right? It demonstrates a weak scaling run ('-B local' sets the local partition size per rank) on a very small 2D mesh (~66 MB per rank), but the output looks as I'd expect.

For the actual benchmark and baselining you'll want "-B global", which gives a global mesh size that is automatically partitioned across your MPI ranks, and you should crank up the mesh dimensions (-d x,y,z).

gshipman commented 7 months ago

@aaroncblack see attached. Results on Roci HBM (xRoads in the open), P1 is 14^3, P2 is 33^3. SPP1 14^3: spp1_strong_scaling_roci SPP2 33^3: spp2_strong_scaling_roci

33d3p2.tgz 14d3p1.tgz

aaroncblack commented 7 months ago

@gshipman Those look consistent with the scaling behavior I see on our LLNL Intel cluster (2.0 GHz Intel Xeons, 112 cores, 256 GB non-HBM RAM).

I see a local decrease in the problem #1 scaling graph at 88 ranks, and I expect vendors to see similar local fluctuations in performance as they scale up the number of ranks.

UMT's algorithm requires more iterations to converge as the mesh is decomposed over an increasing number of MPI ranks. On my local cluster I see increasing throughput as I scale up, until the solver tolerance is exceeded and an additional iteration is needed to converge. Continuing to scale up the number of ranks recovers that performance, and throughput continues to improve until the tolerance is exceeded again.
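That sawtooth pattern can be illustrated with a toy parallel-efficiency calculation, efficiency(n) = throughput(n) / (n × throughput(1)) (all throughput numbers below are invented for illustration; the dip at 88 ranks mimics the one seen in the problem #1 graph):

```shell
# Toy strong-scaling data (invented): an extra solver iteration shows up as a
# local dip in parallel efficiency, which recovers at higher rank counts.
t1=100   # invented single-rank throughput
for pair in "1 100" "8 760" "32 2900" "56 4800" "88 6900" "112 9400"; do
    set -- $pair
    n=$1; tn=$2
    eff=$(awk -v n="$n" -v tn="$tn" -v t1="$t1" 'BEGIN { printf "%.2f", tn / (n * t1) }')
    echo "ranks=$n efficiency=$eff"
done
```

With these numbers the efficiency drops at 88 ranks (0.78) and recovers at 112 (0.84), matching the qualitative behavior described above.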

I'm thinking I should add a blurb about this to the benchmark docs so it doesn't catch a vendor off guard.

gshipman commented 7 months ago

@aaroncblack Excellent. I noted that for P1 the top-end performance is higher on the HBM part; I think that makes sense if it is more memory bound. Do you concur?

I will update the docs to reflect this data.

aaroncblack commented 7 months ago

Yes.

There's a mix of performance bounds across the kernels, but in general:

P1 has a larger number of energy bins, and the loops over energy bins are where our vectorization is. I expect P1 to exhibit more of a memory-bandwidth-bound nature because the SIMD should be able to utilize the bandwidth better.

P2 is only vectorized over 16 energy groups, and I expect more of a memory-latency-bound nature.

gshipman commented 7 months ago

@aaroncblack @richards12 UMT Crossroads data is live, see: https://lanl.github.io/benchmarks/06_umt/umt.html Thx!