Closed by gshipman 7 months ago
@aaroncblack just checking in, are you getting close to rebaselining with the new Mesh generator on RZWhippet? If so, we can get runs going on Roci.
@aaroncblack , for conduit it looks like I need a newer CMake than what is specified here: https://lanl.github.io/benchmarks/06_umt/umt.html
+ cmake /usr/projects/eap/users/gshipman/benchmarks/umt/umt_workspace/build_conduit/../conduit/src -DCMAKE_INSTALL_PREFIX=/usr/projects/eap/users/gshipman/benchmarks/umt/umt_workspace/install -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCMAKE_Fortran_COMPILER=ifx -DMPI_CXX_COMPILER=mpicxx -DMPI_Fortran_COMPILER=mpifort -DBUILD_SHARED_LIBS=OFF -DENABLE_TESTS=OFF -DENABLE_EXAMPLES=OFF -DENABLE_DOCS=OFF -DENABLE_FORTRAN=ON -DENABLE_MPI=ON -DENABLE_PYTHON=OFF
CMake Error at CMakeLists.txt:6 (cmake_minimum_required):
CMake 3.21 or higher is required. You are running version 3.20.4
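A quick way to catch this before configuring is to compare the CMake on PATH against Conduit's new minimum with a `sort -V` version check. This is a sketch; the `current` value is hard-coded here for illustration, and in practice you'd take it from `cmake --version`:

```shell
# Sketch: check whether the available CMake meets Conduit's 3.21 minimum.
# 'current' is hard-coded for illustration; normally parse `cmake --version`.
required=3.21
current=3.20.4
# sort -V orders version strings; if the smallest is not 'required',
# then 'current' is older than the minimum.
if [ "$(printf '%s\n' "$required" "$current" | sort -V | head -n1)" != "$required" ]; then
  echo "cmake $current is too old; need >= $required"
fi
```

On a cluster the usual fix is loading a newer site-provided module rather than building CMake yourself.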
And I'm getting a failure during the benchmark execution; it appears to be a Teton driver / Conduit driver version mismatch.
@gshipman thanks. Looks like they just released conduit 0.9.0 yesterday (skipped a 0.8.9 version) and bumped their minimum requirements to C++14 and CMake 3.21. I will update the UMT docs.
@aaroncblack , cool. Could you check the version mismatch issue? Maybe the build script needs to check out a specific tag?
@gshipman Yes, I'll check that. We were depending on their 'develop' branch, but with 0.9.0 now I can switch to a released version which is much better.
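Pinning the checkout to a release tag would look roughly like the sketch below. The clone command is echoed rather than executed here, and the `v0.9.0` tag name is an assumption based on the release mentioned above:

```shell
# Sketch: pin the Conduit checkout to the 0.9.0 release instead of tracking
# 'develop'. Echoed, not executed; the v0.9.0 tag name is an assumption.
tag=v0.9.0
echo git clone --recursive --branch "$tag" --depth 1 https://github.com/LLNL/conduit.git
```

Tracking a fixed tag keeps the benchmark build reproducible even if upstream `develop` changes.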
@gshipman I pushed up a couple fixes to UMT, would you mind trying the 'develop' branch?
@aaroncblack Got it, ran with:
for n in 1 8 32 56 88 112; do srun -n $n /usr/projects/eap/users/gshipman/benchmarks/umt/umt_workspace/install/bin/test_driver -c 10 -B local -d 8,8,0 --benchmark_problem 2 |& tee umt.$n.out; done
See attached. umt.tgz
@gshipman That was for a quick smoke test, right? That's demonstrating a weak scaling run ('-B local' sets the local partition size per rank) on a very small 2D mesh (~66 MB per rank). But the output looks as I'd expect.
For the actual benchmark and baselining you'll want "-B global", which specifies a global mesh size that is automatically partitioned across your MPI ranks; then crank up the mesh dimensions (-d x,y,z).
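Adapting the earlier smoke-test loop to a "-B global" baselining sweep might look like the dry-run sketch below. The srun commands are echoed, not executed; the 14,14,14 dimensions match the 14^3 SPP1 size reported later in the thread, and the driver path is the one used earlier:

```shell
# Dry-run sketch of a "-B global" baselining sweep: commands are echoed,
# not executed. Dimensions 14,14,14 follow the 14^3 SPP1 size; the driver
# path is the install location used earlier in this thread.
driver=/usr/projects/eap/users/gshipman/benchmarks/umt/umt_workspace/install/bin/test_driver
for n in 1 8 32 56 88 112; do
  echo srun -n $n $driver -c 10 -B global -d 14,14,14 --benchmark_problem 1
done
```

With "-B global" the mesh size stays fixed as ranks increase, so this sweep measures strong scaling rather than the weak scaling of the "-B local" smoke test.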
@aaroncblack see attached: results on Roci HBM (xRoads in the open). P1 is 14^3, P2 is 33^3. [SPP1 14^3 and SPP2 33^3 plots attached.]
@gshipman Those look consistent with the scaling behavior I see on our LLNL intel cluster (2.0 GHz Intel Xeons, 112 cores, 256GB non-HBM RAM).
I see a local decrease in the problem #1 scaling graph at 88 ranks, and I expect vendors to see local fluctuations in performance as they scale up the # of ranks.
UMT's algorithm will require more iterations to converge as the mesh is decomposed over an increasing # of mpi ranks. On my local cluster I see increasing throughput as you scale up, until the solver tolerance is exceeded and it needs an additional iteration to converge. If you continue to scale up the # ranks, you'll recover that performance and continue to improve throughput, until the tolerance is exceeded again.
I'm thinking I should add a blurb about this to the benchmark docs so it doesn't catch a vendor off-guard.
@aaroncblack Excellent. I noted that for P1 the top-end performance is higher on the HBM part; I think that makes sense if P1 is more memory-bandwidth bound. Do you concur?
I will update the docs to reflect this data.
Yes.
There's a mix of performance bounds in the kernels, but in general:
P1 has a higher number of energy bins, and the loops over our energy bins are where our vectorization is. I expect P1 to exhibit more of a memory-bandwidth-bound nature because the SIMD should be able to utilize the bandwidth better.
P2 is only vectorized over 16 energy groups, and I expect more of a memory-latency-bound nature.
@aaroncblack @richards12 UMT Crossroads data is live, see: https://lanl.github.io/benchmarks/06_umt/umt.html Thx!
After Anthony completes the rebaseline of Sparta, rebaseline UMT using the same methodology on Roci-HBM.