
Caliper Results #85

Open · dmageeLANL opened this issue 6 months ago

dmageeLANL commented 6 months ago

Post Caliper Results here for LLNL.

dmageeLANL commented 6 months ago

All points in the email from @pearce8 (Monday, February 12, 2024 at 15:37) are satisfied.

caliper_amg.tar.gz

dmageeLANL commented 6 months ago

caliper_amg.tar.gz

New caliper amg with caliper-libs flag

dmageeLANL commented 6 months ago

caliper_amg_Noon.tar.gz amg_build.log caliper_build.log

OK, here's another small caliper_amg run. You can see the configs in the out.amg... file. I'm including the build logs for amg and Caliper too. One thing that might be messing things up is that I have to build Caliper and Adiak with GCC and the apps with Intel.

Perhaps I could build amg/hypre with GCC as well, but Branson and Parthenon highly recommend using Intel. Caliper can't be built with Intel on roci (and I really tried; not 100% sure about XRDS) because the Intel compiler doesn't put the filesystem library and headers in the right place: its default is C++14, and even turning on C++17 doesn't give a configuration where it works. GCC's default is C++17, and CCE builds fine with C++17 turned on.
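For reference, a quick way to tell whether a given compiler/standard-library combination provides the `<filesystem>` support Caliper's build needs is to try compiling a tiny test like the one below with `-std=c++17`. This is only an illustration of the missing-header issue described above, not part of Caliper's build system:

```cpp
// Hypothetical stand-alone check for the <filesystem> issue described above:
// if this compiles and links with -std=c++17, the toolchain provides the
// std::filesystem support that Caliper's build expects.
#include <filesystem>
#include <iostream>

int main() {
    // Print the current working directory using std::filesystem.
    std::cout << std::filesystem::current_path().string() << "\n";
    return 0;
}
```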

Would building as a shared lib change anything?

dmageeLANL commented 6 months ago

Successful Parthenon Caliper run, with a significantly reduced problem size.

caliper_parthenon.tar.gz

daboehme commented 6 months ago

Hi @dmageeLANL,

Looking at the Parthenon logs, it seems the code creates a lot of threads (441 in this run, to be exact). I suspect it's creating/destroying OS threads in a loop instead of using a fixed thread pool. That is a problem, since Caliper keeps a fair amount of per-thread data around until flush and/or program exit, which would explain why it runs out of memory.

Currently the only way around this is to not put Caliper annotations on the sub-threads; instead, put annotations only on the main thread. A problem here might be MPI if it's called from the sub-threads. You can try running without the profile.mpi option and see if that avoids the problem. If only some MPI functions are called from sub-threads, you can also limit the instrumented MPI functions with the mpi.include option, e.g. profile.mpi,mpi.include="MPI_Allgather,MPI_Allgatherv,MPI_Allreduce". This might help if the collective calls only happen on the main thread. It might be a good idea to do this anyway, since there are a lot of short nonblocking functions (Iprobe, Test, Isend/Irecv, etc.) that probably introduce a lot of instrumentation overhead.
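To make that concrete, here is a minimal sketch of applying such a config string through Caliper's C++ ConfigManager, assuming Caliper was built with MPI support; the runtime-report spec and the "main_work" region name are illustrative, so substitute whatever config and annotations the app already uses:

```cpp
#include <caliper/cali.h>
#include <caliper/cali-manager.h>
#include <mpi.h>
#include <iostream>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);

    cali::ConfigManager mgr;
    // Only instrument the collectives expected on the main thread; the short
    // nonblocking calls (Iprobe, Test, Isend/Irecv) stay uninstrumented.
    mgr.add("runtime-report,profile.mpi,"
            "mpi.include=\"MPI_Allgather,MPI_Allgatherv,MPI_Allreduce\"");
    if (mgr.error())
        std::cerr << "Caliper config error: " << mgr.error_msg() << std::endl;
    mgr.start();

    CALI_MARK_BEGIN("main_work");   // annotation on the main thread only
    // ... application work; MPI collectives called from the main thread ...
    CALI_MARK_END("main_work");

    mgr.flush();                    // write out the collected profile
    MPI_Finalize();
    return 0;
}
```

The same config string can also be passed without code changes via the CALI_CONFIG environment variable if the app is already linked against Caliper.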

I can try to come up with some solutions to the memory issue, but I'd like to understand what is going on in Parthenon a bit better (e.g., is it only the MPI calls on the sub-threads, or some of your own annotations as well?).

As a side note, the code spends a lot of time in MPI_Comm_dup, which might be a performance bug.