Open jeinstei opened 2 years ago
After posting on the elegant forums, it sounds like gpuElegant in a production environment is relatively high risk, and probably shouldn't be in production images due to issues with CUDA, simulation data matching, and others.
Issue: https://www3.aps.anl.gov/forums/elegant/viewtopic.php?p=5180
The reply I received stated that CUDA 11 has issues with it, and that using it requires first having a successful non-GPU simulation completed. I'm going to check around with the elegant users and see if anyone has an urgent need for it.
No response from the forums yet, but I was just thinking about possible automated offloading using newer compilers, similar to how NERSC does things: https://docs-dev.nersc.gov/cgpu/software/compilers/
It should be readily possible to write the standard codebase in a way that supports such optimizations, instead of needing a separate build toolchain with custom CUDA kernels.
After some experimentation, we're finding that `gpu-elegant` seems to be failing for certain configurations. In particular, the particles don't seem to make it through the initial beamline elements, with subtleties as to what is going wrong. We're not quite sure where it is failing yet, but @cchall has some examples (see attached). TL;DR, `elegant` is losing particles.

@cchall built some test files using a csbend component that highlight some of this.
I've rebuilt the newest version of `elegant` on `gpu-jupyter` (2021.4) and it shows the same behavior; the default install is 2021.1. I'm currently awaiting account activation for the `elegant` users' forum to post about the bug.

There's also a local build issue with `elegant` and CUDA 11 w/ GCC 11 that is known on other projects, likely due to a C++ standard versioning issue (C++17 vs C++14). I'm debugging that myself so I can build `gpu-elegant` locally on another system and test whether the problem is the version in our container or the code itself. I still need to try building with C++17 instead, which requires changing some flags.

The (draft) build process generally follows the procedure in this Google doc, which is currently only available to RadiaSoft personnel: elegant build notes
All commands were run as
$ <binary> tracking.ele > <logfile>.log 2>&1
2021.1 non-GPU elegant (ele.log):
2021.1 gpu-elegant (gpu-standard-ele.log):
2021.4 gpu-elegant:
And an additional fun compilation message: