Sampler crashes on cluster due to different node CPUs used for compiling and running

smash-transport / smash-vhlle-hybrid

Event-by-event hybrid model for the description of relativistic heavy-ion collisions

GNU General Public License v3.0


In my hybrid runs on Virgo the sampler sometimes crashes. @NGoetz told me that the problem is caused by the different CPUs of the cluster nodes (Intel vs. AMD). If the processor of the node used to compile the sampler does not match the processor of the node used to run it, the sampler crashes. He suggested that it might make sense to include the sampler, and maybe even the whole hybrid framework, in the next container to avoid this problem.

Currently, there are two workarounds to avoid these crashes.

The first one makes sense for a large number of runs or array jobs and is used by Niklas. He has two executables (one compiled on an Intel node, one on an AMD node) and adapts his hybrid handler configs (i.e., the executable for the sampler part) at run time so that the one matching the node is executed.
The second possibility is suitable for fewer runs. Simply use #SBATCH --constraint=intel or #SBATCH --constraint=amd in your Slurm job scripts, which requests nodes that feature an Intel or AMD processor, respectively (for further insight, see the Virgo User Guide on feature constraints). The constraint has to correspond to the CPU that was used to compile the sampler.
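
The first workaround could be sketched as a small bash helper that reads the CPU vendor of the current node and picks the matching binary. This is only an illustration, not Niklas's actual setup: the function names and the paths ./sampler_intel and ./sampler_amd are hypothetical, and how the chosen path is written into the hybrid handler config is left out because it is not described here.

```shell
#!/usr/bin/env bash
# Sketch: choose a sampler binary that matches the CPU vendor of the current node.
# The binary paths are hypothetical placeholders, not names used by the hybrid handler.

# Map a vendor_id string from /proc/cpuinfo to a short vendor name.
vendor_from_id() {
    case "$1" in
        GenuineIntel) echo "intel" ;;
        AuthenticAMD) echo "amd" ;;
        *)            echo "unknown" ;;
    esac
}

# Read the vendor_id of the first CPU listed on this node (Linux only).
detect_vendor() {
    local id
    id=$(awk -F': *' '/^vendor_id/ {print $2; exit}' /proc/cpuinfo 2>/dev/null)
    vendor_from_id "$id"
}

# Print the path of the sampler binary compiled for the given vendor.
# SAMPLER_INTEL and SAMPLER_AMD are assumed environment overrides.
sampler_for_vendor() {
    case "$1" in
        intel) echo "${SAMPLER_INTEL:-./sampler_intel}" ;;
        amd)   echo "${SAMPLER_AMD:-./sampler_amd}" ;;
        *)     echo "no sampler binary for CPU vendor '$1'" >&2; return 1 ;;
    esac
}
```

In a job script one would call something like sampler=$(sampler_for_vendor "$(detect_vendor)") || exit 1 before the handler config is prepared, so that an unexpected node type fails early instead of crashing inside the sampler.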

Are there any other possibilities or ideas for solving this issue? I attached the Slurm outputs, but they are not very insightful, and @NGoetz said it took him three weeks back then to figure out what the problem was and find his workaround.

slurm_Hybrid_vtk_8014065.out.txt slurm_Hybrid_vtk_8014065.err.txt

Hi! I think that Niklas's approach is the best one, but one could think of two additional possibilities, one suitable for short jobs and the other for long jobs. The third alternative is to compile the sampler enabling only the set of features common to all CPUs. A safe choice is -march=x86-64, which we use on purpose for the container images. The price to pay is some performance loss, albeit probably limited. The fourth alternative, if your jobs run long enough that the compilation time is negligible compared to the whole runtime, is to compile the sampler at the beginning of each job on the node where it will run.
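
To make the third and fourth alternatives concrete, here is a rough bash sketch that selects the -march flag for either a portable or a node-specific build. It assumes a GCC- or Clang-compatible compiler and a CMake-based build that honours CMAKE_CXX_FLAGS. The function names, the build mode names and the build_<mode> directory layout are hypothetical, and the real sampler build may need further options (for example dependency paths) that are not shown.

```shell
#!/usr/bin/env bash
# Sketch: pick the -march flag for a portable or a node-specific build.
# Assumes a GCC/Clang-compatible compiler and a CMake-based build;
# the build directory names are hypothetical placeholders.

# Print the architecture flag for the requested build mode.
march_flag() {
    case "$1" in
        portable) echo "-march=x86-64" ;;   # baseline x86-64, runs on Intel and AMD nodes
        native)   echo "-march=native" ;;   # tuned to the CPU of the compiling node only
        *)        echo "unknown build mode '$1'" >&2; return 1 ;;
    esac
}

# Configure and build the sources in $2 into a per-mode build directory.
build_sampler() {
    local mode="$1" src="$2" flag
    flag=$(march_flag "$mode") || return 1
    cmake -S "$src" -B "build_${mode}" \
          -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="$flag" &&
        cmake --build "build_${mode}" --parallel
}
```

Running build_sampler portable <source dir> once on any node would correspond to the third alternative, while calling build_sampler native <source dir> at the start of each job, on the node where the sampler will run, would correspond to the fourth.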

About the container: indeed, it would be convenient to have everything inside a container. However, please note that:

To be portable across the various CPUs, the container images do not exploit all the features of the most recent processors. Except for short simulations, it was recommended to recompile SMASH before using it (the container includes everything needed to do that quickly), but I am afraid that most people never do. One could consider adding a bash script to make this step easier and faster, improving performance without sacrificing user-friendliness.
The container image is already very fat. Making it obese is not necessarily bad, but one could consider preparing different types of containers for different purposes, like production, development and postprocessing. In particular, it might be good to have specialized executables for Virgo and a wrapper script that automatically selects the right one for a given node.

Sampler crashes on cluster due to different node CPUs used for compiling and running #40