smash-transport / smash-vhlle-hybrid

Event-by-event hybrid model for the description of relativistic heavy-ion collisions
https://smash-transport.github.io/smash-vhlle-hybrid/
GNU General Public License v3.0

Sampler crashes on cluster due to different node CPUs used for compiling and running #40

Open RobinSattler opened 3 weeks ago

RobinSattler commented 3 weeks ago

In my hybrid runs on Virgo, the sampler sometimes crashes. @NGoetz told me that the problem is caused by the different CPUs of the cluster nodes (Intel vs. AMD): if the processor of the node used to compile the sampler does not match the processor of the node used to run it, the sampler crashes. He suggested that it might make sense to include the sampler, and maybe even the whole hybrid framework, in the next container to avoid this problem.
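To check whether a mismatch of this kind is plausible, one can compare the CPU feature flags of the build node with those of the run node. The following is only a minimal sketch, assuming Linux x86 nodes where `/proc/cpuinfo` contains a `flags` line; the helper names `cpu_flags` and `missing_flags` and the sample flag lists are hypothetical and not part of the hybrid framework.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags found in /proc/cpuinfo-style text.

    Only the first 'flags' line is used; on a typical node all cores
    report the same flags.
    """
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def missing_flags(build_flags, run_flags):
    """Flags available on the build node but absent on the run node."""
    return sorted(set(build_flags) - set(run_flags))


# Hypothetical, shortened excerpts standing in for the real files.
build_node = "processor\t: 0\nflags\t\t: fpu sse2 avx2 avx512f\n"
run_node = "processor\t: 0\nflags\t\t: fpu sse2 avx2\n"

print(missing_flags(cpu_flags(build_node), cpu_flags(run_node)))
```

In practice, the text would be read from `/proc/cpuinfo` on each node. A non-empty result means that a binary tuned for the build node may use instructions the run node does not support.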

Currently, there are two workarounds to avoid these crashes.

Are there any other possibilities or ideas for solving this issue? I attached the Slurm outputs, but they are not very insightful, and @NGoetz said that, back in the day, he spent three weeks figuring out what the problem was and finding his workaround.

slurm_Hybrid_vtk_8014065.out.txt slurm_Hybrid_vtk_8014065.err.txt

gabriele-inghirami commented 3 weeks ago

Hi! I think that Niklas's approach is the best one, but one could think of two additional possibilities, one suitable for short jobs and the other for long jobs. The third alternative is to compile the sampler enabling the maximum set of features common to all CPUs. A safe choice is -march=x86-64, which we use on purpose for the container images. The price to pay is some performance loss, albeit probably limited. The fourth alternative, if your jobs run long enough that the compilation time is negligible compared to the whole runtime, is to compile the sampler at the beginning of each job on the node where it will run.
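The two alternatives above could be sketched roughly as follows. This is only an illustration, assuming that the sampler is built with CMake and that the architecture flag can be passed through `CMAKE_CXX_FLAGS`; the function names and the directory paths are hypothetical, and the real build may need further options.

```python
import subprocess


def build_commands(source_dir, build_dir, march="x86-64"):
    """Return the CMake configure and build commands as argument lists.

    march="x86-64" targets the baseline instruction set shared by all
    x86-64 CPUs (third alternative); march="native", used on the node
    where the job runs, tunes for that node only (fourth alternative).
    """
    configure = [
        "cmake", "-S", source_dir, "-B", build_dir,
        f"-DCMAKE_CXX_FLAGS=-march={march}",
    ]
    build = ["cmake", "--build", build_dir]
    return [configure, build]


def run_build(commands):
    """Execute the commands in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


# Hypothetical paths; printing the commands only, nothing is executed here.
for command in build_commands("sampler-src", "sampler-build"):
    print(" ".join(command))
```

For the per-job variant, `run_build(build_commands(..., march="native"))` would be called at the start of the job script, ideally with a build directory local to the job, so that concurrent jobs on different nodes do not overwrite each other's binaries.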

About the container: indeed, it would be convenient to have everything inside a container. However, please note that: