Open RobinSattler opened 3 weeks ago
Hi! I think that Niklas's approach is the best one, but there are two additional possibilities, one suited to short jobs and the other to long jobs. The third alternative is to compile the sampler enabling only the largest set of features common to all CPUs. A safe choice is -march=x86-64, which we deliberately use for the container images. The price to pay is some performance loss, though probably a limited one. The fourth alternative, if your jobs run long enough that the compilation time is negligible compared to the whole runtime, is to compile the sampler at the beginning of each job on the node where it will run.
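As a rough sketch of these two alternatives, a small shell helper could pick the -march flag and build the sampler accordingly. Everything here beyond -march=x86-64 is an assumption on my side: the function names are hypothetical, -march=native is just the usual GCC/Clang way to target the compiling node, and I assume the sampler is configured with CMake.

```shell
#!/bin/bash
# Sketch only: the function names, the use of CMake and the paths passed in
# are assumptions, not taken from the actual sampler setup.

# Map a build mode to a GCC/Clang -march flag.
#   portable: baseline x86-64, runs on Intel and AMD nodes (third alternative)
#   native:   tuned for the node doing the compilation (fourth alternative)
sampler_march_flag() {
    case "${1:-portable}" in
        portable) echo "-march=x86-64" ;;
        native)   echo "-march=native" ;;
        *)        echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Configure and build the sampler with the chosen flag.
# Arguments: MODE SOURCE_DIR BUILD_DIR
build_sampler() {
    local mode="$1" src="$2" build_dir="$3"
    local flag
    flag="$(sampler_march_flag "$mode")" || return 1
    cmake -S "$src" -B "$build_dir" \
          -DCMAKE_BUILD_TYPE=Release \
          -DCMAKE_CXX_FLAGS="$flag" || return 1
    cmake --build "$build_dir" --parallel "${SLURM_CPUS_PER_TASK:-1}"
}

# Inside a long-running job script one might call (SAMPLER_SRC is hypothetical):
#   build_sampler native "$SAMPLER_SRC" "$(mktemp -d)"
```

With the portable mode the binary can be built once and reused on any node; with the native mode the build has to happen inside the job itself, so it only pays off when the job is long.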
About the container: indeed, it would be convenient to have everything inside a container. However, please note that:
In my hybrid runs on Virgo the sampler sometimes crashes. @NGoetz told me that the problem is caused by the nodes of the cluster having different CPUs (Intel vs. AMD). If the processor of the node used to compile the sampler does not match the processor of the node used to run it, the sampler crashes. He suggested that it might make sense to include the sampler, and maybe even the whole hybrid framework, in the next container to avoid this problem.
Currently, there are two workarounds to avoid these crashes.
#SBATCH --constraint=intel
or
#SBATCH --constraint=amd
in your Slurm job scripts, which requests nodes that feature an Intel or AMD processor, respectively (for further insight, see the Virgo User Guide on feature constraints). This has to correspond to the CPU that was used for the sampler compilation. Are there any other possibilities or ideas for how to solve this issue? I attached the Slurm outputs, but they are not very insightful, and @NGoetz said he once spent three weeks figuring out what the problem was and finding his workaround.
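Since the Slurm outputs do not point at the cause, one option might be a small guard at the top of the job script that compares the CPU vendor of the node with the one the sampler was compiled for and stops with a readable message on a mismatch. This is only a sketch: cpu_vendor and check_vendor are hypothetical helper names, and it assumes a Linux node where /proc/cpuinfo contains a vendor_id line.

```shell
#!/bin/bash
# Sketch only: helper names are hypothetical; assumes Linux-style /proc/cpuinfo.

# Print "intel", "amd" or "unknown" for a cpuinfo-style file
# (defaults to /proc/cpuinfo of the current node).
cpu_vendor() {
    local cpuinfo="${1:-/proc/cpuinfo}"
    local id
    id="$(awk -F': *' '/^vendor_id/ {print $2; exit}' "$cpuinfo" 2>/dev/null)"
    case "$id" in
        GenuineIntel) echo intel ;;
        AuthenticAMD) echo amd ;;
        *)            echo unknown ;;
    esac
}

# Succeed only if the node's vendor matches the one used for compilation.
# Arguments: EXPECTED_VENDOR [CPUINFO_FILE]
check_vendor() {
    local expected="$1" actual
    actual="$(cpu_vendor "${2:-/proc/cpuinfo}")"
    if [ "$actual" != "$expected" ]; then
        echo "sampler was built for $expected, but this node is $actual" >&2
        return 1
    fi
}

# In a job script, before starting the sampler, one might write:
#   check_vendor intel || exit 1
```

This does not replace the --constraint line; it only turns a forgotten or mismatched constraint into an explicit error at the start of the job.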
slurm_Hybrid_vtk_8014065.out.txt slurm_Hybrid_vtk_8014065.err.txt