mabarnes / moment_kinetics

Precompile on rank-0 first in run_moment_kinetics.jl #170

Closed johnomotani closed 9 months ago

johnomotani commented 9 months ago

When running from the command line using the run_moment_kinetics.jl script, call using moment_kinetics on rank-0 first, and call it on all other ranks only after that has completed. Hopefully this avoids precompilation clashes (when multiple MPI ranks try to precompile the same package simultaneously), which might corrupt files. It does not solve the clash issue for importing MPI itself, because using MPI has to run on every rank before the MPI commands can be used to synchronize rank-0 with the other processes, but it hopefully minimises the chance of problems.

johnomotani commented 9 months ago

An alternative (or additional) possible fix would be to add a line to submission scripts, e.g. if your command to run a simulation is

mpirun -np $SLURM_NTASKS julia --project -O3 --check-bounds=no run_moment_kinetics.jl $INPUT_FILE

you could add a line, modifying it to

julia --project -O3 --check-bounds=no -e "using Pkg; Pkg.precompile()"
mpirun -np $SLURM_NTASKS julia --project -O3 --check-bounds=no run_moment_kinetics.jl $INPUT_FILE

(It is important that the flags, here -O3 --check-bounds=no, are the same in the two commands, to avoid repeated precompilation.)

This alternative might actually speed things up a bit, as Pkg.precompile() should run the precompilation in parallel, whereas I think the implicit precompilation triggered when using moment_kinetics, etc., is called happens only in serial.
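As a rough sketch of the two-step submission described above, the flags can be kept in a single variable so that the precompile step and the MPI run cannot drift apart. The function name run_simulation and the DRY_RUN switch are hypothetical, introduced only for illustration, and the script name is taken from the title of this PR.

```shell
# Sketch only: run_simulation and DRY_RUN are hypothetical names, not part of
# moment_kinetics.

# Define the flags once so both commands are guaranteed to use the same ones.
JULIA_FLAGS="--project -O3 --check-bounds=no"

run_simulation() {
    # $1 = number of MPI ranks, $2 = input file
    # Precompile in a single process before any MPI ranks start.
    ${DRY_RUN:-} julia $JULIA_FLAGS -e 'using Pkg; Pkg.precompile()' || return 1
    # Launch the MPI job with exactly the same flags.
    ${DRY_RUN:-} mpirun -np "$1" julia $JULIA_FLAGS run_moment_kinetics.jl "$2"
}

# In a Slurm submission script:
#   run_simulation "$SLURM_NTASKS" "$INPUT_FILE"
```

Setting DRY_RUN=echo before calling the function prints the two commands instead of running them, which is a cheap way to check that the flags match.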

mrhardman commented 9 months ago

In order to deal with some of my precompile issues caused by working on multiple instances of moment_kinetics simultaneously, I have started using the command

cd path/to/moment_kinetics; export JULIA_DEPOT_PATH=$(pwd); julia --project

whenever I start julia from the command line or in a submission script, to make sure that the files normally kept in .julia are installed in a separate, self-contained location for each development project folder.

We might mention this environment variable in the README.md.
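A minimal sketch of the per-project depot idea, wrapped in a helper so it can be reused in submission scripts. The function name use_project_depot is hypothetical. The variable is exported here because a plain shell assignment is not passed on to the julia process unless the variable is already part of the environment.

```shell
# Sketch only: use_project_depot is a hypothetical helper name.
use_project_depot() {
    # $1 = path to the project folder
    cd "$1" || return 1
    # Export so that child processes such as julia see the per-project depot.
    JULIA_DEPOT_PATH="$(pwd)"
    export JULIA_DEPOT_PATH
}

# Usage (the path is a placeholder):
#   use_project_depot path/to/moment_kinetics && julia --project
```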

mrhardman commented 9 months ago

There seem to be many related discussions of similar issues on the Julia discourse. Noting here for future reference.

https://discourse.julialang.org/t/how-does-one-set-up-a-centralized-julia-installation/13922
https://discourse.julialang.org/t/run-a-julia-application-at-large-scale-on-thousands-of-nodes/23873/3
https://discourse.julialang.org/t/precompilation-error-using-hpc/17094
https://discourse.julialang.org/t/repeated-precompilation-on-a-cluster-makes-life-difficult/94194/12

etc...