logsdail / carmm

Scripts for creation, manipulation and analysis of geometric and electronic structure of molecular models
GNU General Public License v3.0

Subprocess/$SLURM_NNODES error #185

Open robinsonmt1 opened 1 week ago

robinsonmt1 commented 1 week ago

When running ASE 3.23 on CPUs, the $SLURM_NNODES variable isn't resolved properly and the calculation crashes when it tries to run the srun command. The same calculation worked on GPUs. Along with the ASE version, the Python environment also had to be changed to 3.11 to match. PLUMED was in use when this error first came up, but the problem remains even with a simple atoms.get_total_energy() call.

Some possible suggestions for the source of the problem:

robinsonmt1 commented 1 week ago

mtd.txt (Py script) py-submission.txt (Slurm script) e58363490.txt (Standard error file) aims_err.txt (Aims error file)

robinsonmt1 commented 2 days ago

Okay, an update from today's group hack session and my own subsequent tests. We identified several separate issues which together caused a variety of crashes, so I tested them one by one. Here are my findings:

  1. Order of module load commands: No effect. In my submission script, I load mpi and python. The order in which I load these does change which compilers are loaded, but in itself it doesn't affect anything.
  2. Python version: Matters. I initially used version 3.11.2, but changed to 3.9.2. Version 3.9.2 is installed for the usual Intel compilers, which are loaded along with mpi, whereas 3.11.2 is only installed for the GNU compilers. With 3.11.2 the job therefore fails, and the cause depends on the order in which mpi and python are loaded: either the Python version is not installed for the loaded compilers, or the GNU compiler gets loaded and messes up the calculation.
  3. Loading the fhi-aims module: No effect. The same compilers are loaded and the same calculation is run whether or not this module is loaded.
  4. carmm/aims_calculator.py compute_forces argument: Matters. I think this might be a change in ASE 3.23. Previously, I haven't needed to specify compute_forces and left it as the default, which is the string "true". With that default, however, the calculation fails to read control.in; it completes if the argument is instead given as the boolean True.
  5. ASE calculators/genericfileio.py shell=True: Matters. Adding shell=True to the check_call() command in the run() function makes the job successful. Without this, the srun command can't find the file or directory of the aims executable.
  6. ASE calculators/genericfileio.py argv_command join: Matters. Just before the check_call() command, joining the argv_command into a string separated by spaces makes the job successful. Without this, the calculation fails at setup, having produced good control.in and geometry.in files, stating "No SCF steps present..." and it "raises StopIteration on empty file".
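
Items 5 and 6 are consistent with how Python's subprocess module behaves: when a command is passed as a list without a shell, nothing expands `$SLURM_NNODES`, so the literal text reaches the child process. The sketch below illustrates this with a hypothetical command string (`srun --nodes=$SLURM_NNODES aims.x` is an assumption, not the command ASE actually builds) and shows one shell-free alternative that expands variables per token with `os.path.expandvars`. The helper names `build_argv` and `echo_args` are made up for this illustration.

```python
import os
import shlex
import subprocess
import sys


def build_argv(command):
    """Split a command string into tokens and expand $VARS in each one,
    without handing the command to a shell."""
    return [os.path.expandvars(token) for token in shlex.split(command)]


def echo_args(argv):
    """Run a small Python child process that prints the arguments it received."""
    child = [sys.executable, "-c", "import sys; print(' '.join(sys.argv[1:]))"]
    return subprocess.check_output(child + argv, text=True).strip()


# Stand-in for the value Slurm would normally export inside a job.
os.environ.setdefault("SLURM_NNODES", "2")

# Hypothetical launch command; the real one comes from the ASE profile.
command = "srun --nodes=$SLURM_NNODES aims.x"

# No shell, no expansion: the child sees the literal "$SLURM_NNODES".
print(echo_args(shlex.split(command)))

# Expanded per token before launch: the child sees the number of nodes.
print(echo_args(build_argv(command)))
```

This may also explain why the join in item 6 is needed alongside `shell=True`: on POSIX, the subprocess documentation states that when `shell=True` is given a list, only the first item is used as the command string and the remaining items become arguments to the shell itself.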

Therefore, I think the following changes must be made: change the Python version to 3.9.2; change the default compute_forces argument in aims_calculator to the boolean True (or specify it in input scripts); and update genericfileio.py in ASE to pass shell=True and to join argv_command into a single string.
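
As a rough illustration of the compute_forces change, the string could be coerced to a boolean before the parameters reach the calculator. How carmm actually passes this argument to ASE is not shown in this thread, so `as_bool` and the `params` dictionary below are hypothetical names used only for this sketch.

```python
def as_bool(value):
    """Coerce common true/false spellings (including Fortran-style '.true.')
    to a Python bool, and reject anything ambiguous."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        text = value.strip().strip(".").lower()
        if text in ("true", "t", "yes"):
            return True
        if text in ("false", "f", "no"):
            return False
    raise ValueError(f"Cannot interpret {value!r} as a boolean")


# Hypothetical parameter dictionary; the real defaults live in carmm.
params = {"xc": "pbe", "compute_forces": "true"}
params["compute_forces"] = as_bool(params["compute_forces"])
print(params)
```

Raising on unrecognised values, rather than defaulting to False, makes a misspelt setting fail early instead of silently disabling force evaluation.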