grimme-lab / enso

energetic sorting of conformer rotamer ensembles
https://xtb-docs.readthedocs.io/en/latest/enso_doc/enso.html
GNU Lesser General Public License v3.0

ENSO 2.0.2 and slurm #14

Closed debruinb closed 3 years ago

debruinb commented 3 years ago

Could it be that ENSO 2.0.2 doesn't terminate properly? The program runs fine from the command line on our cluster, and it also runs fine on a node when submitted via slurm (checked after logging in to the node), but after 'finishing', my slurm script is unable to copy the results back to my home folder. This seems to be because enso does not terminate properly once the calculations are done. I don't have this problem on our cluster with ENSO 1.2.7, but every version from 2.0 onward seems to show it.

fabothch commented 3 years ago

@debruinb What I changed from version 1.2.7 to 2.0 is the use of multiprocessing.Process instead of threading.Thread. In my opinion this should not affect whether the program terminates correctly. A quick and dirty solution is to call sys.exit(0) at the end of the script:

import sys

if __name__ == "__main__":
    main(argv=None)
    sys.exit(0)  # explicitly end the interpreter once main() returns

I am currently working on a rewrite of enso and will update this soon. Please reply if this fixed the issue and I will add it to the current version.
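
As general Python background (not a diagnosis of enso itself): when the interpreter exits, including via sys.exit(0), multiprocessing still waits for non-daemonic child processes, so an explicit exit call does not end the program while such a worker is alive. The sketch below shows this with a hypothetical stand-in script; the script text, the function name seconds_until_exit and the 2 s sleep are illustrative assumptions, not enso code.

```python
import subprocess
import sys
import tempfile
import textwrap
import time
from pathlib import Path

# Hypothetical stand-in for a program that starts a worker process and then
# calls sys.exit(0) while the worker is still busy. Not taken from enso.
SCRIPT = textwrap.dedent(
    """
    import multiprocessing
    import sys
    import time

    def worker():
        time.sleep(2)  # pretend the worker is still busy

    if __name__ == "__main__":
        proc = multiprocessing.Process(target=worker, daemon=DAEMON)
        proc.start()
        sys.exit(0)  # the interpreter still joins non-daemonic children
    """
)


def seconds_until_exit(daemon: bool) -> float:
    """Run the stand-in script and return how long the interpreter took to exit."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "exit_demo.py"
        path.write_text(SCRIPT.replace("DAEMON", str(daemon)))
        start = time.monotonic()
        subprocess.run([sys.executable, str(path)], check=True)
        return time.monotonic() - start


if __name__ == "__main__":
    print(f"non-daemonic worker: {seconds_until_exit(False):.1f} s")
    print(f"daemonic worker:     {seconds_until_exit(True):.1f} s")
```

With a non-daemonic worker the script only returns after the worker's sleep has finished; with daemon=True the worker is terminated at exit and the script returns sooner.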

best,

Fabian

debruinb commented 3 years ago

Thanks for the suggestion. Unfortunately, adding sys.exit(0) did not help. I actually seem to have the same problem on the command line (enso.py -run > enso.out 2> enso.error). After enso has finished (checked with top), the command line remains unresponsive until I press Ctrl+C. The exception is running in the background with "enso.py -run > enso.out 2> enso.error &", but I cannot use that in my slurm scripts, because then the other commands get executed before enso has finished and the files don't get copied back anyway.

debruinb commented 3 years ago

I should maybe add the information that I'm using enso with turbomole 7.5 (with which enso version 1.2.7 seems to work fine).

fabothch commented 3 years ago

Ok, that was worth a try! Just to clarify: is the output in the file enso.out complete, with no lines missing? Are you using export PYTHONUNBUFFERED=1? Do you see any python processes still running in top? I am looking into it and will try to reproduce it.

Can you use something like this in slurm?


   # run in the background so that $! holds the PID of enso
   enso.py -run > enso.out &
   pid=$!
   wait $pid

fabothch commented 3 years ago

> I should maybe add the information that I'm using enso with turbomole 7.5 (with which enso version 1.2.7 seems to work fine).

Ok, I have not tested TM 7.5 so far. I will give it a try later on. Do you see any turbomole related processes running in the background?

debruinb commented 3 years ago

No, the funny thing is that enso steers the calculations correctly (as it seems). Turbomole calculates the shifts and coupling constants correctly. If I log in on the node (or run enso standalone from the command line) and copy the directory back to my home folder, all expected files are there and anmr works fine. There are no remaining ghost jobs of turbomole or anything else when I log in to the node and check with top (during the calculation they are running, of course). The only problem seems to be that enso somehow doesn't return a term signal, and hence the files are not copied back when using slurm.

debruinb commented 3 years ago

The slurm file to submit is this one:

#!/bin/bash
#SBATCH --mem=MaxMemPerNode
#SBATCH --export=ALL
#SBATCH --cpus-per-task=16
#SBATCH -p short
#SBATCH --time=00:05:00

wait
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_STACKSIZE=1000m
ulimit -s unlimited
export PARA_ARCH=SMP
source $TURBODIR/Config_turbo_env
export PARNODES=$SLURM_CPUS_PER_TASK

wait
WORKDIR=/scratch/$USER/ethane_enso_only_standalone-${SLURM_JOBID%%.*}
mkdir -p $WORKDIR
wait
cd $WORKDIR
wait
cp -rf $SLURM_SUBMIT_DIR/* .
wait
crest -nmr -g chcl3 -chrg 0
enso.py
export PYTHONUNBUFFERED=1
enso.py -run > enso.out 2> enso.error
wait
sleep 30
cp -rf * $SLURM_SUBMIT_DIR/
wait
rm -rf $WORKDIR
wait
cd $SLURM_SUBMIT_DIR
act_tag=`date|sed "s/  / 0/g"|cut -d" " -f2,3,6 --output-delimiter="_"`
echo $SLURM_SUBMIT_DIR>> /home/`whoami`/Jobs_finished.$act_tag
wait

debruinb commented 3 years ago

The above script doesn't copy back results to my home folder.

Running the above script in steps, the following does work: crest -nmr -g chcl3 -chrg 0, followed by enso.py (that is, with the line enso.py -run > enso.out 2> enso.error removed).

It goes wrong in the subsequent step: export PYTHONUNBUFFERED=1, followed by enso.py -run > enso.out 2> enso.error.

If the enso -run line is included, no data are copied back anymore.

fabothch commented 3 years ago

To be honest, I have never worked with slurm and can only guess whether this is enso or slurm related. I cannot reproduce the 'missing' term signal after running enso.py -run > enso.out with either TM version 7.4.1 or version 7.5 (I only checked part 1). My terminal responds instantly. Which python version are you using?

debruinb commented 3 years ago

Hmm, that's strange. I'm using python 3.6.6. I can confirm that on the command line (no slurm) there is no problem with only part 1 (the terminal is responsive after the job has finished). But with parts 1-4 it's different: after the job finishes, top shows no running jobs, but the terminal remains unresponsive. "ps -ef | grep bdebruin" gives me:

bdebruin 30795 11776 0 14:13 pts/49 00:00:00 python3 /home/bdebruin/software/XTB_633/enso.py -run

So enso is still running in the background even though the calculations are done. After Ctrl+C the ghost job disappears (checked again with ps -ef | grep bdebruin).
I will test parts 2-4 separately.
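
When a Python process lingers like this, one generic way to see where it is waiting is the standard-library faulthandler module. The sketch below is an assumption-laden illustration, not something enso provides: the helper name enable_traceback_on_signal is hypothetical, it would have to be called near the start of the script being debugged, and it relies on SIGUSR1, so it is POSIX only.

```python
import faulthandler
import os
import signal
import sys


def enable_traceback_on_signal(log_file=sys.stderr, signum=signal.SIGUSR1):
    """On `signum`, dump every thread's traceback to `log_file` and keep running.

    `log_file` must be a real file object (it needs a file descriptor).
    """
    faulthandler.register(signum, file=log_file, all_threads=True)
```

After calling the helper once at startup, sending the signal from another shell (kill -USR1 followed by the PID shown by ps) writes the current call stack to the chosen file without stopping the process.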

debruinb commented 3 years ago

Parts 1, 2 and 3 work fine. The problem seems to occur in part 4.

fabothch commented 3 years ago

OK, that narrows it down! I am looking at this now!

fabothch commented 3 years ago

The escf.out output has changed from TM 7.4.1 to TM 7.5, and this affects how the coupling constants are read. The readout produces the nmrprop.dat files, which are written to the NMR folders and contain only shielding constants and coupling constants. Can you check whether these nmrprop.dat files have been written?
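
A quick way to run this check over a whole ensemble could look like the sketch below. It assumes that the relevant folders are literally named NMR and sit somewhere below the current working directory; the exact directory layout is not stated in this thread, and the function name nmr_dirs_missing_file is hypothetical.

```python
from pathlib import Path


def nmr_dirs_missing_file(root, dir_name="NMR", file_name="nmrprop.dat"):
    """Return sorted paths of `dir_name` folders below `root` that lack `file_name`."""
    root = Path(root)
    return sorted(
        d for d in root.rglob(dir_name)
        if d.is_dir() and not (d / file_name).is_file()
    )


if __name__ == "__main__":
    missing = nmr_dirs_missing_file(".")
    for folder in missing:
        print(f"missing nmrprop.dat: {folder}")
    print(f"{len(missing)} NMR folder(s) without nmrprop.dat")
```

An empty result means every NMR folder found contains the file; any listed folder is one where the readout did not produce it.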

debruinb commented 3 years ago

Looks like you found the problem. I can't find nmrprop.dat in my NMR folders.

fabothch commented 3 years ago

Perfect! I separated the calculation from the readout (since a change in the printout can easily break the readout routine). This explains why your calculations run smoothly and the printout is there, but enso doesn't terminate.

This is easy to fix! Thanks for your patience and reporting the bug!
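
As a hedged illustration of this failure class (not enso's actual code or its fix): if a parent process waits for one report per worker and a worker's readout raises before reporting, the parent can wait indefinitely. The sketch below shows one defensive pattern, in which each worker always reports and the parent waits with a timeout. All names (read_couplings, worker, collect) and the "coupling" keyword check are hypothetical, and the "fork" start method is assumed, so it is POSIX only.

```python
import multiprocessing as mp
import queue


def read_couplings(text):
    """Hypothetical readout: raises if the expected keyword is absent."""
    if "coupling" not in text:
        raise ValueError("unexpected output format")
    return text.count("coupling")


def worker(text, results):
    """Report exactly once, whether the readout succeeds or raises."""
    try:
        results.put(("ok", read_couplings(text)))
    except Exception as exc:
        results.put(("error", repr(exc)))


def collect(texts, timeout=30.0):
    """Run one process per text; never wait longer than `timeout` per report."""
    ctx = mp.get_context("fork")  # POSIX only; assumed here for a Linux cluster
    results = ctx.Queue()
    procs = [ctx.Process(target=worker, args=(t, results)) for t in texts]
    for proc in procs:
        proc.start()
    reports = []
    for _ in procs:
        try:
            reports.append(results.get(timeout=timeout))
        except queue.Empty:
            reports.append(("error", "no report before timeout"))
    for proc in procs:
        proc.join(timeout=1.0)
        if proc.is_alive():
            proc.terminate()  # do not let a stuck worker block the exit
            proc.join()
    return sorted(reports)


if __name__ == "__main__":
    for status, detail in collect(["coupling coupling", "changed layout"]):
        print(status, detail)
```

The design choice here is that a format change surfaces as an explicit error report instead of a silent hang, so a batch script that runs afterwards can still copy results back.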

debruinb commented 3 years ago

Great! Looking forward to testing further once it's fixed (no hurry).

fabothch commented 3 years ago

I updated the master branch (not the release).

debruinb commented 3 years ago

Great! This solved everything! Version 2.0.3 works fine. Thanks a lot for the fix.