ulissigroup / vasp-interactive

GNU Lesser General Public License v2.1

Can't run VASP via srun #55

Open yinkaaiwu opened 10 months ago

yinkaaiwu commented 10 months ago

Hello, I am trying to run VASP calculations on a Slurm cluster using srun, but I have run into a strange issue. squeue reports my job as running (state "R"), but no VASP processes are actually started. VASP-interactive does not raise any errors either: it creates the initial input files and then gets stuck in the 'while self.process.poll() is not None' loop.
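
For reference, Popen.poll() returns None while the child process is still running and the exit code once it has terminated, so the value it reports tells a hanging launcher apart from one that exited early. The sketch below illustrates this with stand-in Python children instead of VASP; wait_with_deadline is a hypothetical helper and not part of vasp-interactive.

```python
import subprocess
import sys
import time


def wait_with_deadline(process, deadline_s, interval_s=0.05):
    """Poll `process` until it exits or `deadline_s` seconds have passed.

    Returns the exit code, or None if the process is still running.
    """
    start = time.monotonic()
    while process.poll() is None:  # None means "still running"
        if time.monotonic() - start > deadline_s:
            return None
        time.sleep(interval_s)
    return process.returncode


# Stand-in child that exits almost immediately with code 3 (not VASP).
quick = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(3)"])
quick_code = wait_with_deadline(quick, deadline_s=10.0)

# Stand-in child that outlives the deadline, like a launcher that hangs.
slow = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
slow_code = wait_with_deadline(slow, deadline_s=0.3)
slow.kill()
slow.wait()

print(quick_code, slow_code)
```

Logging the value returned by poll() next to the contents of vasp.out at the point where the calculator stalls would show whether srun is still alive or has already returned.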

I also tried using submitit with subprocess.Popen() and mpirun to reproduce what your _start_vasp_process() function does, and I hit the same problem. Only after changing the command from 'mpirun -np xx vasp_std' to 'vasp_std' was I able to start VASP, and then only as a single process.

I have tried many things and ruled out environment variables as the cause, but I still can't find the reason for this behavior. This may well not be a problem in your code and could instead be related to Slurm or mpirun, but others may have faced similar problems, so I am opening this issue in the hope of finding a solution. Thank you!

Below is the code I used to run VASP with srun:

from ase.optimize import BFGS
from vasp_interactive import VaspInteractive
from ase.db import connect

def runvasp(params, atoms, path):
    params['directory'] = path
    with VaspInteractive(**params) as vi:
        # atoms.set_calculator() is deprecated in recent ASE releases
        atoms.calc = vi
        dyn = BFGS(atoms=atoms,
                   maxstep=0.15,
                   trajectory=f'{path}/vasp_relaxation.traj',
                   logfile=f'{path}/vasp_BFGS.log')
        dyn.run(fmax=0.05, steps=2)
    return dyn.get_number_of_steps()

params = dict(
    system='VaspJet',
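    # srun options such as -J must come before the executable,
    # otherwise they are passed to vasp_gam as arguments
    command='srun -p CLUSTER -N 1 -n 48 -J test1 vasp_gam',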
    xc='PBE',
    lreal='Auto',
    kpts=(1, 1, 1),
    lmaxmix=4,
    encut=300,
    ismear=0,
    sigma=0.05,
    algo='fast',
    prec='Normal',
    nsw=2000,
    ibrion=-1,
    npar=4,
    isif=3,
    nwrite=1,
    lwave=False,
    lcharg=False,
    txt='vasp.out'
)

atoms1 = connect('./AuAgPt.db').get_atoms(id=1)
runvasp(params, atoms1, './test')
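
One possibility worth ruling out, stated here as an assumption rather than a diagnosis, is that the Python script is itself already running inside a Slurm allocation. In that case an srun in the command string creates a job step inside the existing job and may wait for resources that the parent step already holds. A minimal sketch for inspecting the standard Slurm environment variables follows; the helper names slurm_context and inside_allocation are hypothetical.

```python
import os

# Environment variables that Slurm sets for processes running inside a job.
SLURM_KEYS = ("SLURM_JOB_ID", "SLURM_STEP_ID", "SLURM_NTASKS", "SLURM_CPUS_PER_TASK")


def slurm_context(environ=None):
    """Return the Slurm-related variables found in `environ` (None if unset)."""
    environ = os.environ if environ is None else environ
    return {key: environ.get(key) for key in SLURM_KEYS}


def inside_allocation(environ=None):
    """True if the given environment looks like it belongs to a Slurm job."""
    return slurm_context(environ)["SLURM_JOB_ID"] is not None


# Report the context of the current process.
print(slurm_context())
print("inside a Slurm allocation:", inside_allocation())
```

Calling this just before runvasp() shows whether the srun in the command string is starting a new job or a step nested inside an existing one.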

This is the code I used with submitit and subprocess.Popen() to start VASP with mpirun:

import submitit
import time
from subprocess import Popen, PIPE

def startvasp(cwd):
    process = Popen(
        args='mpirun -np 48 vasp_gam',
        shell=True,
        stdin=PIPE,
        stdout=PIPE,
        stderr=PIPE,
        cwd=cwd,
        universal_newlines=True,
        bufsize=0
    )
    stdout, stderr = process.communicate()
    return process.pid, process.poll(), stdout, stderr

# executor is the submission interface (logs are dumped in the folder)
executor = submitit.AutoExecutor(folder="log_test")
# set timeout in min, and partition for running the job
executor.update_parameters(
    timeout_min=3600,
    slurm_partition="CLUSTER",
    nodes=1,
    tasks_per_node=1,
    cpus_per_task=48,
    slurm_setup=[
    ]
)
jobs = []
for i in range(1):
    executor.update_parameters(slurm_job_name="test1")
    job = executor.submit(startvasp, '/home/fwtop/vaspjet-test/part-1.0/10')
    # job = executor.submit(startvasp, '/home/wyk')
    jobs.append(job)

time.sleep(2)
print(jobs[0].get_info())
# result() blocks until the job finishes; fetch it once and reuse it
pid, returncode, stdout, stderr = jobs[0].result()
print((pid, returncode))
print(stdout)
print(stderr)
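
Because process.communicate() without a timeout blocks for as long as the launcher hangs, the submitit job above never returns anything to inspect. The sketch below is a bounded variant that kills the child after a deadline and returns whatever output was captured. It assumes stdin is not needed for this check (it is redirected from DEVNULL, unlike the original) and uses a stand-in Python child; run_with_timeout is a hypothetical name.

```python
import subprocess
import sys


def run_with_timeout(args, cwd=None, timeout_s=60.0):
    """Run `args`, returning (returncode, timed_out, stdout, stderr).

    If the child does not finish within `timeout_s` seconds it is killed
    and the output captured so far is returned.
    """
    process = subprocess.Popen(
        args,
        stdin=subprocess.DEVNULL,  # assumption: no input is sent to the child
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        cwd=cwd,
        universal_newlines=True,
    )
    try:
        stdout, stderr = process.communicate(timeout=timeout_s)
        timed_out = False
    except subprocess.TimeoutExpired:
        process.kill()
        stdout, stderr = process.communicate()  # collect what was written
        timed_out = True
    return process.returncode, timed_out, stdout, stderr


# Stand-in for a launcher that starts normally.
ok = run_with_timeout([sys.executable, "-c", "print('ready')"], timeout_s=30.0)

# Stand-in for a launcher that hangs.
hung = run_with_timeout(
    [sys.executable, "-c", "import time; time.sleep(30)"], timeout_s=0.3
)
```

Replacing the stand-in with the real launcher command inside startvasp() would turn a silent hang into a result with timed_out set, together with any stderr text the launcher produced before it stalled. This is only meant to check whether the launcher starts at all, not to drive VASP in interactive mode.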

This is what the working directory looks like:

(base) [redhat@gpu test]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               131   CLUSTER    test1    fwtop  R       0:43      1 hpc-1-806
(base) [redhat@gpu test]$ ls
ase-sort.dat  INCAR  KPOINTS  POSCAR  POTCAR  STOPCAR  vasp_BFGS.log  vasp.out  vasp_relaxation.traj