materialsproject / custodian

A simple, robust and flexible just-in-time job management framework in Python.
MIT License

VASP called by custodian is slower than directly called #153

Closed lixinyuu closed 4 years ago

lixinyuu commented 4 years ago

System

Summary

Our university has migrated from Ubuntu to RedHat, and I have run into a problem: VASP is slower when I call it from custodian than when I call it directly with "mpirun -np 16 vasp_std". Are there any possible reasons for this kind of weird behaviour?

Example code

from custodian.custodian import Custodian
from custodian.vasp.handlers import VaspErrorHandler
from custodian.vasp.jobs import VaspJob
import os

vasp = 'vasp_std'
node = os.environ['NCPUS']
vasp_cmd = ['mpirun', '-np', str(node), vasp]
handlers = [VaspErrorHandler()]
jobs = VaspJob(vasp_cmd, auto_npar=False, auto_gamma=False)
c = Custodian(handlers, [jobs], max_errors=10)
c.run()

With the same input files, here are the OUTCARs after one hour of running:

  1. Custodian - Only the first iteration finished

    
    First call to EWALD:  gamma=   0.147
    Maximum number of real-space cells 5x 5x 1
    Maximum number of reciprocal cells 2x 2x 7
    
    FEWALD:  cpu time    0.2542: real time    0.2548

--------------------------------------- Iteration 1( 1) ---------------------------------------

POTLOK:  cpu time    0.2242: real time    0.2320
SETDIJ:  cpu time    0.2767: real time    0.2774

  2. VASP called directly - Iteration 28 finished

energy without entropy = -126.53388816 energy(sigma->0) = -126.64889039


--------------------------------------- Iteration 1( 28) ---------------------------------------

POTLOK:  cpu time    0.2173: real time    0.2239
SETDIJ:  cpu time    0.0101: real time    0.0102


Any suggestions? Thanks.

shyuep commented 4 years ago

I have no idea what the environment variable NCPUs is. But I would suggest you just put 16 if that's the number you are going to use.

lixinyuu commented 4 years ago

@shyuep Thanks, Shyue. NCPUS is the number of CPUs requested in PBS. It is 16 as well, and I see the same header lines in the OUTCAR, shown below, which indicates that this shouldn't be the reason.

 vasp.5.4.4.18Apr17-6-g9f103f2a35 (build Mar 18 2020 13:00:55) complex          

 executed on             LinuxIFC date 2020.05.15  21:28:44
 running on   16 total cores
 distrk:  each k-point on   16 cores,    1 groups
 distr:  one band on NCORES_PER_BAND=   4 cores,    4 groups

In custodian, VASP is called at https://github.com/materialsproject/custodian/blob/883b491e6c6335a75ca88e2855948ccf4f881696/custodian/vasp/jobs.py#L278 Have you seen any case where subprocess.Popen(cmd) is slower than running cmd directly? Thanks.
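One thing that can be checked when comparing the two launch paths is the environment the child process actually sees: subprocess passes the parent's os.environ to the child unless env= is given. A minimal sketch of that check, using a hypothetical variable DEMO_THREADS and a child Python interpreter as a stand-in for mpirun/VASP (both are assumptions for illustration, not part of custodian):

```python
import os
import subprocess
import sys


def child_sees(var_name, env=None):
    """Start a child Python process and return the value it sees for var_name."""
    code = "import os, sys; sys.stdout.write(os.environ.get(sys.argv[1], '<unset>'))"
    result = subprocess.run(
        [sys.executable, "-c", code, var_name],
        env=env,              # None means: inherit the parent's environment
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# DEMO_THREADS is a made-up variable used only for this demonstration.
os.environ["DEMO_THREADS"] = "4"

# With env=None the child inherits whatever the Python process has.
inherited = child_sees("DEMO_THREADS")

# Passing env= explicitly overrides the value for the child only.
overridden = child_sees("DEMO_THREADS", env={**os.environ, "DEMO_THREADS": "1"})
```

Calling child_sees on the variables of interest from inside the job script, and comparing with what the shell reports, would show whether the two launch paths really run under the same settings.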

shyuep commented 4 years ago

I am not aware of any reason why subprocess would cause a slowdown compared to running cmd directly. As far as we have tested on our systems, this does not seem to be the case.

lixinyuu commented 4 years ago

Thanks, Professor. The problem has been identified: the environment variable OMP_NUM_THREADS needs to be set to 1 for VASP 5.4. I don't know the detailed mechanism behind this, but the problem has been solved. Thanks.
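For anyone hitting the same thing, one way to apply this is to set the variable in the job script itself before custodian starts VASP. The following is only a sketch under assumptions: the helper name run_with_custodian is hypothetical, the fallback of 16 for NCPUS is taken from the run above, and it assumes the MPI launcher forwards the variable to the ranks (this can depend on the MPI implementation). Exporting OMP_NUM_THREADS=1 in the PBS script should have the same effect.

```python
import os

# Pin OpenMP to one thread per MPI rank before any child process is started.
# Processes launched through subprocess inherit this value from os.environ.
os.environ["OMP_NUM_THREADS"] = "1"

# NCPUS is set by PBS; fall back to 16, the value used in this thread.
ncpus = os.environ.get("NCPUS", "16")
vasp_cmd = ["mpirun", "-np", ncpus, "vasp_std"]


def run_with_custodian(cmd):
    """Run VASP under custodian; call from the directory holding the inputs."""
    # Imported lazily so the settings above can be checked without custodian.
    from custodian.custodian import Custodian
    from custodian.vasp.handlers import VaspErrorHandler
    from custodian.vasp.jobs import VaspJob

    job = VaspJob(cmd, auto_npar=False, auto_gamma=False)
    Custodian([VaspErrorHandler()], [job], max_errors=10).run()
```

Calling run_with_custodian(vasp_cmd) from the job directory then launches the same job as in the original script, with the thread count fixed beforehand.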