materialsproject / custodian

A simple, robust and flexible just-in-time job management framework in Python.
MIT License
140 stars 108 forks source link

Add a VASP handler (warning) for too high NBANDS #324

Closed Andrew-S-Rosen closed 3 months ago

Andrew-S-Rosen commented 6 months ago

Summary

Closes #224.

If NBANDS is set to an unphysically high value, the resulting electronic structure properties can be completely erroneous. This is becoming more of a problem lately because VASP will automatically set NBANDS to the number of cores on a machine for small systems, and machines nowadays can have many cores (Perlmutter has 128 cores per node). Running a molecule like CO with 128 NBANDS is a major problem.

This handler checks to see if VASP automatically changed NBANDS due to parallelization. If it did, then it checks if NBANDS > 2 times the NELECT value. If so, we raise a warning. Unfortunately, we can't do much more than that because VASP will override the INCAR even if the user manually specifies NBANDS. The solution is for the user to rerun with fewer cores.

 -----------------------------------------------------------------------------
|                                                                             |
|           W    W    AA    RRRRR   N    N  II  N    N   GGGG   !!!           |
|           W    W   A  A   R    R  NN   N  II  NN   N  G    G  !!!           |
|           W    W  A    A  R    R  N N  N  II  N N  N  G       !!!           |
|           W WW W  AAAAAA  RRRRR   N  N N  II  N  N N  G  GGG   !            |
|           WW  WW  A    A  R   R   N   NN  II  N   NN  G    G                |
|           W    W  A    A  R    R  N    N  II  N    N   GGGG   !!!           |
|                                                                             |
|     The number of bands has been changed from the values supplied in        |
|     the INCAR file. This is a result of running the parallel version.       |
|     The orbitals not found in the WAVECAR file will be initialized with     |
|     random numbers, which is usually adequate. For correlated               |
|     calculations, however, you should redo the groundstate calculation.     |
|     I found NBANDS = 4. Now, NBANDS = 64.                                   |
|                                                                             |
 -----------------------------------------------------------------------------
yang-ruoxi commented 6 months ago

thanks @Andrew-S-Rosen ! There are a few things I would like to comment:

  1. "Large number of NBANDS causes erroneous results": Is this true? Adding empty orbitals in principle shouldn't matter. From what I understood, it is the opposite, where a lot of empty bands are needed to obtain converged results for many-body effects. The convergence for sure will be slow.

  2. The problem with CO here is that, before mentioning the unphysical NBANDS, is running a small molecule with too many cores (128!) for only 10 valence electrons. If one is using the default NCORE = 1, i.e. 1 core per orbital, it forces VASP to use 128 bands, that is over-parallelization resulting in inefficiency.

  3. NBANDS are not only dependent on NELECT, but also the total number of cores, and NCORE.

  4. Since NBANDS and NCORE are closely tied, a sanity check would be checking how many electrons are in the system, and hence the VASP command with respect to NELECT first. It ensure a sensible number of cores to run VASP with. That way, setting NCORE to a default value (1 or 2) would not lead NBANDS to explode.

Andrew-S-Rosen commented 6 months ago

Thanks for your input, @yang-ruoxi! All of your points are valid. That said, I'm still trying to figure out the best way to handle this in Custodian. Custodian can't handle anything about compute architecture, but we still want to make sure users are at the very least notified of potentially spurious results. I elaborate a bit more below.

"Large number of NBANDS causes erroneous results": Is this true? Adding empty orbitals in principle shouldn't matter. From what I understood, it is the opposite, where a lot of empty bands are needed to obtain converged results for many-body effects. The convergence for sure will be slow.

I will take some time to reproduce what I observed many months ago, but in short, I did observe erroneous behavior at times. In the CO example, there would be many spurious states in the DOS that would appear if NBANDS was >> NELECT. I believe @mkhorton had also observed some odd behavior in the past, but he'll have to chime in about that.

@esoteric-ephemera mentioned it might be a combination of too high NBANDS with not a high enough ENCUT. I'll explore that avenue as well.

The problem with CO here is that, before mentioning the unphysical NBANDS, is running a small molecule with too many cores (128!) for only 10 valence electrons. If one is using the default NCORE = 1, i.e. 1 core per orbital, it forces VASP to use 128 bands, that is over-parallelization resulting in inefficiency.

Yes, it is clear in this scenario that such calculations would be overly parallelized. Unfortunately, this is pretty common in high-throughput campaigns though because if you blindly run a bunch of calculations with a full node, you might do so on a small system inadvertently. Inefficiency is really a user problem though, so I'm not concerned about that from Custodian's perspective. Ignoring the NBANDS business, I don't think running with too many cores influences the DOS at all. It just makes the calculation slow.

NBANDS are not only dependent on NELECT, but also the total number of cores, and NCORE.

Yes. This is why I have checked the OUTCAR for the reported NBANDS value, which is always the one used by VASP (to the best of my knowledge). This should inherently take into account all of the factors you describe above. @mkhorton proposed doing a check to see if NBANDS > 2*NELECT simply as a rule-of-thumb for being "too high", especially in the scenario where VASP has automatically modified NBANDS.

Since NBANDS and NCORE are closely tied, a sanity check would be checking how many electrons are in the system, and hence the VASP command with respect to NELECT first. It ensure a sensible number of cores to run VASP with. That way, setting NCORE to a default value (1 or 2) would not lead NBANDS to explode.

I agree that the ideal approach would be for the user to run with fewer cores. However, Custodian doesn't have the ability to modify this in any clean way. That's why I have simply raised a warning.

Andrew-S-Rosen commented 6 months ago

@yang-ruoxi: Do you think raising NCORE in such a scenario might be an appropriate workaround? We could fetch the number of compute cores by parsing the OUTCAR file.

yang-ruoxi commented 6 months ago

@Andrew-S-Rosen, I see. It would be good to know what triggers the odd behaviors with high NBANDS to know what exactly needs to be done. But in reality it's hard to exhaust all possible scenarios, so warning is fine before it is pined down. And true, in the scope of custodian, it's hard to fine tune the vasp_cmd , so I suppose the considerations I mentioned are relevant but would apply outside of Custodian. I agree a workaround can be raising the NCORE and checking the OUTCAR, but I fear it would be an over correction in the scenario where high NBANDS wouldn't matter too much to the results. In cases we feel confident it should be corrected though (e.g. wrong combination of NBANDS + ENCUT), this could be a feasible solution.

Andrew-S-Rosen commented 6 months ago

Thanks for the input, @yang-ruoxi! That all makes sense. I'll go ahead and try to reproduce the spurious behavior with too high NBANDS, and we can take it from there. You're right that it'll be worthwhile to dig into the underlying cause.

JaGeo commented 6 months ago

I have also seen problems with this: convergence issues in general and also weird VASP failures.

Andrew-S-Rosen commented 4 months ago

Since it's not immediately clear to me the best way to correct such a calculation and since @yang-ruoxi has given me the blessing of "warning is fine before it is pined down", I'll call this PR ready to go. It doesn't do anything other than raise a warning if your number of bands is 2x higher than your number of electrons (unlikely to happen intentionally; usually only happens when running a small system with a large number of cores in a high-throughput campaign). If people see spurious warnings in their calculations, we can easily adjust this. In my view, this PR will basically do no harm and may potentially save someone a bit of headache.

esoteric-ephemera commented 4 months ago

It might be possible to correct this with custodian but a warning is probably more than enough. You'd have to read off the system info from OUTCAR (whether it's CPU or GPU, how many MPI ranks are used, how many OpenMP threads are used), and then try to modify the combination of {NCORE or NPAR, KPAR, NBANDS} on CPU builds and {NSIM, KPAR, NBANDS} on GPU builds. But playing around with parallelization can introduce other unexpected errors

A warning might be too low visibility but we can also add this to something like the pymatgen.io.validation add-on

Andrew-S-Rosen commented 4 months ago

Thanks for the input, @esoteric-ephemera. That was the exact conclusion I came to. It probably can be done, but I did not want to bother dealing with the messiness of parallelization. If someone cares more about this, I encourage them to give it a go 😅

Perhaps it will be too low visibility, but low is better than none! Not a super satisfying conclusion, but I unfortunately don't have the time to put into giving this a true fix. External validation could be interesting too.

Andrew-S-Rosen commented 4 months ago

@janosh: While we're on it, this one is ready to go. It is a very painless PR. It just raises a warning.