materialsproject / atomate2

atomate2 is a library of computational materials science workflows
https://materialsproject.github.io/atomate2/

BUG: database not updating after job is finished #912

Open edansi opened 5 months ago

edansi commented 5 months ago

Describe the bug When I run a double-relaxation workflow with VASP on SmNiO3, the first relaxation finishes but the database doesn't update the status to COMPLETED or start the second relaxation step. If I run a static calculation on the same structure, everything runs fine and the database updates to COMPLETED once it's finished. If I run the double relaxation on a simple cubic BaTiO3 unit cell (5 atoms), the database is also updated and the status changes to COMPLETED once it's finished.

There are no unusual errors in the fireworks out or error files, and the vasp.out file shows the calculation finished without problems.

I'm not sure whether the issue occurs at the atomate2 level, the fireworks level, or somewhere else. But thanks already for your help!

To Reproduce I'm using

I tried different settings to submit the calculations; this is the last one I tried and the shortest code. It's for a RelaxBandStructure workflow, but it also fails at the initial relaxation.

from atomate2.vasp.flows.core import RelaxBandStructureMaker
from fireworks import LaunchPad
from jobflow.managers.fireworks import flow_to_workflow

maker = RelaxBandStructureMaker()

# Create and configure the workflow
static_flow = maker.make(struct_sno_p21n)
static_flow.update_config({"manager_config": {"_fworker": "daint", "_category": "edith"}})

# Update metadata

metadata = {"project": "test", "material": "SmNiO3", "structure": "bulk smnio3, P21/n, insulating, A-AFM", "comment": "test if any workflow works"}
static_flow.update_metadata(metadata)
wf = flow_to_workflow(static_flow)
# Load LaunchPad and submit or print dry run message
lpad = LaunchPad.auto_load()
lpad.add_wf(wf)

here's the dict of the starting structure:

{'@module': 'pymatgen.core.structure',
 '@class': 'Structure',
 'charge': 0.0,
 'lattice': {'matrix': [[5.324368576291776, 0.0, -0.003893673932206574],
   [-3.330581888101319e-16, 5.439253, 3.330581888101319e-16],
   [0.0, 0.0, 7.5644]],
  'pbc': (True, True, True),
  'a': 5.32437,
  'b': 5.439253,
  'c': 7.5644,
  'alpha': 90.0,
  'beta': 90.04190000000001,
  'gamma': 90.0,
  'volume': 219.06946998896532},
 'properties': {},
 'sites': [{'species': [{'element': 'Sm',
     'oxidation_state': None,
     'spin': 0,
     'occu': 1.0}],
   'abc': [0.98879, 0.05191, 0.2499],
   'xyz': [5.264682404551545, 0.28235162322999996, 1.8864935341525735],
   'properties': {},
   'label': 'Sm1'},
  {'species': [{'element': 'Sm',
     'oxidation_state': None,
     'spin': 0,
     'occu': 1.0}],
   'abc': [0.011210000000000053, 0.94809, 0.7501],
   'xyz': [0.05968617174023077, 5.15690137677, 5.67401279191522],
   'properties': {},
   'label': 'Sm1'},
  {'species': [{'element': 'Sm',
     'oxidation_state': None,
     'spin': 0,
     'occu': 1.0}],
   'abc': [0.51121, 0.55191, 0.2501],
   'xyz': [2.721870459886119, 3.00197812323, 1.8898659549491168],
   'properties': {},
   'label': 'Sm1'},
  {'species': [{'element': 'Sm',
     'oxidation_state': None,
     'spin': 0,
     'occu': 1.0}],
   'abc': [0.48878999999999984, 0.44809, 0.7499],
   'xyz': [2.6024981164056564, 2.43727487677, 5.670640371118677],
   'properties': {},
   'label': 'Sm1'},
  {'species': [{'element': 'Ni',
     'oxidation_state': None,
     'spin': 5,
     'occu': 1.0}],
   'abc': [0.5, 0.0, 0.0],
   'xyz': [2.662184288145888, 0.0, -0.001946836966103287],
   'properties': {},
   'label': 'Ni1'},
  {'species': [{'element': 'Ni',
     'oxidation_state': None,
     'spin': -5,
     'occu': 1.0}],
   'abc': [0.0, 0.5, 0.5],
   'xyz': [-1.6652909440506596e-16, 2.7196265, 3.7822],
   'properties': {},
   'label': 'Ni1'},
  {'species': [{'element': 'Ni',
     'oxidation_state': None,
     'spin': -5,
     'occu': 1.0}],
   'abc': [0.5, 0.0, 0.5],
   'xyz': [2.662184288145888, 0.0, 3.7802531630338967],
   'properties': {},
   'label': 'Ni2'},
  {'species': [{'element': 'Ni',
     'oxidation_state': None,
     'spin': 5,
     'occu': 1.0}],
   'abc': [0.0, 0.5, 0.0],
   'xyz': [-1.6652909440506596e-16, 2.7196265, 1.6652909440506596e-16],
   'properties': {},
   'label': 'Ni2'},
  {'species': [{'element': 'O',
     'oxidation_state': None,
     'spin': 0,
     'occu': 1.0}],
   'abc': [0.0842, 0.4844, 0.255],
   'xyz': [0.44831183412376735, 2.6347741532, 1.9285941526549084],
   'properties': {},
   'label': 'O1'},
  {'species': [{'element': 'O',
     'oxidation_state': None,
     'spin': 0,
     'occu': 1.0}],
   'abc': [0.9158, 0.5156000000000001, 0.745],
   'xyz': [4.876056742168008, 2.8044788468000004, 5.631912173412886],
   'properties': {},
   'label': 'O1'},
  {'species': [{'element': 'O',
     'oxidation_state': None,
     'spin': 0,
     'occu': 1.0}],
   'abc': [0.4158, 0.9843999999999999, 0.245],
   'xyz': [2.21387245402212, 5.3544006532, 1.8516590103789887],
   'properties': {},
   'label': 'O1'},
  {'species': [{'element': 'O',
     'oxidation_state': None,
     'spin': 0,
     'occu': 1.0}],
   'abc': [0.5842, 0.015600000000000003, 0.755],
   'xyz': [3.1104961222696557, 0.08485234680000002, 5.708847315688805],
   'properties': {},
   'label': 'O1'},
  {'species': [{'element': 'O',
     'oxidation_state': None,
     'spin': 0,
     'occu': 1.0}],
   'abc': [0.695, 0.292, 0.039],
   'xyz': [3.700436160522784, 1.5882618759999998, 0.2923054966171165],
   'properties': {},
   'label': 'O2'},
  {'species': [{'element': 'O',
     'oxidation_state': None,
     'spin': 0,
     'occu': 1.0}],
   'abc': [0.30500000000000005, 0.708, 0.961],
   'xyz': [1.6239324157689916, 3.8509911239999997, 7.268200829450677],
   'properties': {},
   'label': 'O2'},
  {'species': [{'element': 'O',
     'oxidation_state': None,
     'spin': 0,
     'occu': 1.0}],
   'abc': [0.805, 0.792, 0.461],
   'xyz': [4.28611670391488, 4.307888376, 3.484053992484574],
   'properties': {},
   'label': 'O2'},
  {'species': [{'element': 'O',
     'oxidation_state': None,
     'spin': 0,
     'occu': 1.0}],
   'abc': [0.19499999999999984, 0.20800000000000002, 0.539],
   'xyz': [1.0382518723768954, 1.1313646240000002, 4.07645233358322],
   'properties': {},
   'label': 'O2'},
  {'species': [{'element': 'O',
     'oxidation_state': None,
     'spin': 0,
     'occu': 1.0}],
   'abc': [0.208, 0.199, 0.959],
   'xyz': [1.1074686638686893, 1.082411347, 7.253449715822101],
   'properties': {},
   'label': 'O3'},
  {'species': [{'element': 'O',
     'oxidation_state': None,
     'spin': 0,
     'occu': 1.0}],
   'abc': [0.792, 0.8009999999999999, 0.041000000000000036],
   'xyz': [4.216899912423087, 4.356841652999999, 0.3070566102456929],
   'properties': {},
   'label': 'O3'},
  {'species': [{'element': 'O',
     'oxidation_state': None,
     'spin': 0,
     'occu': 1.0}],
   'abc': [0.29200000000000004, 0.6990000000000001, 0.541],
   'xyz': [1.5547156242771984, 3.8020378470000002, 4.091203447211797],
   'properties': {},
   'label': 'O3'},
  {'species': [{'element': 'O',
     'oxidation_state': None,
     'spin': 0,
     'occu': 1.0}],
   'abc': [0.708, 0.301, 0.4590000000000001],
   'xyz': [3.769652952014577, 1.6372151529999999, 3.4693028788559985],
   'properties': {},
   'label': 'O3'}]}

Expected behavior I would expect the database to update to COMPLETED or FIZZLED once the VASP calculation finishes, but it stays at RUNNING. It only updates once I run the command lpad detect_lostruns.

Screenshots Fireworks out file (screenshot attached)

Fireworks error file (screenshot attached)

JaGeo commented 5 months ago

It looks like a custodian error but I am not sure.

There was this bug: https://github.com/materialsproject/custodian/issues/340

edansi commented 5 months ago

In this example, custodian indeed corrected something. But in other calculations I tried, custodian didn't make any corrections and the database still didn't update, so it seems it's not the same problem.

I'm not quite sure where to look for indications for the error since the output looks normal. But if you have any ideas, I'm happy to try it!

QuantumChemist commented 5 months ago

lpad detect_lostruns

I had a similar issue with the job state update, but it only happened from time to time so that I lived with the lpad detect_lostruns solution. Then I switched to jobflow-remote (for other reasons), so I never really figured out the origin of this issue.

edansi commented 5 months ago

Yes, that's also what my coworker experienced. Unfortunately, with my calculation it happens every time so detect_lostruns is not so helpful :/

JaGeo commented 5 months ago

What does the timing look like? Does the database insertion completely finish within the job's walltime, or is there only enough time to finish the VASP run but not the whole database insertion? This has happened to me before.

edansi commented 5 months ago

The VASP run usually finishes with plenty of time to spare. The calculation takes around 2 h, and the job has 12 h available.

JaGeo commented 5 months ago

@utf @Zhuoying @janosh any ideas?

utf commented 5 months ago

@edansi, do the VASP calculation files get gzipped for the hanging calculation?

edansi commented 5 months ago

@edansi, do the VASP calculation files get gzipped for the hanging calculation?

No, they don't get gzipped.

utf commented 5 months ago

Ok, that would imply the issue is not with database insertion, since the VASP job never got to the gzipping step. I agree this could be an issue with custodian; potentially it was not able to kill the VASP processes successfully.

You could try writing a python script to run custodian in a directory containing the INCAR, KPOINTS, POSCAR, POTCAR and check if it finishes successfully. E.g., essentially run the contents of this function: https://github.com/materialsproject/atomate2/blob/06e4a715037ac1a86d7bfe3af5fb6b75236123bc/src/atomate2/vasp/run.py#L84

edansi commented 4 months ago

@utf I ran the calculation, and the vasp calculation finished. How can I see if the custodian killed the VASP processes correctly?

I get all the output files, a custodian.json and a std_err.txt. After the Vasp calculation finishes, the slurm job continues running until the time limit.

esoteric-ephemera commented 4 months ago

Hi @edansi, if you check your custodian.json file (or share it here), there should be a set of actions for each custodian run. If actions is an empty list for a caught error, custodian couldn't correct it and will kill the job, which it sounds like is what happened.

You can check which errors were caught and which corrective actions were taken with the following code snippet:

import json

with open("custodian.json","r") as f:
  cust_logs = json.load(f)

for idx, run in enumerate(cust_logs):
  print(idx+1, [(correc["errors"], correc["actions"]) for correc in run["corrections"]])

My guess is that LargeSigmaHandler couldn't lower the smearing enough to get your job to run.

edansi commented 4 months ago

@esoteric-ephemera thanks, I attached the custodian file and also the python code I used to submit. The file has neither errors nor actions; doesn't this mean that no errors occurred?

custodian.json

from atomate2 import SETTINGS
from os.path import expandvars
import shlex
import logging
import os
from custodian.vasp.validators import VaspFilesValidator, VasprunXMLValidator
from custodian.vasp.jobs import VaspJob
from custodian import Custodian
from custodian.vasp.handlers import (
    FrozenJobErrorHandler,
    IncorrectSmearingHandler,
    KspacingMetalHandler,
    LargeSigmaHandler,
    MeshSymmetryErrorHandler,
    NonConvergingErrorHandler,
    PositiveEnergyErrorHandler,
    PotimErrorHandler,
    StdErrHandler,
    UnconvergedErrorHandler,
    VaspErrorHandler,
    WalltimeHandler,
)

logger = logging.getLogger(__name__)

# Default handlers
DEFAULT_HANDLERS = (
    VaspErrorHandler(),
    MeshSymmetryErrorHandler(),
    UnconvergedErrorHandler(),
    NonConvergingErrorHandler(),
    PotimErrorHandler(),
    PositiveEnergyErrorHandler(),
    FrozenJobErrorHandler(),
    StdErrHandler(),
    LargeSigmaHandler(),
    IncorrectSmearingHandler(),
    KspacingMetalHandler(),
)
DEFAULT_HANDLERS = [*DEFAULT_HANDLERS, WalltimeHandler(wall_time=43200)] # walltime 12h in s
_DEFAULT_VALIDATORS = (VasprunXMLValidator(), VaspFilesValidator())

vasp_cmd = 'srun vasp_std'

# vasp job
vasp_job_kwargs = {}

vasp_cmd = expandvars(vasp_cmd)
split_vasp_cmd = shlex.split(vasp_cmd)

vasp_job_kwargs.setdefault("auto_npar", False)

jobs = [VaspJob(split_vasp_cmd, **vasp_job_kwargs)]

# Custodian
custodian_manager = Custodian(
    DEFAULT_HANDLERS,
    jobs,
    validators=_DEFAULT_VALIDATORS,
    max_errors=SETTINGS.VASP_CUSTODIAN_MAX_ERRORS
)

logger.info("Running VASP using custodian.")
custodian_manager.run()
esoteric-ephemera commented 4 months ago

Hey @edansi, yes, your custodian file indicates no errors were raised.

I'm confused about the "python script to submit" part: if you're adding jobs to your fireworks database, you want to launch them through fireworks. The code snippet you sent only runs a job with custodian and doesn't handle any of the automated file writing, parsing, etc. It also looks like you were running with fireworks previously.

Are you submitting jobs to your job scheduler using the command line interface with qlaunch?

For debugging purposes, it might be better to completely eliminate the database insertion step / fireworks to see why the jobs aren't running. You can do that by manually submitting a job that runs this:

from jobflow import run_locally, JobStore
from maggma.stores import MemoryStore

flow_response = run_locally( < generic atomate2 flow >, 
    create_folders = True, 
    ensure_success = True,
    store = JobStore(MemoryStore(), additional_stores={"data": MemoryStore()})
)
edansi commented 4 months ago

@esoteric-ephemera I misunderstood your comment before; my last answer was referring to what @utf suggested. Yes, in the custodian.json from my original example there's an action for the LargeSigmaHandler. But I don't think it's related: first, because the VASP output files look normal and finished, and second, because I have calculations where I don't get this error and it still doesn't work.

I ran the job locally as you suggested with a DoubleRelaxMaker() and it worked; it also gzipped everything. There were some custodian actions, but they didn't stop the job from finishing. Does this mean the problem lies somewhere else?

This is the job.error file from the local run

Switching to atp/3.14.5.
Switching to cray-mpich/7.7.18.
Switching to craype/2.7.10.
Switching to modules/3.2.11.4.
Switching to nvidia/21.3.
Switching to perftools-base/21.09.0.
Switching to pmi/5.0.17.
WARNING in EDDRMM: call to ZHEGV failed
ERROR:custodian.custodian:VaspErrorHandler
INFO:jobflow.core.job:Finished job - relax 1 (6afd7b4c-1d95-4cb3-b4c7-ade6766a53b8)
WARNING:jobflow.managers.local:Response.stored_data is not supported with local manager.
INFO:jobflow.core.job:Starting job - relax 2 (5fd464d9-7826-4740-9ba4-4f74c22449bb)
INFO:jobflow.core.job:Finished job - relax 2 (5fd464d9-7826-4740-9ba4-4f74c22449bb)
WARNING:jobflow.managers.local:Response.stored_data is not supported with local manager.
INFO:jobflow.managers.local:Finished executing jobs locally

and the corrections the custodian took in the local run are:

        "corrections": [
            {
                "errors": [
                    "eddrmm"
                ],
                "actions": [
                    {
                        "dict": "INCAR",
                        "action": {
                            "_set": {
                                "ALGO": "Normal"
                            }
                        }
                    },
                    {
                        "file": "CHGCAR",
                        "action": {
                            "_file_delete": {
                                "mode": "actual"
                            }
                        }
                    },
                    {
                        "file": "WAVECAR",
                        "action": {
                            "_file_delete": {
                                "mode": "actual"
                            }
                        }
                    }
                ],
esoteric-ephemera commented 4 months ago

Great to hear, and absolutely no worries. I suspect the issue lies with your fireworks or jobflow JobStore configuration, but it's hard to say.

I usually re-export all of the yaml config file environment variables (ATOMATE2_CONFIG_FILE, FW_CONFIG_FILE, and JOBFLOW_CONFIG_FILE) in my_qadapter.yaml. Can you make sure these point to the right locations and that your MongoDB store is accessible?
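A sketch of what that re-export can look like in my_qadapter.yaml (all paths below are hypothetical placeholders, not your actual config):

```yaml
# my_qadapter.yaml (excerpt) -- replace the paths with your real config locations
pre_rocket: |
  export ATOMATE2_CONFIG_FILE=/path/to/config/atomate2.yaml
  export FW_CONFIG_FILE=/path/to/config/FW_config.yaml
  export JOBFLOW_CONFIG_FILE=/path/to/config/jobflow.yaml
```

Anything in pre_rocket runs in the batch job's shell before rlaunch starts, so the worker sees the same config as your login environment.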

There were some custodian actions, but they didn't stop the job from finishing.

This behavior is custodian working as intended, which is also a good sign

edansi commented 4 months ago

@esoteric-ephemera setting the environment variables doesn't change anything.

I realized that one of my calculations actually worked at some point, so I tried to figure out what was different. When I reran the local run with my k-points set explicitly, I got an error from the monty package, which I solved by updating to a different version. For a short moment I thought it was solved, but it still doesn't fix all my calculations.

So now I'm setting up a new environment from scratch to see if that helps.

Do you have any other idea what I could try?

esoteric-ephemera commented 4 months ago

It's hard to say what the issue is without more info. My guess is your fireworks setup is fine, since the first screenshot you sent is at fw_id > 2000 (so it worked at some point), and jobflow is the culprit.

To test this, let's take atomate2 out of the equation and just use jobflow and fireworks:

from jobflow import job, Response

@job
def simple_job(x):
    if x < 100:
        new_job = simple_job(x + 2)
        return Response(replace = new_job, output = new_job.output)
    else:
        return 100

if __name__ == "__main__":
    from fireworks import LaunchPad
    from jobflow.managers.fireworks import flow_to_workflow

    fw_job = flow_to_workflow(simple_job(2))
    lpad = LaunchPad.auto_load()
    lpad.add_wf(fw_job)

Can you add this to your fw database and run it on hpc? (just on a debug or shared queue, I know it's a terrible use of compute)
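For intuition, here's a plain-Python sketch (no jobflow, names are mine) of what that self-replacing chain computes: starting from x = 2, each "job" spawns a successor with x + 2 until x reaches 100, so the workflow should run 50 fireworks and end with output 100.

```python
def simulate_chain(x, step=2, limit=100):
    """Plain-Python stand-in for the self-replacing jobflow job above:
    each loop pass models one firework, the x += step branch models
    Response(replace=simple_job(x + step)), and reaching `limit`
    terminates the chain."""
    jobs_run = 0
    while True:
        jobs_run += 1
        if x < limit:
            x += step  # spawn the replacement job
        else:
            return limit, jobs_run

print(simulate_chain(2))  # -> (100, 50)
```

If the fireworks version fizzles or stalls partway through this chain, that points at the jobflow/fireworks plumbing rather than at atomate2 or VASP.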

edansi commented 4 months ago

Thanks for helping me out, I really appreciate it :) I ran the code you suggested and it ends in a FIZZLED state with the following error message:

Traceback (most recent call last):
  File "/users/esimmen/miniconda3/envs/2407_new_atomate2/lib/python3.9/site-packages/fireworks/core/rocket.py", line 261, in run
    m_action = t.run_task(my_spec)
  File "/users/esimmen/miniconda3/envs/2407_new_atomate2/lib/python3.9/site-packages/jobflow/managers/fireworks.py", line 177, in run_task
    response = job.run(store=store)
  File "/users/esimmen/miniconda3/envs/2407_new_atomate2/lib/python3.9/site-packages/jobflow/core/job.py", line 600, in run
    response = function(*self.function_args, **self.function_kwargs)
TypeError: 'dict' object is not callable
esoteric-ephemera commented 4 months ago

My bad, I forgot to mention that the function fireworks calls has to live in your PYTHONPATH. If you make a file called jobflow_debug.py which contains:

from jobflow import job, Response

@job
def simple_job(x):
    if x < 100:
        new_job = simple_job(x + 2)
        return Response(replace = new_job, output = new_job.output)
    else:
        return 100

and ensure the directory containing this file is on your PYTHONPATH environment variable, and then add it to your fw launchpad as:

from fireworks import LaunchPad
from jobflow.managers.fireworks import flow_to_workflow
from jobflow_debug import simple_job

fw_job = flow_to_workflow(simple_job(2))
lpad = LaunchPad.auto_load()
lpad.add_wf(fw_job)

Double checked that this approach works on my end.
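For the PYTHONPATH step, something like this in your shell profile (or in my_qadapter.yaml's pre_rocket) should do; the directory here is a placeholder:

```shell
# Hypothetical path: replace with the directory that contains jobflow_debug.py
export PYTHONPATH="/path/to/scripts:${PYTHONPATH}"
```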

edansi commented 4 months ago

Thanks! So, if I run the simple_job, it finishes without problems and the job changes to COMPLETED in the database :/ The problem doesn't seem to be there either.

When I opened the issue, the cluster where this problem occurred was the only running HPC resource, but now I can finally run on the other cluster again. So it's also fine if we don't solve this problem. Sorry for the inconvenience.