Open edansi opened 5 months ago
It looks like a custodian error but I am not sure.
There was this bug: https://github.com/materialsproject/custodian/issues/340
In this example, custodian did indeed correct something. But in other calculations I tried, custodian didn't make any corrections and the database still wasn't updated, so it seems like it's not the same problem.
I'm not quite sure where to look for indications of the error since the output looks normal. But if you have any ideas, I'm happy to try them!
I had a similar issue with the job state update, but it only happened from time to time, so I lived with the lpad detect_lostruns solution. Then I switched to jobflow-remote (for other reasons), so I never really figured out the origin of this issue.
Yes, that's also what my coworker experienced. Unfortunately, with my calculation it happens every time so detect_lostruns is not so helpful :/
What does the timing look like? Does the database insertion finish completely within the allotted time, or is there only enough time to finish the VASP run but not the whole database insertion? This happened to me before.
The vasp run usually finishes with plenty of time. The calculation takes something around 2h to finish but it has 12h available.
@utf @Zhuoying @janosh any ideas?
@edansi, do the VASP calculation files get gzipped for the hanging calculation?
no they don't get zipped.
Ok, that would imply that the issue is not with database insertion since the VASP job has not yet got to the gzipping part. I agree this could be an issue with custodian. Potentially it was not able to kill the VASP processes successfully.
You could try writing a python script to run custodian in a directory containing the INCAR, KPOINTS, POSCAR and POTCAR, and check whether it finishes successfully. E.g., essentially run the contents of this function: https://github.com/materialsproject/atomate2/blob/06e4a715037ac1a86d7bfe3af5fb6b75236123bc/src/atomate2/vasp/run.py#L84
@utf I ran the calculation, and the vasp calculation finished. How can I see if the custodian killed the VASP processes correctly?
I get all the output files, a custodian.json and a std_err.txt. After the Vasp calculation finishes, the slurm job continues running until the time limit.
Hi @edansi, if you check your custodian.json file (or share it here), there should be a set of actions for each custodian run. If actions is an empty list, then custodian will kill the job, which sounds like what is happening here.
You can check which errors were caught and which corrective actions were taken with the following code snippet:
import json
with open("custodian.json", "r") as f:
    cust_logs = json.load(f)
# One line per custodian run: the errors caught and the actions taken
for idx, run in enumerate(cust_logs):
    print(idx + 1, [(correc["errors"], correc["actions"]) for correc in run["corrections"]])
My guess is that LargeSigmaHandler couldn't lower the smearing enough to get your job to run.
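As a rough sketch of that check, the loop above can be wrapped in a small helper that returns the corrections instead of printing them, so runs without any corrections are easy to spot. The helper name `summarize_corrections` and the sample data are made up for illustration; the only assumption about custodian.json is the layout used in the snippet above (a list of runs, each with a "corrections" list whose entries carry "errors" and "actions").

```python
import json


def summarize_corrections(cust_logs):
    """Return one (run_number, errors, n_actions) tuple per correction.

    Hypothetical helper; assumes each run is a dict with a "corrections"
    list whose entries have "errors" and "actions" keys.
    """
    summary = []
    for idx, run in enumerate(cust_logs, start=1):
        for correc in run.get("corrections", []):
            summary.append((idx, list(correc["errors"]), len(correc["actions"])))
    return summary


# Made-up sample data in the assumed layout, not taken from a real run.
sample_logs = [
    {"corrections": []},
    {"corrections": [{"errors": ["eddrmm"], "actions": [{"dict": "INCAR"}]}]},
]
print(summarize_corrections(sample_logs))

# For a real file:
# with open("custodian.json") as f:
#     print(summarize_corrections(json.load(f)))
```

An empty result would mean that no handler reported an error in any run.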
@esoteric-ephemera thanks, I attached the custodian file and also the python code to submit. The file has neither errors nor actions, doesn't this mean that no errors occurred?
from atomate2 import SETTINGS
from os.path import expandvars
import shlex
import logging
import os
from custodian.vasp.validators import VaspFilesValidator, VasprunXMLValidator
from custodian.vasp.jobs import VaspJob
from custodian import Custodian
from custodian.vasp.handlers import (
    FrozenJobErrorHandler,
    IncorrectSmearingHandler,
    KspacingMetalHandler,
    LargeSigmaHandler,
    MeshSymmetryErrorHandler,
    NonConvergingErrorHandler,
    PositiveEnergyErrorHandler,
    PotimErrorHandler,
    StdErrHandler,
    UnconvergedErrorHandler,
    VaspErrorHandler,
    WalltimeHandler,
)
logger = logging.getLogger(__name__)
# Default handlers
DEFAULT_HANDLERS = (
    VaspErrorHandler(),
    MeshSymmetryErrorHandler(),
    UnconvergedErrorHandler(),
    NonConvergingErrorHandler(),
    PotimErrorHandler(),
    PositiveEnergyErrorHandler(),
    FrozenJobErrorHandler(),
    StdErrHandler(),
    LargeSigmaHandler(),
    IncorrectSmearingHandler(),
    KspacingMetalHandler(),
)
DEFAULT_HANDLERS = [*DEFAULT_HANDLERS, WalltimeHandler(wall_time=43200)] # walltime 12h in s
_DEFAULT_VALIDATORS = (VasprunXMLValidator(), VaspFilesValidator())
vasp_cmd = 'srun vasp_std'
# vasp job
vasp_job_kwargs = {}
vasp_cmd = expandvars(vasp_cmd)
split_vasp_cmd = shlex.split(vasp_cmd)
vasp_job_kwargs.setdefault("auto_npar", False)
jobs = [VaspJob(split_vasp_cmd, **vasp_job_kwargs)]
# Custodian
custodian_manager = Custodian(
    DEFAULT_HANDLERS,
    jobs,
    validators=_DEFAULT_VALIDATORS,
    max_errors=SETTINGS.VASP_CUSTODIAN_MAX_ERRORS,
)
logger.info("Running VASP using custodian.")
custodian_manager.run()
Hey @edansi, yes, your custodian file indicates no errors were raised.
I'm confused about the "python script to submit" part: if you're adding jobs to your fireworks database, you want to launch them through fireworks. The code snippet you sent only runs a job with custodian, and doesn't handle any of the automated file writing, parsing, etc. It also looks like you were running with fireworks previously.
Are you submitting jobs to your job scheduler using the command line interface with qlaunch?
For debugging purposes, it might be better to completely eliminate the database insertion step / fireworks to see why the jobs aren't running. You can do that by manually submitting a job that runs this:
from jobflow import run_locally, JobStore
from maggma.stores import MemoryStore
# In-memory stores only, so no database is involved
flow_response = run_locally(
    < generic atomate2 flow >,
    create_folders=True,
    ensure_success=True,
    store=JobStore(MemoryStore(), additional_stores={"data": MemoryStore()}),
)
@esoteric-ephemera I misunderstood your comment before, my last answer was referring to what @utf was suggesting. Yes, in the custodian.json from my original example there's an action for the LargeSigmaHandler. But I don't think it's related to this: first, because the VASP output files look normal and finished, and second, because I have calculations where I don't get this error and it still doesn't work.
I ran the job locally as you suggested with a DoubleRelaxMaker() and it worked, it also zipped everything. There were some custodian actions but they didn't stop the job from finishing. Does this mean the problem lies somewhere else?
This is the job.error file from the local run
Switching to atp/3.14.5.
Switching to cray-mpich/7.7.18.
Switching to craype/2.7.10.
Switching to modules/3.2.11.4.
Switching to nvidia/21.3.
Switching to perftools-base/21.09.0.
Switching to pmi/5.0.17.
WARNING in EDDRMM: call to ZHEGV failed
ERROR:custodian.custodian:VaspErrorHandler
INFO:jobflow.core.job:Finished job - relax 1 (6afd7b4c-1d95-4cb3-b4c7-ade6766a53b8)
WARNING:jobflow.managers.local:Response.stored_data is not supported with local manager.
INFO:jobflow.core.job:Starting job - relax 2 (5fd464d9-7826-4740-9ba4-4f74c22449bb)
INFO:jobflow.core.job:Finished job - relax 2 (5fd464d9-7826-4740-9ba4-4f74c22449bb)
WARNING:jobflow.managers.local:Response.stored_data is not supported with local manager.
INFO:jobflow.managers.local:Finished executing jobs locally
and the corrections custodian took in the local run are:
"corrections": [
    {
        "errors": [
            "eddrmm"
        ],
        "actions": [
            {
                "dict": "INCAR",
                "action": {
                    "_set": {
                        "ALGO": "Normal"
                    }
                }
            },
            {
                "file": "CHGCAR",
                "action": {
                    "_file_delete": {
                        "mode": "actual"
                    }
                }
            },
            {
                "file": "WAVECAR",
                "action": {
                    "_file_delete": {
                        "mode": "actual"
                    }
                }
            }
        ],
Great to hear and absolutely no worries. I suspect that the issue lies with your fireworks or jobflow JobStore configuration, but it's hard to say.
I usually re-export all of the yaml config file environment variables, ATOMATE2_CONFIG_FILE, FW_CONFIG_FILE, and JOBFLOW_CONFIG_FILE, in my_qadapter.yaml. Can you make sure these point to the right locations and that your MongoDB store is accessible?
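A quick way to test the first half of that, sketched under the assumption that it runs inside the same environment as the job (for example from the submission script): it only checks that each variable is set and points to an existing file, not that the MongoDB store is reachable. The helper name `check_config_files` is hypothetical.

```python
import os


def check_config_files(names, environ=None):
    """Map each variable name to 'unset', 'missing file', or 'ok'."""
    environ = os.environ if environ is None else environ
    status = {}
    for name in names:
        path = environ.get(name)
        if not path:
            status[name] = "unset"
        elif not os.path.isfile(os.path.expanduser(path)):
            status[name] = "missing file"
        else:
            status[name] = "ok"
    return status


# The three variables named above
CONFIG_VARS = ("ATOMATE2_CONFIG_FILE", "FW_CONFIG_FILE", "JOBFLOW_CONFIG_FILE")
for name, state in check_config_files(CONFIG_VARS).items():
    print(f"{name}: {state}")
```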
Regarding "There were some custodian actions but it didn't stop the job from finishing": this behavior is custodian working as intended, which is also a good sign.
@esoteric-ephemera setting the environment variables doesn't change anything.
I realized that one of my calculations actually worked at some point, so I tried to figure out what was different. When I reran the local job with my kpoints set, I got an error from the monty package, which I solved by switching to a different version. For a short moment I thought the problem was solved, but it still doesn't fix it for all my calculations.
So now I'm setting up a new environment from scratch to see if that helps.
Do you have any other idea what I could try?
It's hard to say what the issue is without more info. My guess is that your fireworks is fine, since the first screenshot you sent is at fw_id > 2000 (so it worked at some point), and that jobflow is the culprit.
To test this, let's take atomate2 out of the equation and just use jobflow and fireworks:
from jobflow import job, Response
@job
def simple_job(x):
    # Replace this job with a new one until x reaches 100
    if x < 100:
        new_job = simple_job(x + 2)
        return Response(replace=new_job, output=new_job.output)
    else:
        return 100
if __name__ == "__main__":
    from fireworks import LaunchPad
    from jobflow.managers.fireworks import flow_to_workflow
    fw_job = flow_to_workflow(simple_job(2))
    lpad = LaunchPad.auto_load()
    lpad.add_wf(fw_job)
Can you add this to your fw database and run it on hpc? (just on a debug or shared queue, I know it's a terrible use of compute)
Thanks for helping me out, I really appreciate it :) I ran the code you suggested and it ends in a FIZZLED state with the following error message:
Traceback (most recent call last):
  File "/users/esimmen/miniconda3/envs/2407_new_atomate2/lib/python3.9/site-packages/fireworks/core/rocket.py", line 261, in run
    m_action = t.run_task(my_spec)
  File "/users/esimmen/miniconda3/envs/2407_new_atomate2/lib/python3.9/site-packages/jobflow/managers/fireworks.py", line 177, in run_task
    response = job.run(store=store)
  File "/users/esimmen/miniconda3/envs/2407_new_atomate2/lib/python3.9/site-packages/jobflow/core/job.py", line 600, in run
    response = function(*self.function_args, **self.function_kwargs)
TypeError: 'dict' object is not callable
My bad, I forgot to mention that the function fireworks calls has to live in your PYTHONPATH. If you make a file called jobflow_debug.py which contains:
from jobflow import job, Response
@job
def simple_job(x):
    # Replace this job with a new one until x reaches 100
    if x < 100:
        new_job = simple_job(x + 2)
        return Response(replace=new_job, output=new_job.output)
    else:
        return 100
and ensure this file lives in your PYTHONPATH environment variable, and then add it to your fw launchpad as:
from fireworks import LaunchPad
from jobflow.managers.fireworks import flow_to_workflow
from jobflow_debug import simple_job
fw_job = flow_to_workflow(simple_job(2))
lpad = LaunchPad.auto_load()
lpad.add_wf(fw_job)
Double checked that this approach works on my end.
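If it still fizzles, one thing worth ruling out is that the module is not visible to the Python interpreter on the compute node. A minimal sketch using only the standard library, assuming the file is named jobflow_debug.py as above (the helper name `is_importable` is just for illustration):

```python
import importlib.util
import sys


def is_importable(module_name):
    """Return True if Python can locate module_name on the current sys.path."""
    return importlib.util.find_spec(module_name) is not None


print("jobflow_debug importable:", is_importable("jobflow_debug"))
print("first sys.path entries:", sys.path[:3])
```

Running this inside the batch job, rather than on the login node, checks the environment that fireworks actually sees.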
Thanks! So, if I run the simple_job, it finishes without problems and the job changes to COMPLETED in the database :/ The problem doesn't seem to be there either.
When I opened the issue, the cluster where this problem occurred was the only running HPC resource, but now I can finally run on the other cluster again. So it's also fine if we don't solve this problem. Sorry for the inconvenience.
Describe the bug When I run a double relaxation workflow with VASP on SmNiO3, the first relaxation finishes but the database doesn't update the status to completed or start the second relaxation step. If I do a static calculation on the same structure, everything runs fine and the database updates to completed once it's finished. If I run the double relaxation on a simple BaTiO3 cubic unit cell (5 atoms), the double relaxation is updated in the database and the status changes to completed once it's finished.
There are no unusual errors in the fireworks out or error files, and the vasp.out file shows the calculation finished without problems.
I'm not sure if the problem occurs at the atomate2 level, the fireworks level, or somewhere else. But thanks already for your help!
To Reproduce I'm using
I tried different settings to submit the calculations, this is the last one I tried and the shortest code. It's for a RelaxBandStructure workflow but it also fails at the initial relaxation.
here's the dict of the starting structure:
Expected behavior I would expect the database to update to COMPLETED or FIZZLED once the vasp calculation finishes but it stays at RUNNING. It only updates once I use the command lpad detect_lostruns.
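To see which fireworks are in that state without the command line, something like the following sketch could be used. It assumes an object that behaves like fireworks' LaunchPad, whose get_fw_ids method accepts a MongoDB-style query dict; the FakeLaunchPad class and the helper name `running_fw_ids` are stand-ins for illustration only, so the example runs without a database.

```python
def running_fw_ids(lpad):
    """Return the sorted ids of fireworks whose state is RUNNING.

    lpad is expected to behave like fireworks' LaunchPad (get_fw_ids with a
    query dict); this helper itself is hypothetical.
    """
    return sorted(lpad.get_fw_ids({"state": "RUNNING"}))


class FakeLaunchPad:
    """Stand-in for illustration only, not part of fireworks."""

    def __init__(self, states):
        self.states = states  # maps fw_id -> state string

    def get_fw_ids(self, query):
        return [fw_id for fw_id, state in self.states.items()
                if state == query["state"]]


fake = FakeLaunchPad({1: "COMPLETED", 2: "RUNNING", 3: "RUNNING"})
print(running_fw_ids(fake))

# With a real setup (assumes a working FW_CONFIG_FILE):
# from fireworks import LaunchPad
# print(running_fw_ids(LaunchPad.auto_load()))
```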
Screenshots Fireworks out file
Fireworks error file