libAtoms / workflow

python workflow toolkit
GNU General Public License v2.0

Error encountered when trying to submit remote jobs. #290

Closed: jungsdao closed this 4 months ago

jungsdao commented 5 months ago

I sometimes encounter this sort of error when trying to submit remote jobs (DFT calculations) to a cluster. I think it was resolved by deleting the jobs.db file or some leftover scratch files in the .expyre directory, but I wonder why this happens and whether there is a way to avoid the error altogether.

Traceback (most recent call last):
  File "/work/home/hjung/Calculation/4_Free_energy_calculation/1_Rh/2_CHO_foundation/finetune_foundation.py", line 790, in <module>
    main(ecutwfc = ecutwfc, cv_range = args.cv_range, verbose=True)
  File "/work/home/hjung/Calculation/4_Free_energy_calculation/1_Rh/2_CHO_foundation/finetune_foundation.py", line 676, in main
    run_dft(files["fps"], files["dft"], ecutwfc = ecutwfc)
  File "/work/home/hjung/Calculation/4_Free_energy_calculation/1_Rh/2_CHO_foundation/finetune_foundation.py", line 300, in run_dft
    generic_calc(inputs = in_config, 
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hjung/miniforge3/envs/foundation/lib/python3.11/site-packages/wfl/calculators/generic.py", line 149, in calculate
    return autoparallelize(_run_autopara_wrappable, *args, default_autopara_info=default_autopara_info, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hjung/miniforge3/envs/foundation/lib/python3.11/site-packages/wfl/autoparallelize/base.py", line 174, in autoparallelize
    return _autoparallelize_ll(autopara_info, inputs, outputs, func, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hjung/miniforge3/envs/foundation/lib/python3.11/site-packages/wfl/autoparallelize/base.py", line 228, in _autoparallelize_ll
    out = do_remotely(autopara_info, iterable, outputspec, op, rng=rng_op, args=args, kwargs=kwargs,
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hjung/miniforge3/envs/foundation/lib/python3.11/site-packages/wfl/autoparallelize/remote.py", line 89, in do_remotely
    xprs.append(ExPyRe(name=job_name, pre_run_commands=remote_info.pre_cmds, post_run_commands=remote_info.post_cmds,
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hjung/miniforge3/envs/foundation/lib/python3.11/site-packages/expyre/func.py", line 296, in __init__
    config.db.add(self.id, name=name, from_dir=str(self.stage_dir), status=self.status)
  File "/home/hjung/miniforge3/envs/foundation/lib/python3.11/site-packages/expyre/jobsdb.py", line 114, in add
    self._execute(f'INSERT into jobs(id, name, from_dir, status, system, remote_id, remote_status, creation_time, status_time) '
  File "/home/hjung/miniforge3/envs/foundation/lib/python3.11/site-packages/expyre/jobsdb.py", line 44, in _execute
    res = self.db.execute(cmd)
          ^^^^^^^^^^^^^^^^^^^^
sqlite3.OperationalError: attempt to write a readonly database

bernstei commented 5 months ago

We used to have problems like that, but I haven't seen them in years now. It probably has to do with how sqlite locks the jobs db file so that multiple scripts trying to access it don't interfere. Could be something weird about your filesystem, could be that your script is getting killed while it's trying to modify the job status. I'd be interested to see the output of ls -al ~/.expyre next time this happens.
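The same information as `ls -al ~/.expyre` can be collected from Python. The following is only a minimal sketch, not part of wfl or expyre: the function name `diagnose_sqlite_db` is hypothetical, and the path `~/.expyre/jobs.db` is an assumption based on the file and directory names mentioned in this thread. It checks the usual filesystem conditions under which SQLite reports "attempt to write a readonly database": the database file is not writable, its directory is not writable (SQLite creates journal files next to the database), or journal files were left behind by an interrupted write.

```python
import os
from pathlib import Path


def diagnose_sqlite_db(db_path):
    """Report filesystem conditions that can make SQLite refuse writes.

    Hypothetical helper for illustration; it only inspects the filesystem
    and never opens or modifies the database.
    """
    db_path = Path(db_path).expanduser()
    parent = db_path.parent

    # Files such as jobs.db-journal, jobs.db-wal or jobs.db-shm sit next to
    # the database and may remain after a script is killed mid-write.
    leftovers = []
    if parent.is_dir():
        leftovers = sorted(p.name for p in parent.glob(db_path.name + "-*"))

    return {
        "db_exists": db_path.is_file(),
        # SQLite needs write permission on the database file itself ...
        "db_writable": os.access(db_path, os.W_OK),
        # ... and on its directory, where journal files are created.
        "dir_writable": os.access(parent, os.W_OK),
        "leftover_files": leftovers,
    }


if __name__ == "__main__":
    # Assumed location of the ExPyRe jobs database (see the note above).
    for key, value in diagnose_sqlite_db("~/.expyre/jobs.db").items():
        print(f"{key}: {value}")
```

If `db_writable` or `dir_writable` is false, or `leftover_files` is not empty right after the error appears, that would point to a permissions or interrupted-write problem rather than to wfl itself.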

Instead of deleting, you can also try

xpr db_unlock

(assuming you installed wfl with pip or python -m pip, and your path is set up correctly).

jungsdao commented 5 months ago

I'll check the output of that command the next time I encounter this error.

bernstei commented 4 months ago

@jungsdao Can we close this one?

jungsdao commented 4 months ago

Yeah, I haven't encountered this error for a while, so this can be closed.