libAtoms / workflow

python workflow toolkit
GNU General Public License v2.0

conflict in compatible ASE version? #291

Closed jungsdao closed 4 months ago

jungsdao commented 5 months ago

I think the following line of generate/optimize.py requires the latest version of ASE ('3.23.0b1'):

from ase.filters import FrechetCellFilter
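For readers who need a script to run under both ASE layouts, here is a minimal sketch of a guarded import. It assumes that older releases such as 3.22.1 provide ExpCellFilter in ase.constraints as the nearest substitute; that is an assumption made for illustration, it is not what wfl does, and the two filters are not numerically identical. first_available is a hypothetical helper name.

```python
import importlib


def first_available(candidates):
    """Return (module_name, attribute) for the first importable candidate, else None."""
    for module_name, attr_name in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # Module (or its parent package) is missing in this environment.
            continue
        if hasattr(module, attr_name):
            return module_name, getattr(module, attr_name)
    return None


# Assumption: newer ASE exposes FrechetCellFilter in ase.filters,
# older ASE exposes ExpCellFilter in ase.constraints.
CELL_FILTER_CANDIDATES = [
    ("ase.filters", "FrechetCellFilter"),
    ("ase.constraints", "ExpCellFilter"),
]

found = first_available(CELL_FILTER_CANDIDATES)
if found is None:
    print("no cell filter found; is ASE installed?")
else:
    print(f"using {found[1].__name__} from {found[0]}")
```

The helper only reports which class is importable; whether the fallback is acceptable for a given relaxation is a separate decision.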

But wfl seems to conflict with espresso.py in ASE '3.23.0b1', showing the following error. Because of this, I had to downgrade only espresso.py (copied from ASE 3.22.1) to make it work. I'm not totally sure this is related to the ASE version, but the error did not occur after downgrading.

Exception: Failed to construct calculator, original attempt's exception was 'No configuration of espresso'
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/u/hjung/conda-envs/mace_env/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/u/hjung/conda-envs/mace_env/lib/python3.9/site-packages/wfl/autoparallelize/pool.py", line 70, in _wrapped_autopara_wrappable
    outputs = op(*u_args, **kwargs)
  File "/u/hjung/conda-envs/mace_env/lib/python3.9/site-packages/wfl/calculators/generic.py", line 80, in _run_autopara_wrappable
    raise ValueError(f"Failed to construct calculator, original attempt's exception was '{calculator_failure_message}'")
ValueError: Failed to construct calculator, original attempt's exception was 'No configuration of espresso'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/raven/ptmp/hjung/GAP/scratch/unkownhost-_home_hjung/run_eval_dft_chunk_0_p3PO4jDEENxODL0w0lYdfsGOwcb77VNFitO-qwTh3mg=_zbpx1t0g/_expyre_script_core.py", line 9, in <module>
    results = function(*args, **kwargs)
  File "/u/hjung/conda-envs/mace_env/lib/python3.9/site-packages/wfl/autoparallelize/pool.py", line 157, in do_in_pool
    for result_group in results:
  File "/u/hjung/conda-envs/mace_env/lib/python3.9/multiprocessing/pool.py", line 870, in next
    raise value
ValueError: Failed to construct calculator, original attempt's exception was 'No configuration of espresso'
bernstei commented 5 months ago

They keep on changing the way DFT calculators get initialized. I was pretty sure it was working with all the different ways Espresso was initialized. How exactly did you install ASE when it wasn't working? The version number isn't sufficient, because they keep making changes without changing the version number, at least in the gitlab version.
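Since the version string alone does not identify a git install, one way to answer this question is to read the direct_url.json record that pip writes for installs from a URL or VCS (PEP 610). The sketch below is only an illustration; install_origin is a hypothetical helper name, and the commit field will be None for ordinary PyPI installs.

```python
import importlib.metadata
import json


def install_origin(dist_name):
    """Return version, source URL and VCS commit recorded by pip, or None if not installed."""
    try:
        dist = importlib.metadata.distribution(dist_name)
    except importlib.metadata.PackageNotFoundError:
        return None
    # direct_url.json exists only for installs from a URL, VCS or local path.
    raw = dist.read_text("direct_url.json")
    origin = json.loads(raw) if raw else {}
    return {
        "version": dist.version,
        "url": origin.get("url"),
        "commit": origin.get("vcs_info", {}).get("commit_id"),
    }


for name in ("ase", "wfl"):
    print(name, install_origin(name))
```

Running this on both the local machine and the cluster gives a commit hash that can be compared directly, rather than two copies of '3.23.0b1'.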

jungsdao commented 5 months ago

The way I installed ASE when it didn't work was : pip install --upgrade git+https://gitlab.com/ase/ase.git@master

bernstei commented 5 months ago

Thanks. Let me see if I can reproduce the problem. I assume you're also using the latest version of wfl ?

jungsdao commented 5 months ago

Yes, I'm also using the latest version of wfl. (v 0.2.0)

bernstei commented 5 months ago

I just tried with the latest ASE master branch (and the latest wfl main branch), and the Espresso-related tests passed. If you clone the wfl repo, you should be able to do (from the cloned directory)

pytest --basetemp ${HOME}/pytest_wfl -rxXs tests/calculators/test_qe.py

after setting the environment variable PYTEST_WFL_ASE_ESPRESSO_COMMAND to the command that runs a serial pw.x (I use mpirun -np 1 pw.x, for example). If that fails, we need to figure out why, since it's passing for me. If it passes, but your real script fails, we should be able to figure out why.
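The same invocation can be assembled from Python, which makes it easy to keep the environment variable and the pytest arguments together. This is a sketch only; build_qe_test_invocation is a hypothetical helper, and the command string is the example given above, not a requirement.

```python
import os


def build_qe_test_invocation(espresso_command, basetemp):
    """Return (argv, env) for running the wfl Espresso tests with a given pw.x command."""
    env = dict(os.environ)
    env["PYTEST_WFL_ASE_ESPRESSO_COMMAND"] = espresso_command
    argv = [
        "pytest",
        "--basetemp", basetemp,
        "-rxXs",
        "tests/calculators/test_qe.py",
    ]
    return argv, env


argv, env = build_qe_test_invocation(
    "mpirun -np 1 pw.x", os.path.expanduser("~/pytest_wfl")
)
print(" ".join(argv))
# To actually run the tests, from the cloned wfl directory:
#   import subprocess
#   subprocess.run(argv, env=env, check=False)
```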

stenczelt commented 5 months ago

The way I installed ASE when it didn't work was : pip install --upgrade git+https://gitlab.com/ase/ase.git@master

This might be best put in the docs, or we could get the ASE devs to finally make a release (3.22 was released in 2021), because if you install according to the wfl docs you hit the same problem even with a plain from wfl.generate.optimize import optimize.

bernstei commented 5 months ago

I'm confused - that git command above did work, or didn't? It looks like the command that should give the latest, which should work.

bernstei commented 5 months ago

@stenczelt As a person who ran into this, where do you think it should be documented so it's most likely to be noticed?

bernstei commented 5 months ago

Top level README.md? Anyplace else? I guess the install command in the docs could, in principle, drag in the older and incompatible ASE (although I was sort of assuming people had their own ASE already installed). I think there's a beta release number - we could require that as the minimum version, which will always fail until they actually have another release, but at least you'll know you have to do it manually.

jungsdao commented 5 months ago

Sorry for the belated reply. I have checked again and found the point where it can be reproduced. This happens when the Quantum ESPRESSO job is submitted to a remote cluster and the ASE version installed on the cluster is 3.23.0b1. I think pytest in the current wfl passed without error probably because it does not submit a remote job and tests only locally. When I downgrade espresso.py on the cluster to an older version (like 3.22.1), I don't get this error.

bernstei commented 5 months ago

If you have the latest wfl and ASE (github master HEAD) on both local and remote machines, then it should definitely work.

bernstei commented 5 months ago

I also thought it should work with the older version, actually, so I'll also check why it's not.

bernstei commented 5 months ago

@jungsdao I just ran the wfl (the latest github version of wfl) pytests with the pip version of ASE (3.22.1), and it passed, and also with the latest gitlab master HEAD (3.23.0b1), and it also passed. I'm not sure why it's not working for you. Is it possible that the wfl version on the remote machine isn't the latest?

jungsdao commented 5 months ago

I have checked again after updating both ASE and wfl to the latest version, but I'm having the same error. When I change espresso.py on the remote cluster to the ASE 3.22.1 version it works, but with ASE 3.23.0b1 it fails.

bernstei commented 5 months ago

I'm not sure what's going on, but I don't see any way for the remote behavior to be different from the local behavior if they're running the same versions of wfl and ase. I guess I'll test it explicitly here.

Can you find the directory where the submitted job ran and grab all the output and error files and upload them here? I'm hoping there's more info on where exactly it's having a problem.

I wonder if something is messed up with the PYTHONPATH for the remote job, and it's not loading the wfl version you intend it to.
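One way to check this is to put a few diagnostic lines at the top of the script that the remote job runs and compare the output with a local run. The following is a minimal sketch, not part of wfl; environment_report is a hypothetical helper, and it locates the packages without importing them.

```python
import importlib.util
import os
import sys


def environment_report(module_names=("wfl", "ase")):
    """Collect the interpreter, PYTHONPATH and on-disk location of each module."""
    report = {
        "executable": sys.executable,
        "PYTHONPATH": os.environ.get("PYTHONPATH", ""),
    }
    for name in module_names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            spec = None
        # spec.origin is the file the module would be loaded from.
        report[name] = spec.origin if spec is not None else "not found"
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```

If the wfl or ase path printed in the job's stdout differs from the one seen on the login node, the job is picking up a different installation.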

jungsdao commented 5 months ago

These are the related files in the submitted job directory. I'm not quite sure what the source of the error is. It seems to be launching the intended version of wfl correctly.

failed.tar.gz

bernstei commented 5 months ago

Thanks. I might need to give you a version that can produce better error information. I'll investigate some things here first.

bernstei commented 5 months ago

I just added a test that runs a remote Espresso job, and it runs fine (#294). I'll look a bit more, but I think there has to be some sort of version issue with your remote jobs. It's pretty easy for the remote job to end up with different paths, PYTHONPATH, etc. Can you describe your setup in more detail? Is it really a remote job, or is it just a queued job and the main workflow running on the login node of the HPC?

Can you post the workflow script (or, ideally, a simpler script that shows the same problem) here?

bernstei commented 5 months ago

If you can install wfl from the espresso_remote_job_test branch (instead of main) that version should provide us with better error information for the way your code is failing.

stenczelt commented 5 months ago

@stenczelt As a person who ran into this, where do you think it should be documented so it's most likely to be noticed?

A notice in the top-level README is a good idea. I've actually looked at the documentation this time, so maybe a paragraph or one more code block in the Installation section would be useful: https://libatoms.github.io/workflow/#installation

bernstei commented 5 months ago

@stenczelt please take a look at the changes in #294 . I'm not sure there's an easy way to see the formatted docs (the README you can see by switching to that branch), but you can look at the .rst source file changes.

bernstei commented 4 months ago

@jungsdao Have you had a chance to test the espresso_remote_job_test branch? It should give more error information if you're still having this problem.

jungsdao commented 4 months ago

I have tried the espresso_remote_job_test branch on the remote cluster and it gives the following error (from _expyre_job_error).

Exception: Failed to construct calculator, original attempt's exception was '(exc)
Traceback (most recent call last):
  File "/u/hjung/conda-envs/wfl_test/lib/python3.9/site-packages/wfl/calculators/generic.py", line 49, in _run_autopara_wrappable
    calculator_default = construct_calculator_picklesafe(calculator)
  File "/u/hjung/conda-envs/wfl_test/lib/python3.9/site-packages/wfl/utils/parallel.py", line 51, in construct_calculator_picklesafe
    return calculator[0](*c_args, **c_kwargs)
  File "/u/hjung/conda-envs/wfl_test/lib/python3.9/site-packages/wfl/calculators/espresso.py", line 88, in __init__
    super().__init__(keep_files=keep_files, rundir_prefix=rundir_prefix,
  File "/u/hjung/conda-envs/wfl_test/lib/python3.9/site-packages/wfl/calculators/wfl_fileio_calculator.py", line 48, in __init__
    super().__init__(**kwargs)
  File "/u/hjung/conda-envs/wfl_test/lib/python3.9/site-packages/ase/calculators/espresso.py", line 216, in __init__
    super().__init__(
  File "/u/hjung/conda-envs/wfl_test/lib/python3.9/site-packages/ase/calculators/genericfileio.py", line 336, in __init__
    raise EnvironmentError(f'No configuration of {template.name}')
ase.calculators.calculator.EnvironmentError: No configuration of espresso
'
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/u/hjung/conda-envs/wfl_test/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/u/hjung/conda-envs/wfl_test/lib/python3.9/site-packages/wfl/autoparallelize/pool.py", line 70, in _wrapped_autopara_wrappable
    outputs = op(*u_args, **kwargs)
  File "/u/hjung/conda-envs/wfl_test/lib/python3.9/site-packages/wfl/calculators/generic.py", line 86, in _run_autopara_wrappable
    raise ValueError(f"Failed to construct calculator, original attempt's exception was '{calculator_failure_message}'")
ValueError: Failed to construct calculator, original attempt's exception was '(exc)
Traceback (most recent call last):
  File "/u/hjung/conda-envs/wfl_test/lib/python3.9/site-packages/wfl/calculators/generic.py", line 49, in _run_autopara_wrappable
    calculator_default = construct_calculator_picklesafe(calculator)
  File "/u/hjung/conda-envs/wfl_test/lib/python3.9/site-packages/wfl/utils/parallel.py", line 51, in construct_calculator_picklesafe
    return calculator[0](*c_args, **c_kwargs)
  File "/u/hjung/conda-envs/wfl_test/lib/python3.9/site-packages/wfl/calculators/espresso.py", line 88, in __init__
    super().__init__(keep_files=keep_files, rundir_prefix=rundir_prefix,
  File "/u/hjung/conda-envs/wfl_test/lib/python3.9/site-packages/wfl/calculators/wfl_fileio_calculator.py", line 48, in __init__
    super().__init__(**kwargs)
  File "/u/hjung/conda-envs/wfl_test/lib/python3.9/site-packages/ase/calculators/espresso.py", line 216, in __init__
    super().__init__(
  File "/u/hjung/conda-envs/wfl_test/lib/python3.9/site-packages/ase/calculators/genericfileio.py", line 336, in __init__
    raise EnvironmentError(f'No configuration of {template.name}')
ase.calculators.calculator.EnvironmentError: No configuration of espresso
'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/raven/ptmp/hjung/GAP/scratch/unkownhost-_home_hjung/run_eval_dft_chunk_0_dfzhb4Sm89qkJVMNcICoHzKH9gGfe3KPYGE3Vecnk_8=_c5vhqbuu/_expyre_script_core.py", line 9, in <module>
    results = function(*args, **kwargs)
  File "/u/hjung/conda-envs/wfl_test/lib/python3.9/site-packages/wfl/autoparallelize/pool.py", line 157, in do_in_pool
    for result_group in results:
  File "/u/hjung/conda-envs/wfl_test/lib/python3.9/multiprocessing/pool.py", line 870, in next
    raise value
ValueError: Failed to construct calculator, original attempt's exception was '(exc)
Traceback (most recent call last):
  File "/u/hjung/conda-envs/wfl_test/lib/python3.9/site-packages/wfl/calculators/generic.py", line 49, in _run_autopara_wrappable
    calculator_default = construct_calculator_picklesafe(calculator)
  File "/u/hjung/conda-envs/wfl_test/lib/python3.9/site-packages/wfl/utils/parallel.py", line 51, in construct_calculator_picklesafe
    return calculator[0](*c_args, **c_kwargs)
  File "/u/hjung/conda-envs/wfl_test/lib/python3.9/site-packages/wfl/calculators/espresso.py", line 88, in __init__
    super().__init__(keep_files=keep_files, rundir_prefix=rundir_prefix,
  File "/u/hjung/conda-envs/wfl_test/lib/python3.9/site-packages/wfl/calculators/wfl_fileio_calculator.py", line 48, in __init__
    super().__init__(**kwargs)
  File "/u/hjung/conda-envs/wfl_test/lib/python3.9/site-packages/ase/calculators/espresso.py", line 216, in __init__
    super().__init__(
  File "/u/hjung/conda-envs/wfl_test/lib/python3.9/site-packages/ase/calculators/genericfileio.py", line 336, in __init__
    raise EnvironmentError(f'No configuration of {template.name}')
ase.calculators.calculator.EnvironmentError: No configuration of espresso
'
bernstei commented 4 months ago

How are you passing the pw.x command to the calculator constructor?

And can you confirm that you can manually create an Espresso calculator (outside of wfl) using the arguments (positional or kwargs) you're passing the calculator constructor you're trying to use in wfl?

[edited: the ASE Espresso calculator switched from a command keyword arg to an EspressoProfile, which the wrapper reconstructs from the calculator_exec argument. It's possible that if you're passing a command but the wrapper is detecting that you have a version that supports the profile, it's not handling that combination well.]

bernstei commented 4 months ago

@jungsdao If you can answer the questions in my previous post, we can hopefully fix this. I suspect a conflict between the different ways of passing the executable to Espresso.

jungsdao commented 4 months ago

I used to pass the pw.x command via an environment variable in the Slurm submission script:

export ASE_ESPRESSO_COMMAND='srun /u/hjung/Softwares/QE/qe-7.0/bin/pw.x -in PREFIX.pwi > PREFIX.pwo'

When I tried to run the ASE Espresso calculator outside of wfl, I got the following error complaining about the profile:

Traceback (most recent call last):
  File "/raven/u/hjung/test/test.py", line 57, in <module>
    calc = Espresso(command=command, input_data=input_data, kpts=(4, 4, 1), pseudopotentials=psp)
  File "/u/hjung/conda-envs/mace_env/lib/python3.9/site-packages/ase/calculators/espresso.py", line 201, in __init__
    raise RuntimeError(compatibility_msg)
RuntimeError: Espresso calculator is being restructured.  Please use e.g. Espresso(profile=EspressoProfile(argv=['mpiexec', 'pw.x'])) to customize command-line arguments.

As you explained, this must be related to the new profile argument required by the new ASE Espresso calculator.

bernstei commented 4 months ago

OK. You should be able to get it to work by passing a new argument to the wfl.calculators.Espresso wrapper: calculator_exec = "srun /u/hjung/Softwares/QE/qe-7.0/bin/pw.x" (without the PREFIX stuff).

I'll also think about how to get it to work best with both the old and new syntax, if possible, but I think passing a command via the env var is more or less deprecated.
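For anyone migrating an existing ASE_ESPRESSO_COMMAND value, the conversion to the shorter executable string (and to the argv list that the ASE error message asks for) can be sketched as below. legacy_command_to_argv is a hypothetical helper written for illustration; it is not how the wfl wrapper does it, and it only recognises the PREFIX-style input/output pieces shown in this thread.

```python
import shlex

# Flags and redirections that may precede a PREFIX.pwi / PREFIX.pwo token.
IO_MARKERS = {"-in", "-i", "-inp", "-input", "<", ">"}


def legacy_command_to_argv(command):
    """Split a legacy command string and drop the PREFIX input/output parts."""
    argv = []
    for token in shlex.split(command):
        if "PREFIX" in token:
            # Also drop the flag or redirection that introduced this file name.
            if argv and argv[-1] in IO_MARKERS:
                argv.pop()
            continue
        argv.append(token)
    return argv


legacy = "srun /u/hjung/Softwares/QE/qe-7.0/bin/pw.x -in PREFIX.pwi > PREFIX.pwo"
argv = legacy_command_to_argv(legacy)
print(argv)            # list form, as in EspressoProfile(argv=[...])
print(" ".join(argv))  # string form, as passed to the wrapper
```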

jungsdao commented 4 months ago

Just confirmed that adding "calculator_exec" : "srun /u/hjung/Softwares/QE/qe-7.0/bin/pw.x" to the QE kwargs does not cause the previous error.
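For context, the traceback above shows the calculator being built with calculator[0](*c_args, **c_kwargs), which suggests a (constructor, args, kwargs) tuple. The sketch below mimics that pattern with a stand-in class so it runs without wfl or ASE; StandInCalculator and construct_from_spec are hypothetical names, and the kpts value is simply the one from the test script earlier in the thread.

```python
class StandInCalculator:
    """Hypothetical stand-in for the wfl Espresso wrapper, for illustration only."""

    def __init__(self, calculator_exec=None, **kwargs):
        self.calculator_exec = calculator_exec
        self.kwargs = kwargs


def construct_from_spec(spec):
    """Build a calculator from a (constructor, args, kwargs) tuple."""
    constructor, args, kwargs = spec
    return constructor(*args, **kwargs)


qe_kwargs = {
    # Executable only; no "-in PREFIX.pwi > PREFIX.pwo" part.
    "calculator_exec": "srun /u/hjung/Softwares/QE/qe-7.0/bin/pw.x",
    "kpts": (4, 4, 1),
}

calc = construct_from_spec((StandInCalculator, [], qe_kwargs))
print(calc.calculator_exec)
```

Because the executable travels inside the kwargs, it reaches the remote job together with the rest of the calculator settings instead of depending on that job's environment variables.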

bernstei commented 4 months ago

Just confirmed that adding "calculator_exec" : "srun /u/hjung/Softwares/QE/qe-7.0/bin/pw.x" to the QE kwargs does not cause the previous error.

OK - I'll see what I can do to make things internally consistent, and then merge the PR

bernstei commented 4 months ago

I think I have a solution that will at least give clearer error messages. I'll merge as soon as I push and tests pass.

bernstei commented 4 months ago

closed by #294