Structure optimization fails after first step

mieand commented 7 years ago

Hi,

I have an issue running ase-espresso on the SuperMUC system. It appeared recently without any apparent change to my QE, ase or ase-espresso installations, so it may be due to software updates on SuperMUC. Python might have changed. I am currently using python/2.7_anaconda_nompi.

After running the first step of a structural relaxation, no more output is written to log, but the job doesn't fail and no error messages are written to the output. I was using a rather old version of ase (3.8.1) and therefore tried to update to the newest version (3.13.0). The same problem occurs, but in this case I do get an error message.

I don't think the problem is related to my QE installation (I am using espresso.5.1.r11289.pybeef), since structure optimization works fine without the ase-espresso interface.

I attach the submit script (temp123456.job), the job script (qn_job.py), the QE output (log) as well as the error and output messages for the attempt using ase v. 3.13.0.

Does anyone know why this happens?

ase-espresso_issue_files_2017.zip

lmmentel commented 7 years ago

Hi,

The problem with your update to ase 3.13.0 is that it expects the espresso calculator to have a parameters attribute that gets written to the trajectory file every time the structure gets dumped into a .traj file. ase-espresso is not up to date with ase 3.13.0 and this is precisely why you see the error in your err file.

I'm working on a fork of ase-espresso that tries to keep up with the developments of ase and it works with the current version 3.13.0 and python 3.6. It's pretty easy to install so you can give it a go if you want, but make sure to go through the README first.

Cheers

mieand commented 7 years ago

Hi, Thanks for this suggestion. Your fork does indeed solve the issue regarding the failure during structure optimization. However, I am missing some information on how to configure your interface to run on SuperMUC. SuperMUC uses the LoadLeveler scheduler: https://www.lrz.de/services/compute/supermuc/loadleveler/ which doesn't seem to be covered by the SiteConfig class. When I submit a job it currently runs only on a single core of the node. Where is this behavior controlled?

lmmentel commented 7 years ago

You're right, SiteConfig does not support LoadLeveler yet simply because I don't have access to an IBM cluster. However this is something that can easily be fixed if you have your espsite.py. You could add a method to SiteConfig analogous to set_slurm_env and set_pbs_env which would extract the execution parameters for running QE.

It would be great if you could submit a pull request with a LL patch for SiteConfig. If you need some help with that, create a new issue and paste your espsite.py, then I can look into it.
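As a rough starting point, such a method could collect the host list from the environment in the same spirit as set_slurm_env and set_pbs_env. The sketch below is only an illustration: the function name loadleveler_env and the returned keys are hypothetical and not part of SiteConfig, and it assumes that LoadLeveler exports LOADL_PROCESSOR_LIST with one host name per task and LOADL_STEP_ID with the job step identifier. Please check the LRZ documentation for what is actually set on SuperMUC (for large jobs the host list may only be available through a host file).

```python
import os


def loadleveler_env(environ=None):
    """Hypothetical helper, not part of ase-espresso.

    Assumes LOADL_PROCESSOR_LIST holds one whitespace-separated host
    name per task and LOADL_STEP_ID holds the job step id.
    Returns None when no LoadLeveler host list is found.
    """
    env = os.environ if environ is None else environ
    hosts = env.get("LOADL_PROCESSOR_LIST", "").split()
    if not hosts:
        return None

    # Unique node names in the order in which they first appear
    nodes = []
    for host in hosts:
        if host not in nodes:
            nodes.append(host)

    return {
        "jobid": env.get("LOADL_STEP_ID"),
        "nodes": nodes,
        "nprocs": len(hosts),  # one list entry per MPI task (assumption)
    }


if __name__ == "__main__":
    fake_env = {"LOADL_PROCESSOR_LIST": "node1 node1 node2 node2",
                "LOADL_STEP_ID": "example.1.0"}
    print(loadleveler_env(fake_env))
```

The values returned here (number of tasks and node names) are the kind of information the launch command for pw.x would need. If only a single core is used, the launcher most likely never receives a task count.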

mieand commented 7 years ago

Hi, Unfortunately, when I succeeded in running my job as a batch job, the issue regarding the failure during structure optimization was there again. So it seems that it's unrelated to the version of ase-espresso used.

vossjo commented 7 years ago

I have just committed the patches we have applied to __init__.py so that the interface can be used with ase 3.13 (including a few more vdW options, etc.). The fact that the optimization doesn't continue after the first step indeed has a different origin. This happens when the communication between pw.x and python doesn't work (the first time this communication has to happen is after the first force calculation). An update on the cluster could have caused this (some installations chop off stdio between mpi connections). It looks like the espsite.py has to be adjusted. You could use a fifo / named-pipe option (there are a few examples in this source distribution) that avoids using stdio for communication (and is still as fast). And thanks to Lukasz for the work on the alternative version of the interface that is more easily installed!
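To illustrate the general idea of named pipes (this is not the code used in ase-espresso or in the espsite.py examples), here is a minimal sketch in which a parent process exchanges data with a child through two fifos instead of stdin/stdout. The function name fifo_roundtrip and the toy child, which just upper-cases its input, are made up for the example. It assumes a POSIX system where os.mkfifo is available.

```python
import os
import subprocess
import sys
import tempfile

# Toy child process: reads everything from one fifo, writes the reply to another
CHILD_CODE = (
    "import sys\n"
    "with open(sys.argv[1]) as src, open(sys.argv[2], 'w') as dst:\n"
    "    dst.write(src.read().upper())\n"
)


def fifo_roundtrip(message):
    """Send `message` to a child via a named pipe and return its reply."""
    workdir = tempfile.mkdtemp()
    to_child = os.path.join(workdir, "to_child")
    from_child = os.path.join(workdir, "from_child")
    os.mkfifo(to_child)
    os.mkfifo(from_child)
    # stdin is detached on purpose: all communication goes through the fifos
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE, to_child, from_child],
        stdin=subprocess.DEVNULL,
    )
    try:
        # Opening a fifo blocks until the other side has opened it as well
        with open(to_child, "w") as out:
            out.write(message)
        with open(from_child) as inp:
            reply = inp.read()
        child.wait()
    finally:
        os.remove(to_child)
        os.remove(from_child)
        os.rmdir(workdir)
    return reply


if __name__ == "__main__":
    print(fifo_roundtrip("forces ready"))
```

Since the fifos are ordinary paths in the file system, this kind of exchange does not depend on how the MPI launcher forwards stdio. Note that the sketch has no timeout, so it would hang if the child never opened its end of the pipes.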

mieand commented 7 years ago

Thanks Johannes, problem solved. There was indeed an issue reading stdin with the default Intel MPI on SuperMUC. Loading a newer version solved the issue.

vossjo / ase-espresso

Structure optimization fails after first step #21