siesta-project / aiida_siesta_plugin

Source code for the AiiDA-Siesta package (plugin and workflows). See wiki

Using BaseIterator for the EOS workflow #63

Closed. pfebrer closed this pull request 4 years ago.

pfebrer commented 4 years ago

I adapted the EOS workflow to work with BaseIterator.

By doing this, the main improvement exposed to users at the moment is that you can choose the batch size, but there may be others as BaseIterator gains functionality.

During the process, I've discovered things that might be important for making BaseIterator useful to developers creating their own workchains, i.e. for doing something more than:

class MyProcessIterator(BaseIterator):
    _process_class = MyProcess

One big change is that you can create iteration inputs using cls.iteration_input(). If I create an iteration input called "scales", for example, I allow users to do:

run(Process, scales=[1,2,3,4])

instead of

run(Process, iterate_over={"scales":[1,2,3,4]})

as they are exactly equivalent. This can help prevent very deep nesting when you start to nest iterators, but more importantly, since an iteration input is a regular AiiDA input, it can be made required (as we need for the EOS chain, for example).

There's an extensive explanation in the docstring of cls.iteration_input.
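
For concreteness, here is a minimal sketch of what declaring such an input looks like (the exact keyword arguments may differ slightly; see the docstring):

class ScalesIterator(BaseIterator):
    _process_class = MyProcess

    @classmethod
    def define(cls, spec):
        super().define(spec)
        # "scales" becomes a regular input port that is also understood as
        # an iteration parameter, so it can, for instance, be marked required.
        cls.iteration_input('scales', required=True)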

Let me know what you think!

NOTE: I commented out the part where scale_to_vol is executed because it is not working (even in the old version). It raises an error like Molecule has no attribute scale_lattice. Maybe there has been some change in pymatgen. Could you look into it? Thanks!

bosonie commented 4 years ago

Can you please also push an example_eos.py so I can better see your API? Thanks! Don't worry about tests and so on; I'll look at it. I mean, I guess we now have the possibility to select the scale parameters as input?

pfebrer commented 4 years ago

I'm away from home now; I will push it this afternoon. But the old example should run fine, it works exactly the same. It's just that, if you want, you can use the "scales" input to customize the scales (which have a default), and you also have the batch_size input available.


bosonie commented 4 years ago

Cool, I understand. No need to push another example then; I'll play around myself.

bosonie commented 4 years ago

Ok, so basically the point here is that in our iterator implementation the input iterate_over was always required, while here the iteration parameter is selected by the user. However, there are situations where:

  1. We don't want to let the user choose, because it is obvious from the purpose of the workchain (a child class of BaseIterator) what we want to iterate over.
  2. We want to facilitate passing the quantity to iterate over by excluding it from the iterate_over dictionary and passing the list directly as an input port.

Just one observation: I guess this approach risks creating confusion if we want to iterate over an input port of _process_class. I mean, if iteration_input(name) is called where name is also an exposed input of _process_class, this would result in two input ports with the same name. The developer should be forced to exclude it from the exposed inputs of _process_class. The rest is good!

Regarding scale_to_vol, there is an easy fix: just substitute get_pymatgen with get_pymatgen_structure. Can you do that for me and reinstate the call to the function? Thanks!!!
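
For reference, the relevant change would be something along these lines (a sketch; structure and target_volume are placeholder names):

# get_pymatgen() may return a pymatgen Molecule, which has no scale_lattice;
# get_pymatgen_structure() always returns a periodic Structure, which does.
pmg_structure = structure.get_pymatgen_structure()
pmg_structure.scale_lattice(target_volume)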

pfebrer commented 4 years ago

I mean, if iteration_input(name) is called where name is also an exposed input of _process_class, this would result in two input ports with the same name. The developer should be forced to exclude it from the exposed inputs of _process_class.

So maybe raise an error if you are trying to create an iteration_input with a name that already exists? Isn't this already managed by AiiDA? I don't know, I'm asking, but it seems odd that a workchain lets you specify two inputs with the same name.

I thought about making all the inputs of _process_class iterable, in the sense that if a Python list is passed for an input, it will be interpreted as an attempt to iterate over it, but I'm not sure if that could interfere with some functionality. What do you think? To be clear, I'm saying something like:

run(SiestaIterator, structure=[struct1, struct2])

would automatically be interpreted as wanting to iterate over the structure. Maybe it could be optional and off by default, but it could be useful to prevent nesting, and it seems to me that not many AiiDA inputs accept Python lists anyway, so it's safe to do.

Another approach could be to explicitly declare which inputs you are trying to iterate over, but I don't know. Something like:

run(SiestaIterator, structures=[struct1, struct2], iterate_over=["structures"])

Thoughts?

bosonie commented 4 years ago

So maybe raise an error if you are trying to create an iteration_input with a name that already exists? Isn't this already managed by AiiDA? I don't know, I'm asking, but it seems odd that a workchain lets you specify two inputs with the same name.

I think that AiiDA just reads sequentially and keeps the last specification. There is no error, and this is the desired behavior! For instance, you want to be able to redefine a port in a subclass. I would say we should not support creating an iteration_input with a name that already exists; raising an error is fine. I guess this would already exclude the case where namespaces are defined. In fact, I would allow:

run(SiestaStructureIterator, structure=[struct1, struct2], siesta_inputs=inputs)

but not what you suggested:

run(SiestaIterator, structure=[struct1, struct2], **inputs)

I don't see any need for this. I think that our original API is perfectly fine! The use of iterate_over should remain the correct way to submit calculations, in my opinion; the use of iteration_input is only an additional possibility for peculiar cases.

pfebrer commented 4 years ago

As you proposed, we now raise an error if someone tries to overwrite an input with an iteration_input.

Also, scale_to_vol is fixed.

During this, I also thought that the SiestaIterator itself could benefit from these iterations over scales. After thinking for a bit, I realised that any Iterator that accepts a structure as an input can benefit from it. Therefore, I moved the structure modification to utils.iterate_params (not its definitive location) to start building a more general concept that lets any iterator benefit from code that can be shared.

You can see then that, when defining SIESTA_ITERATION_PARAMS, I added the structure modification parameters (right before the fdf ones), so the EOS workchain can use them.

Any suggestion on whether you think this approach is useful or on how to actually implement it would be very welcome (this is just a first draft to show that it can be decoupled).

Also, I added the possibility of not creating the iterate_over input port by setting cls._iterate_over_port = False. I use it in the EOS workchain, since we actually don't want people to iterate over anything other than structure_scales (I guess the equation-of-state fit only makes sense when everything else is fixed).
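
For illustration, the relevant part looks roughly like this (a sketch, with a simplified class name):

class EOSWorkChain(BaseIterator):  # simplified name for the EOS workchain
    _process_class = SiestaBaseWorkChain
    # Don't create the generic iterate_over port: the equation-of-state fit
    # only makes sense when sweeping structure_scales with all else fixed.
    _iterate_over_port = False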

bosonie commented 4 years ago

I don't understand your decision to make the EOS a child of SiestaIterator. It is useless for the EOS to know all the _params_lookup entries; they won't be used anyway. The way it was implemented before was much better and reusable by other plugins simply by changing _process_class.

I also understand the need to make a collection of _parsing_func that can be used across different iterators, but the way you implemented it seems unnecessarily complicated to me.

Honestly, I would revert to the previous commit and ship the functions that modify the structure with the EOS, exactly the same way we ship the _parsing_func of SiestaIterator in the file containing the class. Then you can create a new PR where you put the iterate_params file and where we can discuss a way to make a collection of _parsing_func, divide them between SIESTA-specific and more general ones, and so on. Ok?

The _iterate_over_port = False option and the scale_to_vol fix are very welcome, thanks!

pfebrer commented 4 years ago

I don't understand your decision to make the EOS a child of SiestaIterator. It is useless for the EOS to know all the _params_lookup entries; they won't be used anyway. The way it was implemented before was much better and reusable by other plugins simply by changing _process_class.

Well, I was thinking about this for a while, and I thought it was not that easy to reuse the workchain. But yeah, I understand that you don't need all the parameters. However, they don't interfere at all, because you can simply not pass them. So in the end it is the same to do:

class GenericEOS(BaseIterator):
    ...

class SiestaEOS(GenericEOS):
    _process_class = SiestaBaseWorkChain
    _params_lookup = ...  # define the parameter parsing
    _expose_inputs_kwargs = {'exclude': ('metadata',)}

and

class GenericEOS(BaseIterator):  # or some other base?
    ...

class SiestaEOS(SiestaIterator):
    pass

And you can reuse the EOS workchain in both cases. What's the main difference that you see? The advantage that I saw in my case was that I didn't have to define how to parse structure_scales twice (in SiestaIterator and the EOS chain) and also I didn't have to specify _expose_inputs_kwargs = {'exclude': ('metadata',)} again.

I also understand the need to make a collection of _parsing_func that can be used across different iterators, but the way you implemented seems to me unnecessary complicated.

That's why I asked for ideas to make it simpler :)

Honestly, I would revert to the previous commit and ship the functions that modify the structure with the EOS, exactly the same way we ship the _parsing_func of SiestaIterator in the file containing the class. Then you can create a new PR where you put the iterate_params file and where we can discuss a way to make a collection of _parsing_func, divide them between SIESTA-specific and more general ones, and so on. Ok?

Even if I make the EOS not inherit from SiestaIterator, I think it is good that someone using the SiestaIterator can also use these functions (I mean, if we have already written the code, what's the harm?), so why can't we still keep them in some external file?

The main goal of this PR for me was not to update the EOS workchain, but, while doing so, to identify weak points in the functionality of BaseIterator and the general infrastructure for iterators, and to make it more robust. So in my opinion the discussion fits in this PR; otherwise we will close it without having fully achieved the changes required. But if you insist, I will open a new one.

Thanks for the feedback!

bosonie commented 4 years ago

I understand this. I was too optimistic in saying that changing _process_class was enough, but the list of changes to make is limited compared to the need to define <YourPlugin>Iterator, the _params_lookup and so on.

In my opinion, in your approach, you load a lot of functions that you do not use. My idea is to make a collection of parsing_func, but not to have a _params_lookup that contains all of them. The SiestaIterator cannot be a universal iterator that iterates over any possible parameter in this universe. We have to limit what SiestaIterator can do. Then, case by case, users can create their own iterator and put in _params_lookup what they are interested in. For instance, in SiestaIterator I'm interested in iterating over the main fdf parameters, in the EosIterator over the scale factors. If I want both at the same time, I can create a new Iterator. Continuing to add functionality to SiestaIterator adds confusion and is not sustainable, in my opinion.

And you can reuse the EOS workchain in both cases. What's the main difference that you see? The advantage that I saw in my case was that I didn't have to define how to parse structure_scales twice (in SiestaIterator and the EOS chain) and also I didn't have to specify _expose_inputs_kwargs = {'exclude': ('metadata',)} again.

Maybe for my EoS I want to expose the metadata and use a namespace.

The main goal of this PR for me was not to update the EOS workchain, but, while doing so, to identify weak points in the functionality of BaseIterator and the general infrastructure for iterators, and to make it more robust. So in my opinion the discussion fits in this PR; otherwise we will close it without having fully achieved the changes required. But if you insist, I will open a new one.

The only improvement I see is to have a place to collect the parsing_func, but I need to think about the best way. Again, I think the AiiDA core developers could help with this.

bosonie commented 4 years ago

The only improvement I see is to have a place to collect the parsing_func, but I need to think about the best way. Again, I think the AiiDA core developers could help with this.

Maybe a @calcfunction factory??
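
For concreteness, such a factory could look roughly like this (just a sketch; structure_modifier_factory is a made-up name, and it only relies on AiiDA's @calcfunction decorator and the StructureData/pymatgen round trip):

from aiida.engine import calcfunction
from aiida.orm import StructureData

def structure_modifier_factory(method_name):
    """Build a calcfunction that applies the named pymatgen Structure
    method (e.g. 'scale_lattice') to an AiiDA StructureData."""
    @calcfunction
    def modify(structure, value):
        pmg = structure.get_pymatgen_structure()
        getattr(pmg, method_name)(value.value)  # mutates pmg in place
        return StructureData(pymatgen=pmg)
    return modify

A scale_to_vol-like function would then be structure_modifier_factory('scale_lattice'), called with the target volume as a Float node.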

pfebrer commented 4 years ago

In my opinion, in your approach, you load a lot of functions that you do not use.

Yes, but is that a problem? I mean, it is not really loading anything. The _params_lookup variable lives in SiestaIterator and is not copied when you inherit from it; there's always only one instance of it. So, in fact, you are using less memory than if you created a specific _params_lookup for your new iterator.
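
This is just how Python class attributes work: a subclass that doesn't override _params_lookup resolves it on the parent, so both names refer to the very same object:

class MyEOS(SiestaIterator):
    pass

# Attribute lookup falls through to SiestaIterator; nothing is copied.
assert MyEOS._params_lookup is SiestaIterator._params_lookup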

The SiestaIterator cannot be a universal iterator that iterates over any possible parameter in this universe.

:laughing:

Why not? (This is a serious question.) If something can be iterated over in SIESTA, why not support it in the SiestaIterator? I think that by not putting it there, you are limiting the possibilities. Someone (e.g. me) could feel annoyed that the code to iterate over two different parameters has been written but you cannot use them together, because they are not implemented in the same iterator. You would have to figure out how they have been implemented and create a new iterator that understands how to parse both (and do this each time this situation happens).

So in my opinion

For instance, in SiestaIterator I'm interested in iterating over the main fdf parameters, in the EosIterator over the scale factors.

we should not assume what future users will use SiestaIterator for. For me, this sentence should be: "In SiestaIterator I'm free to iterate over whatever I want; in the EosIterator it only makes sense to iterate over scales, so the rest of the parameters are blocked."

I agree that it is not straightforward to make it maintainable; there needs to be good organization, and it has to be thought out ahead of time. For example, here's what I did for structure modifications: if there is an input that accepts a structure and it's called struct, then all the parameters that are modifications of this structure will be struct_<parameter>. I believe this is perfectly maintainable, as the "namespace" used by these parameters is perfectly defined. Even if there are two input structures, struct and reference, you could still iterate over modifications of both structures at the same time without ambiguity.
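
For instance, something like this would be unambiguous (a made-up illustration of the naming convention, not actually implemented parameters):

run(SomeIterator, iterate_over={
    # "struct_scale" modifies the "struct" input,
    # "reference_scale" modifies the "reference" input
    'struct_scale': [0.96, 1.0, 1.04],
    'reference_scale': [0.96, 1.0, 1.04],
})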

Maybe a @calcfunction factory??

Yes! But I think it's not enough. I think it's powerful to be able to create "parsing groups" like the structure modifications one; otherwise you need to select the calcfunctions one by one. Also, it is not straightforward, because some general parsing is needed before using the actual @calcfunction (see modify_structure in the iterate_params file).

pfebrer commented 4 years ago

Maybe for my EoS I want to expose the metadata and use a namespace.

Ok, in that case you redefine _expose_inputs_kwargs, but you'd have to do that with your approach as well, wouldn't you? :)

bosonie commented 4 years ago

I find it very difficult to maintain something like that. Consider, for instance, the fact that soon we will add support for Lua files. In your idea, every possible functionality of Lua should also become a parameter to iterate over. It is, in general, very risky to say that everything is supported when, potentially, there is an infinite spectrum of possibilities.

Anyway, I guess the main point here is that we do not agree. I foresee a long discussion and an even longer development if we go in your direction. For this reason, I believe it is better to close this PR, reverting to the previous commit, and open a new one. I would like to have the EoS as an example of the use of an iterator and then rethink the rest.

pfebrer commented 4 years ago

I moved the structure modifications back to only the EOS workchain, although I honestly think that by doing this we are missing some potential for the SiestaIterator :).

I also added a first draft of the automatic description of the parameters in an iterator, as we discussed. You can check it out by doing:

from aiida_siesta.workflows.iterate import SiestaIterator

print(SiestaIterator.parameters_help)

Cheers!