Allow models to request additional input data from the engine

bananenpampe commented 1 month ago

Many ASE calculators use additional information from the ase.Atoms object, such as total charge, magnetic moment and other arbitrary properties from the .info and .arrays dicts. There should be a standardized way to define which additional properties are read from the ase.Atoms object in atomistic.ase_calculator.MetatensorCalculator and then converted into a TensorBlock that is passed along with the systems object.

For simplicity, I propose two generic options that read properties from the .info and .arrays dicts respectively, plus predefined options for get_initial_magnetic_moments() and get_initial_charges().

In principle this could be optional, so that if the field/info does not exist, nothing is parsed instead of raising an Exception. Should this be handled on the ase.calculator side, or on the specific model side?

Maybe we could provide an additional parse_properties kwarg in ase_calculator.MetatensorCalculator and then handle the extraction logic here: https://github.com/lab-cosmo/metatensor/blob/b34c0f3757b95cd85e04ce8cf468499e06a5a326/python/metatensor-torch/metatensor/torch/atomistic/ase_calculator.py#L211
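
A minimal sketch of what such extraction logic might look like, assuming a hypothetical parse_properties mapping from a requested name to a (source, key) pair. The helper extract_extra_data, the source labels and the strict flag are illustrative and not part of metatensor; only .info, .arrays, get_initial_charges() and get_initial_magnetic_moments() are existing ase.Atoms members. A small stand-in class replaces ase.Atoms so the sketch runs without ASE installed, and the conversion to a TensorBlock is left out.

```python
from typing import Any, Dict, Optional, Tuple


class FakeAtoms:
    """Stand-in exposing only the parts of the ase.Atoms interface used below."""

    def __init__(self, n_atoms, info=None, arrays=None):
        self.n_atoms = n_atoms
        self.info = dict(info or {})
        self.arrays = dict(arrays or {})

    def get_initial_charges(self):
        # Like ASE, fall back to zeros when nothing was set.
        return self.arrays.get("initial_charges", [0.0] * self.n_atoms)

    def get_initial_magnetic_moments(self):
        return self.arrays.get("initial_magmoms", [0.0] * self.n_atoms)


def extract_extra_data(
    atoms,
    parse_properties: Dict[str, Tuple[str, Optional[str]]],
    strict: bool = False,
) -> Dict[str, Any]:
    """Collect the requested properties (hypothetical helper).

    Missing .info/.arrays entries are skipped unless strict=True.
    """
    extracted = {}
    for name, (source, key) in parse_properties.items():
        if source == "initial_charges":
            extracted[name] = atoms.get_initial_charges()
        elif source == "initial_magmoms":
            extracted[name] = atoms.get_initial_magnetic_moments()
        elif source in ("info", "arrays"):
            container = getattr(atoms, source)
            if key in container:
                extracted[name] = container[key]
            elif strict:
                raise KeyError(f"'{key}' not found in atoms.{source}")
        else:
            raise ValueError(f"unknown source '{source}'")
    return extracted


atoms = FakeAtoms(
    2, info={"charge": -1.0}, arrays={"initial_magmoms": [0.5, -0.5]}
)
requested = {
    "total_charge": ("info", "charge"),
    "magmoms": ("initial_magmoms", None),
    "spin": ("info", "spin"),  # absent from atoms.info, silently skipped
}
extra = extract_extra_data(atoms, requested)
```

With strict=False the missing "spin" entry is ignored; with strict=True the same request raises a KeyError, which covers both answers to the "optional or not" question above.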

bananenpampe commented 1 month ago

I am happy to work on this on my own once we have agreed on the API; we need it urgently ^^

Luthaf commented 1 month ago

So this is something we will want to do, but doing it properly will take time. If you need something urgently, we can do a temporary branch with it just for you =).

We already have a mechanism to add extra data to a system from the engine and retrieve it from the model, through System.add_data and System.get_data. What we are missing is a way for the model to request specific data from the engine (which can be something other than ASE!).

I can see something like this working, with a clear standard definition of what each piece of extra data means (like we are standardizing the model outputs):

# model export

model = ...

capabilities = ModelCapabilities(
    # other fields as usual
    extra_system_data=["charges", "..."],  # this would be added
)

# model definition
def forward(systems, ...):
    for system in systems:
        charges = system.get_data("charges")  # error if the data was not requested
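
To make the intended behaviour concrete, here is a toy sketch in plain Python. ToySystem and engine_fill are hypothetical stand-ins, not the metatensor classes; they only mimic the add_data/get_data pair mentioned above, with the engine storing just the data the model requested so that reading anything else is an error.

```python
class ToySystem:
    """Hypothetical stand-in for a system carrying extra data."""

    def __init__(self):
        self._data = {}

    def add_data(self, name, values):
        if name in self._data:
            raise ValueError(f"data '{name}' was already added")
        self._data[name] = values

    def get_data(self, name):
        if name not in self._data:
            raise KeyError(f"no data named '{name}'; did the model request it?")
        return self._data[name]


def engine_fill(system, requested, available):
    """Engine side: store only what the model asked for (illustrative)."""
    for name in requested:
        if name not in available:
            raise KeyError(f"the engine cannot provide '{name}'")
        system.add_data(name, available[name])


# What the engine happens to know about the structure.
available = {"charges": [0.4, -0.4], "magmoms": [1.0, 0.0]}

system = ToySystem()
engine_fill(system, requested=["charges"], available=available)

charges = system.get_data("charges")  # fine, it was requested
```

Calling system.get_data("magmoms") here raises a KeyError, because the model never listed it in its request.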

bananenpampe commented 1 month ago

What is the argument against having a bespoke ASE calculator option, in which the ASE calculator writes to the system data?

Luthaf commented 1 month ago

I would rather not put code in the main branch that will be removed or changed once we have a solution for this. But we can do this in a temporary branch so you can go on with the scientific project while we figure out the general solution!

bananenpampe commented 1 month ago

I do not need any of the calculators for the scientific part. I would prefer to have it as part of the metatensor releases, because it should be easily pip-installable for users. Happy to close the issue if there is no interest in an ASE-side implementation.

Luthaf commented 1 month ago

There is interest in a general mechanism that ASE will also use, so let's keep this open.

Luthaf commented 1 month ago

Some more details on how this could work: we would add an extra field in ModelCapabilities called something like extra_input, which would be a Dict[str, ModelOutput] describing everything the model wants as extra input.

The engine would then get this field from the model and, following some specification (like the one we have for the outputs), store the corresponding data in the systems. Here we could do the same as for outputs: have "standard" extra input data, and allow users to do whatever they want as long as they add a namespace somewhere (e.g. my_model::custom_input).

The model can then access the data in the system.
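
The naming rule could be checked along these lines. This is a sketch only: STANDARD_INPUTS is a made-up placeholder set, since no standard extra inputs are defined yet, and check_input_name is a hypothetical helper. It accepts a name when it is in the standard set or has the form namespace::name.

```python
# Placeholder: the real list of standard extra inputs is not defined yet.
STANDARD_INPUTS = {"charges"}


def check_input_name(name: str) -> str:
    """Classify a requested extra input as 'standard' or 'custom'.

    Raises ValueError for names that are neither standard nor namespaced.
    """
    if name in STANDARD_INPUTS:
        return "standard"

    namespace, separator, custom_name = name.partition("::")
    if separator and namespace and custom_name:
        return "custom"

    raise ValueError(
        f"'{name}' is not a standard input; custom inputs must look like "
        "'<namespace>::<name>'"
    )


kinds = {
    name: check_input_name(name)
    for name in ["charges", "my_model::custom_input"]
}
```

A bare custom_input without a namespace would be rejected, which keeps user-defined names from clashing with standard ones added later.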

Unresolved questions

Should we rename ModelCapabilities to something like ModelSpecification, since it will contain more than what the model can do?
Should we rename ModelOutput to something like Quantity/Property/Data, since it is not only about outputs?
Dict[str, ModelOutput] is currently used to describe a set of TensorMaps (the model outputs). Here it would describe a set of TensorBlocks in the system data. Should we use TensorMap for everything?

lab-cosmo / metatensor

Allow models to request additional input data from the engine #682
