chiang-yuan commented 3 months ago

Introduction

There have been an emerging trend and great interest in foundation machine learning interatomic potentials (MLIPs/MLFFs) for simulating atomistic systems close to density functional theory (DFT) accuracy, either for universal (supporting roughly >80 elements across ptable) or tailored chemical systems. Some of the best open-source universal MLIPs currently in the game are

MACE-MP (equivariant GNNs) [arxiv, code]
CHGNet (invariant GNNs) [paper, code]
M3GNet (invariant GNNs) [paper, code]

We have seen a surge in literature trying to benchmark these three models. However, the unified and fair benchmark is missing and people not familiar with the implementation and training details of these models oftentimes have been trying to publish misleading results. The community needs a new collections of tasks, datasets, and spaces for unified/fast inference and fair benchmark that is not subject to

Unfair/imbalanced training datasets for different models (e.g. some of the current best open-source dataset MP, MPtrj vs proprietary datasets used by GNoME, Matlantis)
Different training configurations/resources used for pretrained checkpoints. In particular, some people might try to beat other models by overfitting one model but underfitting others in their reports. Or in general not consider the design difference or dataset inconsistencies between the models.
Different machines used for inference and benchmark. One key factor of atomistic simulations, both single point calculations and molecular dynamics, is the inference speed. The faster a MLIP, the more/longer the simulations can be done. This is important for large-scale (both in time and length) simulations for materials physics and chemistry. To fairly judge the speed performance of each model, we need to run all the models on the same machine and GPUs.

Some existing example benchmarks that worth notice are:

matbench-discovery for materials discovery
OCP for catalysis
Example MD benchmarks and zero-shot prediction tasks implemented in MACE-MP

🤗 Task and Transformer Implementation

Input

Huggingface transformer input will only intake a batch of ASE Atoms (list[ase.Atoms]) or a pytorch geometric Data (torch_geometric.data.Data). Each model/checkpoint submission (in pytorch .pt or equivalent in flax, tensorflow) needs to implement data interconversion and interface with their core architecture, and take care of graph/edge/neighborlist generation using ase or matscipy. We won't allow any other dependencies or customed package installation. The submitted models/checkpoints will be converted into standalone hgf transformers.

Output

Commonly used outputs should obey consensus (probably in ASE convention) and be in a list of JSON/dict. For examples,

[
{"energy": -12.3, 
"forces": [[0.1, -0.7, 0.23], [0.3, -0.27, 0.29], [-0.06, -0.457, 0.23], [0.1, -0.7, 0.23]], 
"stress": [[1.11, 0.0, 0.0], [0.0, 1.12, 0.01], [0, 0.01, 1.10], [0.1, -0.7, 0.23]], 
"charges": [1.1, 0, -1.1], "dipoles": [], "magmoms": []},
{"energy": -12.3, 
"forces": [[0.1, -0.7, 0.23], [0.3, -0.27, 0.29], [-0.06, -0.457, 0.23], [0.1, -0.7, 0.23]], 
"stress": [[1.11, 0.0, 0.0], [0.0, 1.12, 0.01], [0, 0.01, 1.10], [0.1, -0.7, 0.23]], 
"charges": [1.1, 0, -1.1], "dipoles": [], "magmoms": []},
]

Wauplin commented 3 months ago

Hi @chiang-yuan, thanks for the detailed explanations! I have a few questions that comes to mind :

what are your expectations from a Hub perspective? and from a library/tools perspective?
you mentioned that atomistic simulations can fit in the transformers framework. Does that mean that could be a new pipeline?
benchmark-wise we have a template to host reproducible leaderboards as Spaces on the Hub. Would you be interested in that direction?
are you mentioning that you'd like to see it listed as a task on https://huggingface.co/tasks? We usually start to do that once there is a consistent ecosystem in the domain.

I'm asking this to figure out the scope you had in mind so that I can redirect you to resources and/or ping the correct people internally :hugs:

chiang-yuan commented 3 months ago

Thanks @Wauplin ! See below my response

what are your expectations from a Hub perspective? and from a library/tools perspective?

We might need a new 🤗 task for this specialized input and output.

you mentioned that atomistic simulations can fit in the transformers framework. Does that mean that could be a new pipeline?

I am not familiar with pipeline but we might need a class that allow to inject models saved in torch.nn.Module or other relevant formats.

benchmark-wise we have a template to host reproducible leaderboards as Spaces on the Hub. Would you be interested in that direction?

Yes I am looking into building a Space for leaderboard, but realize the first step is we need a task

are you mentioning that you'd like to see it listed as a task on https://huggingface.co/tasks? We usually start to do that once there is a consistent ecosystem in the domain. We already have clearly defined benchmarks and pretrained models and datasets but it seems very incompatible with current tasks available on hf. If you could point/help us how to build and deploy a leaderboard incorporated with hf dataset and hf models without adding a task first, then we could talk about hosting a small independent ecosystem.

Thanks so much for your help :)

Wauplin commented 3 months ago

Hi @chiang-yuan, sorry for the late reply. So I think there is a misunderstanding about what the Hugging Face Hub is and allows. The Hub is a place where anyone can host any ML model of there choice, no matter the task, architecture, size, etc. There is indeed a list of "official" tasks here but this list is not exhaustive and only based on what we/the community agree the most for what should be "a task".

If you want to start harmonization work for a new task atomistic simulations, you shouldn't need our help to get started :) For what I understand, you are planning to build a new library to define what is a atomistic simulation model and what should be its inputs and outputs, right? To do so, I would advice to use the PytorchModelHubMixin object. It's a class that you can inherit from in your own library and which adds built-in from_pretrained, save_pretrained and push_to_hub methods. Here is how it would look like in practice:


import torch
import torch.nn as nn
from huggingface_hub import PyTorchModelHubMixin

class AtomisticSimulationModel(
        nn.Module,
        PyTorchModelHubMixin,
        library_name="???",
        tags=["atomistic-simulation"],
        repo_url="???",
        docs_url="???",
    ):
    def __init__(self, hidden_size: int = 512, vocab_size: int = 30000, output_size: int = 4):
        super().__init__()
        self.param = nn.Parameter(torch.rand(hidden_size, vocab_size))
        self.linear = nn.Linear(output_size, vocab_size)
        # implement custom logic here 

    def forward(self, x):
        return self.linear(x + self.param)
        # implement custom logic here

    # implement custom methods here

then you will be able to do:

model = AtomisticSimulationModel(hidden_size=256)

# Save model weights to local directory
model.save_pretrained("my-awesome-model")

# Push model weights to the Hub
model.push_to_hub("my-awesome-model")

# Download and initialize weights from the Hub
model = AtomisticSimulationModel.from_pretrained("username/my-awesome-model")
model.hidden_size # 256

All of this can be built with huggingface_hub and hosted on the Hub without custom backend implementation on our side. All models pushed with this class will be tagged as atomistic-simulation and therefore easily searchable on the Hub. You will also have download counts, model cards, etc. You can implement all the custom logic in AtomisticSimulationModel and especially work on the correct input/output format that you expect people to use. Once the project gets mature, we can think of integrating it more deeply in the platform, especially to officially support the library with integrated code snippets for instance.

What do you think? Is this something you are looking for? Happy to answer any further questions :)

chiang-yuan commented 3 months ago

Thank you @Wauplin for the exhaustive answer! Your explanation clears a lot of my confusion from the documentation.

Yes - I think PyTorchModelHubMixin is exactly what I would picture how the MLIP community will archive and deploy models on huggingface. Being searchable on Hugging Face as tag sounds also very nice. I think at this stage this implementation would be sufficient :)

I guess this is still not enough to become an inference endpoint? It will be nice to have it so people can run simulations as API calls. But of course we could talk about it later once we really see more demands.

chiang-yuan commented 3 months ago

Hi @Wauplin, is there any way to access the revision/version/commit hash for the uploaded model through push_to_hub or any file loaded through the following example code?

fpath = hf_hub_download(
            repo_id="organization/hf_repo",
            subfolder="pretrained",
            filename="pretrained.model",
            revision=None # None if not known
        )
model = AtomisticSimulationModel.from_pretrained(fpath)

# not sure about the following

version = model._hub_mixin_config["revision"]

Wauplin commented 3 months ago

I guess this is still not enough to become an inference endpoint? It will be nice to have it so people can run simulations as API calls. But of course we could talk about it later once we really see more demands.

Indeed, support for InferenceEndpoints would require some more work -still doable-. We can discuss this specific topic later :)

Wauplin commented 3 months ago

is there any way to access the revision/version/commit hash for the uploaded model through push_to_hub or any file loaded through the following example code?

Ideally, what you would like to do is AtomisticSimulationModel.from_pretrained("organization/hf_repo") directly and let the mixin do the work for you. If the model has been uploaded via the mixin, it should work out of the box. By default, the None revision is the latest commit on the main branch. If you haven't specified any custom revision when pushing the model to the Hub, then you don't have to pass it when loading it back.

version = model._hub_mixin_config["revision"]

Ideally you should never access _hub_mixin_config attribute by yourself. If you need a value from this config in your class, you should just add the argument name in your __init__ method.

chiang-yuan commented 3 months ago

Thanks @Wauplin ! Sorry for not being clear. I understand it is easier to use mixin, but I actually want to get the revision from the retrieved model after AtomisticSimulationModel.from_pretrained("organization/hf_repo"). I looked into the doc and attribute suggestion but haven't found one that looks promising

Wauplin commented 3 months ago

Oh I see. Then this is not something doable unfortunately. What you can do however is:

from huggingface_hub import model_info

# Commit hash for latest revision on main
model_info("organization/hf_repo").sha

# Commit hash for revision "refs/pr/1" (pull request ref)
model_info("organization/hf_repo", revision="refs/pr/1").sha

# Commit hash for revision "my-custom-branch-or-tag"
model_info("organization/hf_repo", revision="my-custom-branch-or-tag").sha

This is not built-in the mixin but would definitely work. Out of curiosity, what's the purpose of getting the commit hash that has been loaded?

chiang-yuan commented 2 months ago

Hi @Wauplin sorry for late reply. I would like to use the commit hash as a uuid for pretrained checkpoints

Wauplin commented 2 months ago

Ok, then using model_info separately would work. I don't see a way of doing it out of the box when loading with the mixin (users usually don't care about the commit hash^^). So would this solution be ok for you?

chiang-yuan commented 2 months ago

Yes this works great! Thanks!

Wauplin commented 2 months ago

I'm closing this issue as the discussion is finished I think. If you have any question in the future, please reopen a new one :)

huggingface / huggingface_hub

Adding a new task "SciML/Physical Science/AI4Science: Atomistic Simulations" #2123

Introduction

🤗 Task and Transformer Implementation

Input

Output