Closed chiang-yuan closed 2 months ago
Hi @chiang-yuan, thanks for the detailed explanations! I have a few questions that come to mind:
- you mentioned that `atomistic simulations` can fit in the `transformers` framework. Does that mean that could be a new pipeline?
I'm asking this to figure out the scope you had in mind so that I can redirect you to resources and/or ping the correct people internally :hugs:
Thanks @Wauplin! See my responses below.
- what are your expectations from a Hub perspective? and from a library/tools perspective?
We might need a new 🤗 task for this specialized input and output.
- you mentioned that `atomistic simulations` can fit in the `transformers` framework. Does that mean that could be a new pipeline?
I am not familiar with pipelines, but we might need a class that allows injecting models saved as `torch.nn.Module` or in other relevant formats.
- benchmark-wise we have a template to host reproducible leaderboards as Spaces on the Hub. Would you be interested in that direction?
Yes, I am looking into building a Space for a leaderboard, but I realized that the first step is that we need a task.
- are you mentioning that you'd like to see it listed as a task on https://huggingface.co/tasks? We usually start to do that once there is a consistent ecosystem in the domain.
We already have clearly defined benchmarks, pretrained models, and datasets, but they seem incompatible with the tasks currently available on HF. If you could point us to, or help us with, a way to build and deploy a leaderboard integrated with HF datasets and HF models without adding a task first, then we could talk about hosting a small independent ecosystem.
Thanks so much for your help :)
Hi @chiang-yuan, sorry for the late reply. I think there is a misunderstanding about what the Hugging Face Hub is and what it allows. The Hub is a place where anyone can host any ML model of their choice, no matter the task, architecture, size, etc. There is indeed a list of "official" tasks here, but this list is not exhaustive and is based only on what we and the community broadly agree should count as "a task".
If you want to start harmonization work for a new task `atomistic simulations`, you shouldn't need our help to get started :) From what I understand, you are planning to build a new library to define what an atomistic simulation model is and what its inputs and outputs should be, right? To do so, I would advise using the `PyTorchModelHubMixin` object. It's a class that you can inherit from in your own library and which adds built-in `from_pretrained`, `save_pretrained` and `push_to_hub` methods. Here is what it would look like in practice:
```python
import torch
import torch.nn as nn
from huggingface_hub import PyTorchModelHubMixin


class AtomisticSimulationModel(
    nn.Module,
    PyTorchModelHubMixin,
    library_name="???",
    tags=["atomistic-simulation"],
    repo_url="???",
    docs_url="???",
):
    def __init__(self, hidden_size: int = 512, vocab_size: int = 30000, output_size: int = 4):
        super().__init__()
        # Keep the value as an attribute so it can be read back after loading
        self.hidden_size = hidden_size
        self.param = nn.Parameter(torch.rand(hidden_size, vocab_size))
        # Maps the last dimension from vocab_size to output_size
        self.linear = nn.Linear(vocab_size, output_size)
        # implement custom logic here

    def forward(self, x):
        # implement custom logic here
        return self.linear(x + self.param)

    # implement custom methods here
```
Then you will be able to do:

```python
model = AtomisticSimulationModel(hidden_size=256)

# Save model weights to local directory
model.save_pretrained("my-awesome-model")

# Push model weights to the Hub
model.push_to_hub("my-awesome-model")

# Download and initialize weights from the Hub
model = AtomisticSimulationModel.from_pretrained("username/my-awesome-model")
model.hidden_size  # 256
```
All of this can be built with `huggingface_hub` and hosted on the Hub without custom backend implementation on our side. All models pushed with this class will be tagged as `atomistic-simulation` and therefore easily searchable on the Hub. You will also have download counts, model cards, etc. You can implement all the custom logic in `AtomisticSimulationModel` and especially work on the correct input/output format that you expect people to use. Once the project matures, we can think of integrating it more deeply in the platform, especially to officially support the library with integrated code snippets for instance.
What do you think? Is this something you are looking for? Happy to answer any further questions :)
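As an illustration of the tag-based discovery described above, here is a minimal sketch that lists Hub models carrying a given tag. It assumes the `atomistic-simulation` tag from the example class and relies on `HfApi.list_models` from `huggingface_hub`. The helper name `find_models_by_tag` and its `api` parameter are hypothetical; they are introduced only so the function can be exercised with a stub client and without network access.

```python
from typing import Any, List, Optional


def find_models_by_tag(tag: str, limit: int = 10, api: Optional[Any] = None) -> List[str]:
    """Return repo ids of Hub models carrying `tag` (hypothetical helper)."""
    if api is None:
        # Imported lazily so the helper can be tested with a stub client.
        from huggingface_hub import HfApi

        api = HfApi()
    # `filter` accepts a tag string; `limit` caps the number of results fetched.
    return [info.id for info in api.list_models(filter=tag, limit=limit)]


# Example call (requires network access):
# find_models_by_tag("atomistic-simulation")
```

Only models pushed with that tag in their metadata would show up, so the result depends on how consistently the community applies it.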
Thank you @Wauplin for the exhaustive answer! Your explanation clears up a lot of my confusion about the documentation.
Yes, I think `PyTorchModelHubMixin` is exactly how I picture the MLIP community archiving and deploying models on Hugging Face. Being searchable on Hugging Face by tag also sounds very nice. I think at this stage this implementation would be sufficient :)
I guess this is still not enough to become an inference endpoint? It will be nice to have it so people can run simulations as API calls. But of course we could talk about it later once we really see more demands.
Hi @Wauplin, is there any way to access the revision/version/commit hash for the uploaded model through `push_to_hub` or any file loaded through the following example code?

```python
from huggingface_hub import hf_hub_download

fpath = hf_hub_download(
    repo_id="organization/hf_repo",
    subfolder="pretrained",
    filename="pretrained.model",
    revision=None,  # None if not known
)
model = AtomisticSimulationModel.from_pretrained(fpath)

# not sure about the following
version = model._hub_mixin_config["revision"]
```
> I guess this is still not enough to become an inference endpoint? It will be nice to have it so people can run simulations as API calls. But of course we could talk about it later once we really see more demands.

Indeed, support for Inference Endpoints would require some more work, though it is still doable. We can discuss this specific topic later :)

> is there any way to access the revision/version/commit hash for the uploaded model through push_to_hub or any file loaded through the following example code?
Ideally, what you would like to do is `AtomisticSimulationModel.from_pretrained("organization/hf_repo")` directly and let the mixin do the work for you. If the model has been uploaded via the mixin, it should work out of the box. By default, the `None` revision is the latest commit on the `main` branch. If you haven't specified any custom revision when pushing the model to the Hub, then you don't have to pass it when loading it back.
> version = model._hub_mixin_config["revision"]

Ideally you should never access the `_hub_mixin_config` attribute yourself. If you need a value from this config in your class, you should just add the argument name in your `__init__` method.
Thanks @Wauplin! Sorry for not being clear. I understand it is easier to use the mixin, but I actually want to get the revision from the retrieved model after `AtomisticSimulationModel.from_pretrained("organization/hf_repo")`. I looked into the docs and the suggested attributes but haven't found one that looks promising.
Oh I see. Then this is not something doable unfortunately. What you can do however is:
```python
from huggingface_hub import model_info

# Commit hash for latest revision on main
model_info("organization/hf_repo").sha

# Commit hash for revision "refs/pr/1" (pull request ref)
model_info("organization/hf_repo", revision="refs/pr/1").sha

# Commit hash for revision "my-custom-branch-or-tag"
model_info("organization/hf_repo", revision="my-custom-branch-or-tag").sha
```
This is not built into the mixin but would definitely work. Out of curiosity, what's the purpose of getting the commit hash that has been loaded?
Hi @Wauplin sorry for late reply. I would like to use the commit hash as a uuid for pretrained checkpoints
Ok, then using `model_info` separately would work. I don't see a way of doing it out of the box when loading with the mixin (users usually don't care about the commit hash^^). So would this solution be ok for you?
Yes this works great! Thanks!
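A minimal sketch of that workflow, for readers who want to use the commit hash as a checkpoint identifier: resolve the hash with `model_info` first, then pass it as `revision` to `from_pretrained` so that the loaded weights correspond exactly to the recorded hash. The names `CheckpointId`, `resolve_commit_hash`, and `load_pinned` are hypothetical and not part of `huggingface_hub`; the `resolve` parameter exists only so the logic can be tested offline.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple


@dataclass(frozen=True)
class CheckpointId:
    """Hypothetical identifier pairing a Hub repo with a commit hash."""

    repo_id: str
    commit_hash: str

    def __str__(self) -> str:
        return f"{self.repo_id}@{self.commit_hash}"


def resolve_commit_hash(repo_id: str, revision: Optional[str] = None) -> str:
    """Ask the Hub which commit a branch, tag, or PR ref points to."""
    # Imported lazily so the rest of the sketch works without the library.
    from huggingface_hub import model_info

    return model_info(repo_id, revision=revision).sha


def load_pinned(
    model_cls: Any,
    repo_id: str,
    revision: Optional[str] = None,
    resolve: Callable[[str, Optional[str]], str] = resolve_commit_hash,
) -> Tuple[Any, CheckpointId]:
    """Load `model_cls` at an exact commit and return it with its identifier."""
    sha = resolve(repo_id, revision)
    # Loading by commit hash avoids a race with pushes made after resolution.
    model = model_cls.from_pretrained(repo_id, revision=sha)
    return model, CheckpointId(repo_id, sha)


# Example call (requires network access):
# model, ckpt = load_pinned(AtomisticSimulationModel, "organization/hf_repo")
```

Resolving the hash before loading, instead of after, is a design choice: it keeps the identifier and the weights in sync even if the branch moves between the two calls.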
I'm closing this issue as I think the discussion is finished. If you have any questions in the future, please open a new one :)
Introduction
There has been an emerging trend and great interest in foundation machine learning interatomic potentials (MLIPs/MLFFs) for simulating atomistic systems at close to density functional theory (DFT) accuracy, either for universal (supporting roughly >80 elements across the periodic table) or tailored chemical systems. Some of the best open-source universal MLIPs currently in the game are
We have seen a surge in literature trying to benchmark these three models. However, a unified and fair benchmark is missing, and people not familiar with the implementation and training details of these models have often published misleading results. The community needs a new collection of tasks, datasets, and Spaces for unified/fast inference and fair benchmarking that is not subject to
Some existing example benchmarks worth noting are:
🤗 Task and Transformer Implementation
Input
`list[ase.Atoms]`) or a PyTorch Geometric Data (`torch_geometric.data.Data`). Each model/checkpoint submission (in PyTorch `.pt` or equivalent in Flax, TensorFlow) needs to implement data interconversion and interface with its core architecture, and take care of graph/edge/neighbor list generation using `ase` or `matscipy`. We won't allow any other dependencies or custom package installation. The submitted models/checkpoints will be converted into standalone hgf transformers.
Output