huggingface / huggingface_hub

The official Python client for the Hugging Face Hub.
https://huggingface.co/docs/huggingface_hub
Apache License 2.0

Model card metadata format is not preserved when loaded + saved again #2564

Closed by Wauplin 1 month ago

Wauplin commented 1 month ago

Here is an example commit: https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct/commit/846357c7ee5e3f50575fd4294edb3d898c8ea100

Let's try to find a way to improve this, especially for fields that haven't changed. Can't promise there is a good solution though.

Related to a Slack thread (private).

julien-c commented 1 month ago

Yes, in the initial implementation I was trying to do that. At the very least, you should keep the attribute order.

hlky commented 1 month ago
diff --git a/src/huggingface_hub/repocard.py b/src/huggingface_hub/repocard.py
index f6ae591f..12c3e84e 100644
--- a/src/huggingface_hub/repocard.py
+++ b/src/huggingface_hub/repocard.py
@@ -109,7 +109,9 @@ class RepoCard:
             data_dict = {}
             self.text = content

-        self.data = self.card_data_class(**data_dict, ignore_metadata_errors=self.ignore_metadata_errors)
+        self.data = self.card_data_class(
+            **data_dict, ignore_metadata_errors=self.ignore_metadata_errors, original_order=list(data_dict.keys())
+        )

     def __str__(self):
         return self.content
diff --git a/src/huggingface_hub/repocard_data.py b/src/huggingface_hub/repocard_data.py
index b9b93aac..6b41cc57 100644
--- a/src/huggingface_hub/repocard_data.py
+++ b/src/huggingface_hub/repocard_data.py
@@ -172,8 +172,12 @@ class CardData:
     inherit from `dict` to allow this export step.
     """

-    def __init__(self, ignore_metadata_errors: bool = False, **kwargs):
+    def __init__(self, ignore_metadata_errors: bool = False, original_order: Optional[List[str]] = None, **kwargs):
         self.__dict__.update(kwargs)
+        if original_order:
+            self.__dict__ = {
+                k: self.__dict__[k] for k in original_order + list(set(self.__dict__.keys()) - set(original_order))
+            }

     def to_dict(self) -> Dict[str, Any]:
         """Converts CardData to a dict.
@@ -316,6 +320,7 @@ class ModelCardData(CardData):
         pipeline_tag: Optional[str] = None,
         tags: Optional[List[str]] = None,
         ignore_metadata_errors: bool = False,
+        original_order: Optional[List[str]] = None,
         **kwargs,
     ):
         self.base_model = base_model
@@ -347,7 +352,7 @@ class ModelCardData(CardData):
                         " some information will be lost. Use it at your own risk."
                     )

-        super().__init__(**kwargs)
+        super().__init__(**kwargs, original_order=original_order)

         if self.eval_results:
             if isinstance(self.eval_results, EvalResult):

Something like this, WDYT?
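
One thing to watch in the reordering step of the diff above: keys that are not in original_order are collected through a set difference, and Python sets do not guarantee iteration order, so those trailing keys could come out in a different order between runs. Indexing self.__dict__[k] would also raise a KeyError if a key from the original metadata is not stored as an attribute under the same name. A minimal sketch of an order-preserving alternative on plain dicts follows; reorder_keys and the sample data are hypothetical and not part of huggingface_hub.

```python
from typing import Any, Dict, List, Optional


def reorder_keys(data: Dict[str, Any], original_order: Optional[List[str]] = None) -> Dict[str, Any]:
    """Return a copy of `data` with keys from `original_order` first, the rest after.

    Hypothetical helper for illustration only.
    """
    if not original_order:
        return dict(data)
    # Keep only the original keys that still exist, so a missing key cannot raise KeyError.
    ordered = [key for key in original_order if key in data]
    # Append the remaining keys in their current insertion order (deterministic, unlike a set).
    seen = set(ordered)
    ordered += [key for key in data if key not in seen]
    return {key: data[key] for key in ordered}


# Sample attributes in the order a card data object might hold them (assumed data).
attrs = {
    "language": ["en"],
    "library_name": "transformers",
    "license": "llama3.1",
    "tags": ["meta"],
    "new_field": 1,
    "another_new_field": 2,
}
# "model-index" stands in for a key that is present in the file but not stored as an attribute.
original = ["language", "tags", "license", "model-index", "library_name"]

print(list(reorder_keys(attrs, original)))
# -> ['language', 'tags', 'license', 'library_name', 'new_field', 'another_new_field']
```

The same two list comprehensions could replace the set-based expression inside the proposed __init__ without changing the result for the example below.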

from huggingface_hub import ModelCard

model_card = """---
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
pipeline_tag: text-generation
tags:
- facebook
- meta
- pytorch
- llama
- llama-3
license: llama3.1
extra_gated_prompt: >-
  ### LLAMA 3.1 COMMUNITY LICENSE AGREEMENT
extra_gated_fields:
  First Name: text
  Last Name: text
  Date of birth: date_picker
  Country: country
  Affiliation: text
  Job title:
    type: select
    options:
    - Student
    - Research Graduate
    - AI researcher
    - AI developer/engineer
    - Reporter
    - Other
  geo: ip_location
  By clicking Submit below I accept the terms of the license and acknowledge that the information I provide will be collected stored processed and shared in accordance with the Meta Privacy Policy: checkbox
extra_gated_description: >-
  The information you provide will be collected, stored, processed and shared in
  accordance with the [Meta Privacy
  Policy](https://www.facebook.com/privacy/policy/).
extra_gated_button_content: Submit
library_name: transformers
---
"""

card = ModelCard(model_card)
card.content

Currently returns:

"---\nlanguage:\n- en\n- de\n- fr\n- it\n- pt\n- hi\n- es\n- th\nlibrary_name: transformers\nlicense: llama3.1\npipeline_tag: text-generation\ntags:\n- facebook\n- meta\n- pytorch\n- llama\n- llama-3\nextra_gated_prompt: '### LLAMA 3.1 COMMUNITY LICENSE AGREEMENT'\nextra_gated_fields:\n  First Name: text\n  Last Name: text\n  Date of birth: date_picker\n  Country: country\n  Affiliation: text\n  Job title:\n    type: select\n    options:\n    - Student\n    - Research Graduate\n    - AI researcher\n    - AI developer/engineer\n    - Reporter\n    - Other\n  geo: ip_location\n  ? By clicking Submit below I accept the terms of the license and acknowledge that\n    the information I provide will be collected stored processed and shared in accordance\n    with the Meta Privacy Policy\n  : checkbox\nextra_gated_description: The information you provide will be collected, stored, processed\n  and shared in accordance with the [Meta Privacy Policy](https://www.facebook.com/privacy/policy/).\nextra_gated_button_content: Submit\n---\n"

With the patch, the original order is maintained:

"---\nlanguage:\n- en\n- de\n- fr\n- it\n- pt\n- hi\n- es\n- th\npipeline_tag: text-generation\ntags:\n- facebook\n- meta\n- pytorch\n- llama\n- llama-3\nlicense: llama3.1\nextra_gated_prompt: '### LLAMA 3.1 COMMUNITY LICENSE AGREEMENT'\nextra_gated_fields:\n  First Name: text\n  Last Name: text\n  Date of birth: date_picker\n  Country: country\n  Affiliation: text\n  Job title:\n    type: select\n    options:\n    - Student\n    - Research Graduate\n    - AI researcher\n    - AI developer/engineer\n    - Reporter\n    - Other\n  geo: ip_location\n  ? By clicking Submit below I accept the terms of the license and acknowledge that\n    the information I provide will be collected stored processed and shared in accordance\n    with the Meta Privacy Policy\n  : checkbox\nextra_gated_description: The information you provide will be collected, stored, processed\n  and shared in accordance with the [Meta Privacy Policy](https://www.facebook.com/privacy/policy/).\nextra_gated_button_content: Submit\nlibrary_name: transformers\n---\n"

And the order is still maintained after changing a field:

card = ModelCard(model_card)
card.data.license = "test"
card.content
"---\nlanguage:\n- en\n- de\n- fr\n- it\n- pt\n- hi\n- es\n- th\npipeline_tag: text-generation\ntags:\n- facebook\n- meta\n- pytorch\n- llama\n- llama-3\nlicense: test\nextra_gated_prompt: '### LLAMA 3.1 COMMUNITY LICENSE AGREEMENT'\nextra_gated_fields:\n  First Name: text\n  Last Name: text\n  Date of birth: date_picker\n  Country: country\n  Affiliation: text\n  Job title:\n    type: select\n    options:\n    - Student\n    - Research Graduate\n    - AI researcher\n    - AI developer/engineer\n    - Reporter\n    - Other\n  geo: ip_location\n  ? By clicking Submit below I accept the terms of the license and acknowledge that\n    the information I provide will be collected stored processed and shared in accordance\n    with the Meta Privacy Policy\n  : checkbox\nextra_gated_description: The information you provide will be collected, stored, processed\n  and shared in accordance with the [Meta Privacy Policy](https://www.facebook.com/privacy/policy/).\nextra_gated_button_content: Submit\nlibrary_name: transformers\n---\n"

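Long escaped strings like the ones above are hard to compare by eye. As a rough sketch, the top-level key order can be pulled out of a card's metadata block with the standard library alone and compared before and after a round trip. The helper top_level_keys and the shortened sample strings are hypothetical; they only mirror the shape of the outputs shown above. The check covers key order only. It does not detect the scalar style changes that are also visible above, such as the folded `>-` blocks being rewritten as single-line or quoted strings.

```python
import re
from typing import List


def top_level_keys(card_text: str) -> List[str]:
    """List top-level metadata keys, in order, from a card's `---` delimited header.

    Hypothetical helper: a line-based scan, not a YAML parser.
    """
    lines = card_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return []
    keys = []
    for line in lines[1:]:
        if line.strip() == "---":  # end of the metadata block
            break
        # Top-level keys start in column 0; nested keys are indented and list items start with "-".
        match = re.match(r"^([A-Za-z_][\w\-]*):(\s|$)", line)
        if match:
            keys.append(match.group(1))
    return keys


# Shortened stand-ins for the original card and the current (reordered) output.
before = "---\nlanguage:\n- en\npipeline_tag: text-generation\ntags:\n- meta\nlicense: llama3.1\nlibrary_name: transformers\n---\n"
after_current = "---\nlanguage:\n- en\nlibrary_name: transformers\nlicense: llama3.1\npipeline_tag: text-generation\ntags:\n- meta\n---\n"

print(top_level_keys(before))
# -> ['language', 'pipeline_tag', 'tags', 'license', 'library_name']
print(top_level_keys(before) == top_level_keys(after_current))
# -> False
```
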
The same original_order change would also need to be added to DatasetCardData.
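
A sketch of what that forwarding could look like, mirroring the ModelCardData hunk: the subclass accepts original_order as an explicit keyword and passes it to the base class after setting its own fields. BaseCardData and DatasetLikeCardData are hypothetical stand-ins written for this example, not the real CardData and DatasetCardData classes, and their field lists are deliberately cut down.

```python
from typing import Any, Dict, List, Optional


class BaseCardData:
    """Hypothetical stand-in for CardData."""

    def __init__(self, original_order: Optional[List[str]] = None, **kwargs: Any):
        self.__dict__.update(kwargs)
        if original_order:
            # Original keys first (if still present), then any other attributes in insertion order.
            known = [k for k in original_order if k in self.__dict__]
            rest = [k for k in self.__dict__ if k not in known]
            self.__dict__ = {k: self.__dict__[k] for k in known + rest}

    def to_dict(self) -> Dict[str, Any]:
        return dict(self.__dict__)


class DatasetLikeCardData(BaseCardData):
    """Hypothetical stand-in for DatasetCardData with only two explicit fields."""

    def __init__(
        self,
        *,
        language: Optional[List[str]] = None,
        license: Optional[str] = None,
        original_order: Optional[List[str]] = None,
        **kwargs: Any,
    ):
        # Explicit fields are assigned first, which is what changes the key order.
        self.language = language
        self.license = license
        # Forward the order from the file so the base class can restore it.
        super().__init__(original_order=original_order, **kwargs)


metadata = {"license": "mit", "pretty_name": "Demo", "language": ["en"]}
data = DatasetLikeCardData(**metadata, original_order=list(metadata))
print(list(data.to_dict()))
# -> ['license', 'pretty_name', 'language']
```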