huggingface / huggingface_hub

The official Python client for the Huggingface Hub.
https://huggingface.co/docs/huggingface_hub
Apache License 2.0

Model Card: Allow for dicts in `datasets` and `base_model` and also update spec #2479

Open mofosyne opened 3 months ago

mofosyne commented 3 months ago

Is your feature request related to a problem? Please describe.

I was working on https://github.com/ggerganov/llama.cpp/pull/8875 to integrate some changes to how we interpret parent models and datasets in GGUF metadata, and was alerted that your code currently interprets `datasets` as only `List[str]`, whereas the changes we are proposing would also allow URLs and dicts in both `datasets` and `base_model`.

Describe the solution you'd like

Update the description to indicate support for URLs and dict metadata in both the `datasets` and `base_model` entries of the model card, and update the type checks to accept dicts as an option.

Describe alternatives you've considered

We can already carry this extra metadata in the GGUF file format via metadata override files, but it would be nice to sync these features so we can more easily pick up this information from the model creator's model card.

Additional context

The code area I'm looking at is https://github.com/huggingface/huggingface_hub/blob/e9cd695d7bd9e81b4eceb8f4da557a0cfa387b99/src/huggingface_hub/repocard_data.py#L249-L251
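
For context, here is a minimal sketch (repo ids are illustrative, not taken from this thread) of the current convention as seen through `ModelCardData`, where `datasets` and `base_model` are plain strings or lists of strings:

from huggingface_hub import ModelCardData

# Current convention: datasets/base_model hold repo ids (or URLs) as strings,
# which is what the type checks in repocard_data.py expect.
card_data = ModelCardData(
    license="apache-2.0",
    datasets=["wikipedia", "c4"],        # list of dataset ids (illustrative)
    base_model="meta-llama/Llama-2-7b",  # a single model id, or a list of them
)
print(card_data.to_yaml())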

Wauplin commented 3 months ago

Hi @mofosyne, thanks for raising the topic. Unfortunately, this is not an easy constraint to lift. It is not only a matter of type annotations but of server-side constraints. You can see it more as a "naming convention" rather than a hard technical constraint. The problem with lifting this limit is that we would have to update how we consume these fields in many places across the HF ecosystem. Also, since we guarantee specific types for model card metadata, third-party libraries and users rely on us not to break things over time. Supporting both dictionaries and lists for this field would unfortunately be a big breaking change.

cc @julien-c

julien-c commented 3 months ago

Yes, I agree with @Wauplin. For your use case @mofosyne, you could add your own metadata property, no? (And we can even add built-in support for it if a standard emerges.)
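
As a hedged sketch of this suggestion (using the base_model_sources name adopted later in this thread; values are illustrative): `ModelCardData` keeps unrecognized keyword arguments as extra metadata and serializes them into the YAML header, so a custom property can be carried without changing the official spec.

from huggingface_hub import ModelCardData

# Custom, non-standard key (name chosen for illustration); ModelCardData passes
# unrecognized kwargs through to the card metadata unchanged.
card_data = ModelCardData(
    license="apache-2.0",
    base_model_sources=[
        {"name": "GPT-3", "organization": "OpenAI",
         "repo_url": "https://github.com/openai/gpt-3"},
    ],
)
print(card_data.to_yaml())  # the custom key appears alongside the standard fields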

mofosyne commented 2 weeks ago

Thanks. This is merged in now. We will be sticking to these fields for the detailed dict representation.

So it would look something like this (note: dummy data provided by ChatGPT for illustrative purposes only):

base_model_sources:
  - name: "GPT-3"
    author: "OpenAI"
    version: "3.0"
    organization: "OpenAI"
    description: "A large language model capable of performing a wide variety of language tasks."
    url: "https://openai.com/research/gpt-3"
    doi: "10.5555/gpt3doi123456"
    uuid: "123e4567-e89b-12d3-a456-426614174000"
    repo_url: "https://github.com/openai/gpt-3"

  - name: "BERT"
    author: "Google AI Language"
    version: "1.0"
    organization: "Google"
    description: "A transformer-based model pretrained on English to achieve state-of-the-art performance on a range of NLP tasks."
    url: "https://github.com/google-research/bert"
    doi: "10.5555/bertdoi789012"
    uuid: "987e6543-e21a-43f3-a356-527614173999"
    repo_url: "https://github.com/google-research/bert"

dataset_sources:
  - name: "Wikipedia Corpus"
    author: "Wikimedia Foundation"
    version: "2021-06"
    organization: "Wikimedia"
    description: "A dataset comprising the full English Wikipedia, used to train models in a range of natural language tasks."
    url: "https://dumps.wikimedia.org/enwiki/"
    doi: "10.5555/wikidoi234567"
    uuid: "234e5678-f90a-12d3-c567-426614172345"
    repo_url: "https://github.com/wikimedia/wikipedia-corpus"

  - name: "Common Crawl"
    author: "Common Crawl Foundation"
    version: "2021-04"
    organization: "Common Crawl"
    description: "A dataset containing web-crawled data from various domains, providing a broad range of text."
    url: "https://commoncrawl.org"
    doi: "10.5555/ccdoi345678"
    uuid: "345e6789-f90b-34d5-d678-426614173456"
    repo_url: "https://github.com/commoncrawl/cc-crawl-data"

These will fill in the following metadata fields in the GGUF key-value store (an illustrative flattening sketch follows the list):

general.base_model.count
general.base_model.{id}.name
general.base_model.{id}.author
general.base_model.{id}.version
general.base_model.{id}.organization
general.base_model.{id}.description
general.base_model.{id}.url
general.base_model.{id}.doi
general.base_model.{id}.uuid
general.base_model.{id}.repo_url

general.dataset.count
general.dataset.{id}.name
general.dataset.{id}.author
general.dataset.{id}.version
general.dataset.{id}.organization
general.dataset.{id}.description
general.dataset.{id}.url
general.dataset.{id}.doi
general.dataset.{id}.uuid
general.dataset.{id}.repo_url
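
For illustration only (this is not llama.cpp's actual converter code), flattening the base_model_sources / dataset_sources lists above into these key names could look like:

# Illustrative sketch: map a list of source dicts to flat 'general.<prefix>.*' keys.
def flatten_sources(sources, prefix):
    kv = {f"general.{prefix}.count": len(sources)}
    for idx, src in enumerate(sources):
        for field in ("name", "author", "version", "organization",
                      "description", "url", "doi", "uuid", "repo_url"):
            if field in src:
                kv[f"general.{prefix}.{idx}.{field}"] = src[field]
    return kv

base_model_sources = [{"name": "GPT-3", "author": "OpenAI", "version": "3.0"}]
dataset_sources = [{"name": "Wikipedia Corpus", "author": "Wikimedia Foundation"}]

kv_store = {**flatten_sources(base_model_sources, "base_model"),
            **flatten_sources(dataset_sources, "dataset")}
# e.g. {'general.base_model.count': 1, 'general.base_model.0.name': 'GPT-3', ...}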

julien-c commented 2 weeks ago

cool @mofosyne – thanks for linking https://github.com/ggerganov/llama.cpp/pull/8875

Do you have models on the HF Hub using this convention already? We can add validation so the types are hinted to be correct, and we can monitor how usage grows.

Let's track how usage grows!

mofosyne commented 2 weeks ago

The feature hasn't been advertised anywhere at this stage... Will need to figure out the documentation next.

But in the meantime, I'll also need to figure out the most canonical form that would best fit your current model card parameters. This is because our model card parser is pretty forgiving of the various ways people randomly enter their parameters. (Plus, at the time I didn't realize you defined it here in the source code.)

On studying your current code base, I noticed you used model_name rather than name as I would have expected. So I prepended model_* to most of the parameters, except for 'license', 'tags', 'pipeline_tag' and 'language', to keep with the same pattern.

If so, then this is what I think your extended model card may look like. If you change model_name to name on your side, then it would make sense to remove the model_* parameter pattern. But either way works for me.

If you are happy with the above, then I'll update the documentation to match and you can sync to that when it gets popular.

# Model Card Fields
model_name: Example Model Six
model_author: John Smith
model_version: v1.0
model_organization: SparkExampleMind
model_description: This is an example of a model
model_quantized_by: Abbety Jenson
# Useful for cleanly regenerating default naming conventions
model_finetune: instruct
model_basename: llamabase
model_size_label: 8x2.3Q
# Licensing details
license: apache-2.0
license_name: 'Apache License Version 2.0, January 2004'
license_link: 'https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/apache-2.0.md'
# Model Location/ID
model_url: 'https://huggingface.co/SparkExampleMind/llamabase-8x2.3Q-instruct-v1.0-F16/blob/main/README.md'
model_doi: 'doi:10.1080/02626667.2018.1560449'
model_uuid: f18383df-ceb9-4ef3-b929-77e4dc64787c
model_repo_url: 'https://huggingface.co/SparkExampleMind/llamabase-8x2.3Q-instruct-v1.0-F16'
# Model Source If Conversion
source_model_url: 'https://huggingface.co/SparkExampleMind/llamabase-8x2.3Q-instruct-v1.0-safetensor/blob/main/README.md'
source_model_doi: 'doi:10.1080/02626667.2018.1560449'
source_model_uuid: 'a72998bf-3b84-4ff4-91c6-7a6b780507bc'
source_model_repo_url: 'https://huggingface.co/SparkExampleMind/llamabase-8x2.3Q-instruct-v1.0-safetensor'
# Model Parents (Merges, Pre-tuning, etc...)
base_model_sources:
  - name: GPT-3
    author: OpenAI
    version: '3.0'
    organization: OpenAI
    description: A large language model capable of performing a wide variety of language tasks.
    url: 'https://openai.com/research/gpt-3'
    doi: 10.5555/gpt3doi123456
    uuid: 123e4567-e89b-12d3-a456-426614174000
    repo_url: 'https://github.com/openai/gpt-3'
  - name: BERT
    author: Google AI Language
    version: '1.0'
    organization: Google
    description: A transformer-based model pretrained on English to achieve state-of-the-art performance on a range of NLP tasks.
    url: 'https://github.com/google-research/bert'
    doi: 10.5555/bertdoi789012
    uuid: 987e6543-e21a-43f3-a356-527614173999
    repo_url: 'https://github.com/google-research/bert'
# Model Datasets Used (Training data...)
dataset_sources:
  - name: Wikipedia Corpus
    author: Wikimedia Foundation
    version: 2021-06
    organization: Wikimedia
    description: A dataset comprising the full English Wikipedia, used to train models in a range of natural language tasks.
    url: 'https://dumps.wikimedia.org/enwiki/'
    doi: 10.5555/wikidoi234567
    uuid: 234e5678-f90a-12d3-c567-426614172345
    repo_url: 'https://github.com/wikimedia/wikipedia-corpus'
  - name: Common Crawl
    author: Common Crawl Foundation
    version: 2021-04
    organization: Common Crawl
    description: A dataset containing web-crawled data from various domains, providing a broad range of text.
    url: 'https://commoncrawl.org'
    doi: 10.5555/ccdoi345678
    uuid: 345e6789-f90b-34d5-d678-426614173456
    repo_url: 'https://github.com/commoncrawl/cc-crawl-data'
# Model Content Metadata
tags:
  - text generation
  - transformer
  - llama
  - tiny
  - tiny model
pipeline_tag:
  - text-classification
language:
  - en

(Note: I also noticed that 'pipeline_tag' and 'language' are missing an 's' at the end... but that's a nitpick.) (P.S. an idea to consider is to give brownie awards to repos with good metadata.)

Wauplin commented 2 weeks ago

I feel that keeping model_name for consistency with the existing convention, but having all other fields "raw" (author, version, organization, etc.), is better. I find the model_* prefix everywhere very verbose. Similarly, I'd keep dataset_name but not prepend dataset_* everywhere.

Wauplin commented 2 weeks ago

Also, this new proposition in https://github.com/huggingface/huggingface_hub/issues/2479#issuecomment-2482798095 adds far more fields to the model card than the suggestion in https://github.com/huggingface/huggingface_hub/issues/2479#issuecomment-2473101673. I think that adding base_model_sources and dataset_sources with defined specifications is fine, but adding all the other fields (source_model_url, source_model_doi, model_doi, model_uuid, model_quantized_by, model_finetune, model_organization, etc.) is too much and would bloat the model card metadata convention.

mofosyne commented 2 weeks ago

Ah I see. So the parent references should have more details for easier retrieval, but the model itself can be understood by context. Fair enough.

So if we don't dump out all the KV stuff, but just keep what is directly referenced in the current HF model card conventions (as defined in the Python source), plus the detailed parent model/dataset sources, it should look more like this:

# Model Card Fields
model_name: Example Model Six
# Licensing details
license: apache-2.0
license_name: Apache License Version 2.0, January 2004
license_link: https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/apache-2.0.md
# Model Parents (Merges, Pre-tuning, etc...)
base_model_sources:
  - name: GPT-3
    author: OpenAI
    version: '3.0'
    organization: OpenAI
    description: A large language model capable of performing a wide variety of language tasks.
    url: 'https://openai.com/research/gpt-3'
    doi: 10.5555/gpt3doi123456
    uuid: 123e4567-e89b-12d3-a456-426614174000
    repo_url: 'https://github.com/openai/gpt-3'
  - name: BERT
    author: Google AI Language
    version: '1.0'
    organization: Google
    description: A transformer-based model pretrained on English to achieve state-of-the-art performance on a range of NLP tasks.
    url: 'https://github.com/google-research/bert'
    doi: 10.5555/bertdoi789012
    uuid: 987e6543-e21a-43f3-a356-527614173999
    repo_url: 'https://github.com/google-research/bert'
# Model Datasets Used (Training data...)
dataset_sources:
  - name: Wikipedia Corpus
    author: Wikimedia Foundation
    version: '2021-06'
    organization: Wikimedia
    description: A dataset comprising the full English Wikipedia, used to train models in a range of natural language tasks.
    url: 'https://dumps.wikimedia.org/enwiki/'
    doi: 10.5555/wikidoi234567
    uuid: 234e5678-f90a-12d3-c567-426614172345
    repo_url: 'https://github.com/wikimedia/wikipedia-corpus'
  - name: Common Crawl
    author: Common Crawl Foundation
    version: '2021-04'
    organization: Common Crawl
    description: A dataset containing web-crawled data from various domains, providing a broad range of text.
    url: 'https://commoncrawl.org'
    doi: 10.5555/ccdoi345678
    uuid: 345e6789-f90b-34d5-d678-426614173456
    repo_url: 'https://github.com/commoncrawl/cc-crawl-data'
# Model Content Metadata
tags:
  - text generation
  - transformer
  - llama
  - tiny
  - tiny model
language:
  - en

#... other HF fields would go here... but they aren't present in the GGUF KV store...
#... so there is no direct analog... so they will be omitted on the llama.cpp side of the documentation...

Well @Wauplin, this does indeed look a bit more compact now. FYI, this is just going to be documentation on our side for now, but I'm double-checking that we won't be stepping on any toes. Thumbs up if all green.

(edit: removed pipeline_tag as I remembered it's not included in the GGUF)

Wauplin commented 2 weeks ago

Nice, I can confirm that this version is not stepping on anyone's toes! :+1: Gentle ping to @ggerganov @julien-c if you want to confirm that the metadata described above makes sense to you as well, so we can settle this for good.

julien-c commented 1 week ago

proposal looks good to me, but I would, whenever possible, also include our simpler base_model (array of model ids on the Hub) and datasets (array of dataset ids on the Hub) – whenever you know them – as we already have more built-in support for those

i.e. I would use the current proposal as an extension/add-on on top of existing conventional (simpler) metadata
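
A hedged sketch of that combination (repo ids and values are illustrative): keep the conventional base_model / datasets arrays of Hub ids, and add the detailed *_sources lists on top as extra metadata.

from huggingface_hub import ModelCardData

card_data = ModelCardData(
    # Conventional, already-supported fields: arrays of Hub ids
    base_model=["openai-community/gpt2"],
    datasets=["wikimedia/wikipedia"],
    # Add-on: detailed per-source dicts (custom keys, illustrative values)
    base_model_sources=[
        {"name": "GPT-2", "organization": "OpenAI",
         "repo_url": "https://huggingface.co/openai-community/gpt2"},
    ],
    dataset_sources=[
        {"name": "Wikipedia Corpus", "author": "Wikimedia Foundation",
         "url": "https://dumps.wikimedia.org/enwiki/"},
    ],
)
print(card_data.to_yaml())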

mofosyne commented 1 week ago

Okay, thanks. FYI, I've placed the mapping at https://github.com/ggerganov/llama.cpp/wiki/HuggingFace-Model-Card-Metadata-Interoperability-Consideration for future reference.