mofosyne opened this issue 3 months ago
Hi @mofosyne, thanks for raising the topic. Unfortunately, this is not an easy constraint to lift. It is not only a matter of type annotations but of server-side constraints. You can see it more as a "naming convention" than a hard technical constraint. The problem with lifting this limit is that we would have to update how we consume these fields in many places in the HF ecosystem. Also, since we ensure specific types for model card metadata, third-party libraries and users rely on us not to break things over time. Supporting both dictionaries and lists for this field would unfortunately be a big breaking change.
cc @julien-c
Yes agree with @Wauplin. For your use case @mofosyne you could add your own metadata property no? (and we can even add built-in support for it if a standard emerges)
Thanks, merged in now. We will be sticking to these fields for the detailed dict representation.
Hence something like this (note: dummy data provided by ChatGPT for illustrative purposes only):
```yaml
base_model_sources:
  - name: "GPT-3"
    author: "OpenAI"
    version: "3.0"
    organization: "OpenAI"
    description: "A large language model capable of performing a wide variety of language tasks."
    url: "https://openai.com/research/gpt-3"
    doi: "10.5555/gpt3doi123456"
    uuid: "123e4567-e89b-12d3-a456-426614174000"
    repo_url: "https://github.com/openai/gpt-3"
  - name: "BERT"
    author: "Google AI Language"
    version: "1.0"
    organization: "Google"
    description: "A transformer-based model pretrained on English to achieve state-of-the-art performance on a range of NLP tasks."
    url: "https://github.com/google-research/bert"
    doi: "10.5555/bertdoi789012"
    uuid: "987e6543-e21a-43f3-a356-527614173999"
    repo_url: "https://github.com/google-research/bert"
dataset_sources:
  - name: "Wikipedia Corpus"
    author: "Wikimedia Foundation"
    version: "2021-06"
    organization: "Wikimedia"
    description: "A dataset comprising the full English Wikipedia, used to train models in a range of natural language tasks."
    url: "https://dumps.wikimedia.org/enwiki/"
    doi: "10.5555/wikidoi234567"
    uuid: "234e5678-f90a-12d3-c567-426614172345"
    repo_url: "https://github.com/wikimedia/wikipedia-corpus"
  - name: "Common Crawl"
    author: "Common Crawl Foundation"
    version: "2021-04"
    organization: "Common Crawl"
    description: "A dataset containing web-crawled data from various domains, providing a broad range of text."
    url: "https://commoncrawl.org"
    doi: "10.5555/ccdoi345678"
    uuid: "345e6789-f90b-34d5-d678-426614173456"
    repo_url: "https://github.com/commoncrawl/cc-crawl-data"
```
These will fill in the following metadata fields in the GGUF key-value store:
```
general.base_model.count
general.base_model.{id}.name
general.base_model.{id}.author
general.base_model.{id}.version
general.base_model.{id}.organization
general.base_model.{id}.description
general.base_model.{id}.url
general.base_model.{id}.doi
general.base_model.{id}.uuid
general.base_model.{id}.repo_url
general.dataset.count
general.dataset.{id}.name
general.dataset.{id}.author
general.dataset.{id}.version
general.dataset.{id}.organization
general.dataset.{id}.description
general.dataset.{id}.url
general.dataset.{id}.doi
general.dataset.{id}.uuid
general.dataset.{id}.repo_url
```
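For illustration, here is a minimal Python sketch of how the parsed `*_sources` lists could be flattened into those keys. This is not the actual llama.cpp implementation; the helper and the sample data are made up:

```python
def flatten_sources(prefix: str, sources: list) -> dict:
    """Flatten a list of source dicts into GGUF-style KV pairs."""
    kv = {f"{prefix}.count": len(sources)}
    for idx, source in enumerate(sources):
        # Only emit keys for fields the card actually provides.
        for field in ("name", "author", "version", "organization",
                      "description", "url", "doi", "uuid", "repo_url"):
            if field in source:
                kv[f"{prefix}.{idx}.{field}"] = source[field]
    return kv

card = {
    "base_model_sources": [{"name": "GPT-3", "author": "OpenAI"}],
    "dataset_sources": [{"name": "Common Crawl"}],
}
kv_store = {
    **flatten_sources("general.base_model", card["base_model_sources"]),
    **flatten_sources("general.dataset", card["dataset_sources"]),
}
assert kv_store["general.base_model.count"] == 1
assert kv_store["general.base_model.0.name"] == "GPT-3"
```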
cool @mofosyne – thanks for linking https://github.com/ggerganov/llama.cpp/pull/8875
Do you have models on the HF Hub using this convention already? We can add validation so the types are hinted to be correct, and monitor how usage grows.
Let's track how usage grows!
The feature hasn't been advertised anywhere at this stage... Will need to figure out the documentation next.
But in the meantime, I'll also need to figure out the most canonical form that would best fit your current model card parameters. This is because our model card parser is pretty forgiving of the various ways people randomly enter their parameters. (Plus, at the time I didn't realize you defined it here in the source code.)
On studying your current code base, I noticed you used `model_name` rather than `name` as I would have expected. So I prepended `model_*` to most of the parameters, except for `license`, `tags`, `pipeline_tag` and `language`, to keep with the same pattern.
If so, then this is what I think your extended model card may look like. If you change `model_name` to `name` on your side, then it would make sense to remove the `model_*` parameter pattern, but either way works for me. If you are happy with the above, then I'll update the documentation to match, and you can sync to that when it gets popular.
```yaml
# Model Card Fields
model_name: Example Model Six
model_author: John Smith
model_version: v1.0
model_organization: SparkExampleMind
model_description: This is an example of a model
model_quantized_by: Abbety Jenson

# Useful for cleanly regenerating default naming conventions
model_finetune: instruct
model_basename: llamabase
model_size_label: 8x2.3Q

# Licensing details
license: apache-2.0
license_name: 'Apache License Version 2.0, January 2004'
license_link: 'https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/apache-2.0.md'

# Model Location/ID
model_url: 'https://huggingface.co/SparkExampleMind/llamabase-8x2.3Q-instruct-v1.0-F16/blob/main/README.md'
model_doi: 'doi:10.1080/02626667.2018.1560449'
model_uuid: f18383df-ceb9-4ef3-b929-77e4dc64787c
model_repo_url: 'https://huggingface.co/SparkExampleMind/llamabase-8x2.3Q-instruct-v1.0-F16'

# Model Source If Conversion
source_model_url: 'https://huggingface.co/SparkExampleMind/llamabase-8x2.3Q-instruct-v1.0-safetensor/blob/main/README.md'
source_model_doi: 'doi:10.1080/02626667.2018.1560449'
source_model_uuid: 'a72998bf-3b84-4ff4-91c6-7a6b780507bc'
source_model_repo_url: 'https://huggingface.co/SparkExampleMind/llamabase-8x2.3Q-instruct-v1.0-safetensor'

# Model Parents (Merges, Pre-tuning, etc...)
base_model_sources:
  - name: GPT-3
    author: OpenAI
    version: '3.0'
    organization: OpenAI
    description: A large language model capable of performing a wide variety of language tasks.
    url: 'https://openai.com/research/gpt-3'
    doi: 10.5555/gpt3doi123456
    uuid: 123e4567-e89b-12d3-a456-426614174000
    repo_url: 'https://github.com/openai/gpt-3'
  - name: BERT
    author: Google AI Language
    version: '1.0'
    organization: Google
    description: A transformer-based model pretrained on English to achieve state-of-the-art performance on a range of NLP tasks.
    url: 'https://github.com/google-research/bert'
    doi: 10.5555/bertdoi789012
    uuid: 987e6543-e21a-43f3-a356-527614173999
    repo_url: 'https://github.com/google-research/bert'

# Model Datasets Used (Training data...)
dataset_sources:
  - name: Wikipedia Corpus
    author: Wikimedia Foundation
    version: '2021-06'
    organization: Wikimedia
    description: A dataset comprising the full English Wikipedia, used to train models in a range of natural language tasks.
    url: 'https://dumps.wikimedia.org/enwiki/'
    doi: 10.5555/wikidoi234567
    uuid: 234e5678-f90a-12d3-c567-426614172345
    repo_url: 'https://github.com/wikimedia/wikipedia-corpus'
  - name: Common Crawl
    author: Common Crawl Foundation
    version: '2021-04'
    organization: Common Crawl
    description: A dataset containing web-crawled data from various domains, providing a broad range of text.
    url: 'https://commoncrawl.org'
    doi: 10.5555/ccdoi345678
    uuid: 345e6789-f90b-34d5-d678-426614173456
    repo_url: 'https://github.com/commoncrawl/cc-crawl-data'

# Model Content Metadata
tags:
  - text generation
  - transformer
  - llama
  - tiny
  - tiny model
pipeline_tag:
  - text-classification
language:
  - en
```
(Note: I also noticed that `pipeline_tag` and `language` are missing an 's' at the end... but that's a nitpick.) (P.S. An idea to consider is to give brownie points for repos with good metadata.)
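As an aside, judging purely from the example repo URLs above, the default naming convention these fields regenerate appears to compose roughly as `{basename}-{size_label}-{finetune}-{version}-{precision}`. A hypothetical sketch, not actual llama.cpp code:

```python
# Hypothetical helper: composes the default model name from the card
# fields, assuming the pattern visible in the example URLs above.
def default_model_name(basename: str, size_label: str, finetune: str,
                       version: str, precision: str = "F16") -> str:
    return "-".join([basename, size_label, finetune, version, precision])

assert (default_model_name("llamabase", "8x2.3Q", "instruct", "v1.0")
        == "llamabase-8x2.3Q-instruct-v1.0-F16")
```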
I feel that keeping `model_name` for consistency with the existing convention, but then having all other fields "raw" (author, version, organization, etc.), is better. I find the `model_*` prefix everywhere to be very verbose. Similarly, I'd keep `dataset_name` but not prepend `dataset_*` everywhere.
Also, this new proposition in https://github.com/huggingface/huggingface_hub/issues/2479#issuecomment-2482798095 adds way more fields to the model card than the suggestion in https://github.com/huggingface/huggingface_hub/issues/2479#issuecomment-2473101673. I think that adding `base_model_sources` and `dataset_sources` with defined specifications is fine, but adding all the other fields (source_model_url, source_model_doi, model_doi, model_uuid, model_quantized_by, model_finetune, model_organization, etc.) is too much and would bloat the model card metadata convention.
Ah I see. So the parent references should have more details for easier retrieval, but the model itself can be understood by context. Fair enough.
So if we don't dump out all the KV stuff, but just keep the fields directly referenced in the current HF model card conventions (as defined in the Python source), plus the detailed parent models/datasets, this should look more like:
```yaml
# Model Card Fields
model_name: Example Model Six

# Licensing details
license: apache-2.0
license_name: Apache License Version 2.0, January 2004
license_link: https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/apache-2.0.md

# Model Parents (Merges, Pre-tuning, etc...)
base_model_sources:
  - name: GPT-3
    author: OpenAI
    version: '3.0'
    organization: OpenAI
    description: A large language model capable of performing a wide variety of language tasks.
    url: 'https://openai.com/research/gpt-3'
    doi: 10.5555/gpt3doi123456
    uuid: 123e4567-e89b-12d3-a456-426614174000
    repo_url: 'https://github.com/openai/gpt-3'
  - name: BERT
    author: Google AI Language
    version: '1.0'
    organization: Google
    description: A transformer-based model pretrained on English to achieve state-of-the-art performance on a range of NLP tasks.
    url: 'https://github.com/google-research/bert'
    doi: 10.5555/bertdoi789012
    uuid: 987e6543-e21a-43f3-a356-527614173999
    repo_url: 'https://github.com/google-research/bert'

# Model Datasets Used (Training data...)
dataset_sources:
  - name: Wikipedia Corpus
    author: Wikimedia Foundation
    version: '2021-06'
    organization: Wikimedia
    description: A dataset comprising the full English Wikipedia, used to train models in a range of natural language tasks.
    url: 'https://dumps.wikimedia.org/enwiki/'
    doi: 10.5555/wikidoi234567
    uuid: 234e5678-f90a-12d3-c567-426614172345
    repo_url: 'https://github.com/wikimedia/wikipedia-corpus'
  - name: Common Crawl
    author: Common Crawl Foundation
    version: '2021-04'
    organization: Common Crawl
    description: A dataset containing web-crawled data from various domains, providing a broad range of text.
    url: 'https://commoncrawl.org'
    doi: 10.5555/ccdoi345678
    uuid: 345e6789-f90b-34d5-d678-426614173456
    repo_url: 'https://github.com/commoncrawl/cc-crawl-data'

# Model Content Metadata
tags:
  - text generation
  - transformer
  - llama
  - tiny
  - tiny model
language:
  - en

# ... other HF stuff here... but it isn't present in the GGUF KV store...
# ... so there is no direct analog... so it will be omitted on the llama.cpp side of the documentation...
```
Well @Wauplin, this does indeed look a bit more compact now. FYI, this is just going to be documentation on our side for now, but just double-checking that we won't be stepping on any toes. Thumbs up if all green.
(edit: removed pipeline_tag as I remembered it's not included in the GGUF)
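For completeness, a minimal sketch of consuming this convention, assuming PyYAML and a standard `---`-delimited frontmatter block (the helper itself is hypothetical, not actual llama.cpp or huggingface_hub code):

```python
import yaml  # PyYAML

def read_card_fields(readme_path: str) -> dict:
    """Pull the agreed-upon fields out of a README's YAML frontmatter."""
    with open(readme_path, encoding="utf-8") as f:
        text = f.read()
    if not text.startswith("---"):
        return {}
    frontmatter = text.split("---", 2)[1]  # text between the two '---' markers
    card = yaml.safe_load(frontmatter) or {}
    return {
        "model_name": card.get("model_name"),
        "license": card.get("license"),
        "license_name": card.get("license_name"),
        "license_link": card.get("license_link"),
        "base_model_sources": card.get("base_model_sources", []),
        "dataset_sources": card.get("dataset_sources", []),
        "tags": card.get("tags", []),
        "language": card.get("language", []),
    }
```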
Nice, I can confirm that this version is not stepping on anyone's toes! :+1: Gentle ping to @ggerganov @julien-c if you want to confirm that the metadata described above makes sense to you as well, so we can settle this for good.
Proposal looks good to me, but I would, whenever possible, also include our simpler `base_model` (array of model IDs on the Hub) and `datasets` (array of dataset IDs on the Hub), whenever you know them, as we already have more built-in support for those. I.e., I would use the current proposal as an extension/add-on on top of the existing conventional (simpler) metadata.
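To make the add-on idea concrete, a hypothetical card could carry both conventions side by side: the simple arrays for existing Hub tooling, plus the detailed sources for GGUF. A sketch with illustrative IDs and values, parsed with PyYAML:

```python
import yaml  # PyYAML

card = yaml.safe_load("""
base_model:                # simple convention: array of Hub model ids
  - openai-community/gpt2
datasets:                  # simple convention: array of Hub dataset ids
  - wikimedia/wikipedia
base_model_sources:        # detailed add-on convention
  - name: GPT-2
    organization: OpenAI
    repo_url: https://huggingface.co/openai-community/gpt2
dataset_sources:
  - name: Wikipedia Corpus
    organization: Wikimedia
""")
assert card["base_model"] == ["openai-community/gpt2"]
assert card["base_model_sources"][0]["name"] == "GPT-2"
```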
Okay, thanks. FYI, I've placed the mapping at https://github.com/ggerganov/llama.cpp/wiki/HuggingFace-Model-Card-Metadata-Interoperability-Consideration for future reference.
**Is your feature request related to a problem? Please describe.**

I was working on https://github.com/ggerganov/llama.cpp/pull/8875 to integrate some changes to how we interpret parent models and datasets in GGUF metadata, and was alerted that your code currently interprets `datasets` as only `List[str]`, while the changes we are proposing would support these types in `datasets` and `base_model`:

- `List[str]` of Hugging Face IDs
- `List[str]` of URLs to other repos
- `List[dict]` of dicts with fields like name, author, version, organization, url, doi, uuid and repo_url

**Describe the solution you'd like**
Update the description to indicate support for URLs and dict metadata in both `datasets` and `base_model` entries in the model card, as well as update type checks to support dict as an option.

**Describe alternatives you've considered**
We can already support this extra metadata in the GGUF file format via metadata override files, but it would be nice to sync these features so we can more easily grab this information from model creators' model cards.
**Additional context**
The code area I'm looking at is https://github.com/huggingface/huggingface_hub/blob/e9cd695d7bd9e81b4eceb8f4da557a0cfa387b99/src/huggingface_hub/repocard_data.py#L249-L251
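For context, the requested change amounts to relaxing that check so list entries may be strings (Hub IDs or URLs) or dicts. A rough sketch of the idea, not the actual huggingface_hub validation code:

```python
from typing import Any

def validate_sources(value: Any, field: str) -> list:
    """Accept None, or a list whose entries are str (Hub id / URL) or dict."""
    if value is None:
        return []
    if not isinstance(value, list):
        raise TypeError(f"`{field}` must be a list")
    for item in value:
        if not isinstance(item, (str, dict)):
            raise TypeError(
                f"`{field}` entries must be str or dict, got {type(item).__name__}"
            )
    return value

validate_sources(["openai-community/gpt2"], "base_model")  # ok: Hub id
validate_sources([{"name": "GPT-2"}], "base_model")        # ok: detailed dict
```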