Super nice!
Actually I just thought: for the initial read, you could probably issue the first request directly for the first 100 KB, maybe? And refetch only if needed.
This would avoid launching two network calls in most settings (the 100 KB threshold is adjustable).
Just an optimization that might be worthwhile in production.
Yes! I thought of that optimization too @Narsil. I'll probably implement it in a v2.
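A minimal sketch of that single-request optimization, in the same JS spirit as the proof-of-concept branch (the helper name, the constant, and the parsing details here are assumptions for illustration, not the actual v2 implementation):

```js
// Hypothetical helper: fetch the first 100 KB up front and issue a second
// Range request only if the header turns out to be larger than that.
const INITIAL_FETCH_SIZE = 100_000; // adjustable threshold

async function fetchSafetensorsHeader(url) {
  // First (and usually only) request: bytes 0..INITIAL_FETCH_SIZE-1.
  const res = await fetch(url, {
    headers: { Range: `bytes=0-${INITIAL_FETCH_SIZE - 1}` },
  });
  let buf = new Uint8Array(await res.arrayBuffer());

  // The first 8 bytes of a safetensors file are a little-endian u64
  // holding the length of the JSON header that follows.
  const headerLength = Number(
    new DataView(buf.buffer, buf.byteOffset, 8).getBigUint64(0, true)
  );

  if (8 + headerLength > buf.length) {
    // Rare case: the header did not fit in the initial read; fetch the rest.
    const res2 = await fetch(url, {
      headers: { Range: `bytes=${buf.length}-${8 + headerLength - 1}` },
    });
    const tail = new Uint8Array(await res2.arrayBuffer());
    const full = new Uint8Array(8 + headerLength);
    full.set(buf);
    full.set(tail, buf.length);
    buf = full;
  }

  return JSON.parse(
    new TextDecoder().decode(buf.subarray(8, 8 + headerLength))
  );
}
```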
Update for the top 100 most downloaded models (currently, 2486 models have the safetensors tag):
model | safetensors | params |
---|---|---|
bert-base-uncased | single-file | { 'F32' => 110106428 } |
jonatasgrosman/wav2vec2-large-xlsr-53-english | single-file | { 'F32' => 315472545 } |
gpt2 | single-file | { 'F32' => 137022720 } |
xlm-roberta-base | single-file | { 'F32' => 278885778 } |
roberta-base | single-file | { 'F32' => 124697433, 'I64' => 514 } |
distilbert-base-uncased | single-file | { 'F32' => 66985530 } |
t5-base | single-file | { 'F32' => 222903936 } |
xlm-roberta-large | single-file | { 'F32' => 561192082 } |
bert-base-multilingual-cased | single-file | { 'F32' => 178566653 } |
bert-base-cased | single-file | { 'F32' => 108932934 } |
distilroberta-base | single-file | { 'F32' => 82760793 } |
albert-base-v2 | single-file | { 'F32' => 11842272 } |
roberta-large | single-file | { 'F32' => 355412057, 'I64' => 514 } |
distilbert-base-uncased-finetuned-sst-2-english | single-file | { 'F32' => 66955010 } |
facebook/bart-large-mnli | single-file | { 'F32' => 407344133 } |
t5-small | single-file | { 'F32' => 60506880 } |
deepset/roberta-base-squad2 | single-file | { 'F32' => 124056578, 'I64' => 514 } |
distilbert-base-multilingual-cased | single-file | { 'F32' => 135445755 } |
bigscience/bloom-560m | single-file | { 'F16' => 559214592 } |
bert-base-chinese | single-file | { 'F32' => 102882442 } |
distilgpt2 | single-file | { 'F32' => 88204032 } |
camembert-base | single-file | { 'F32' => 111246085 } |
Jean-Baptiste/camembert-ner | single-file | { 'F32' => 110035205, 'I64' => 514 } |
bert-large-uncased | single-file | { 'F32' => 336226108 } |
gpt2-medium | single-file | { 'F32' => 379988992 } |
cambridgeltl/SapBERT-from-PubMedBERT-fulltext | single-file | { 'I64' => 512, 'F32' => 109482240 } |
facebook/bart-base | single-file | { 'F32' => 139420416 } |
bert-large-uncased-whole-word-masking-finetuned-squad | single-file | { 'F32' => 335143938 } |
distilbert-base-uncased-distilled-squad | single-file | { 'F32' => 66364418 } |
gpt2-large | single-file | { 'F32' => 811778816 } |
mrm8488/t5-base-finetuned-common_gen | single-file | { 'F32' => 296926848 } |
openai-gpt | single-file | { 'F32' => 119680512 } |
t5-large | single-file | { 'F32' => 737668608 } |
d4data/biomedical-ner-all | single-file | { 'F32' => 66427476 } |
distilbert-base-cased-distilled-squad | single-file | { 'F32' => 65192450 } |
Jean-Baptiste/roberta-large-ner-english | single-file | { 'I64' => 514, 'F32' => 354315269 } |
prompthero/openjourney | single-file | { 'F32' => 123060480, 'I64' => 77 } |
GanjinZero/UMLSBert_ENG | single-file | { 'I64' => 512, 'F32' => 109482240 } |
google/flan-t5-base | single-file | { 'F32' => 247577856 } |
google/flan-t5-large | single-file | { 'F32' => 783150080 } |
roberta-base-openai-detector | single-file | { 'F32' => 125237762 } |
mrm8488/t5-base-finetuned-summarize-news | single-file | { 'F32' => 222903936 } |
google/flan-t5-xxl | sharded | { 'F32' => 11266928640 } |
bert-base-multilingual-uncased | single-file | { 'F32' => 168055961 } |
bert-large-cased | single-file | { 'F32' => 334661958 } |
mrm8488/bert-multi-cased-finetuned-xquadv1 | single-file | { 'F32' => 177854978 } |
facebook/wav2vec2-base-960h | single-file | { 'F32' => 94395552 } |
oliverguhr/german-sentiment-bert | single-file | { 'F32' => 109083651 } |
malteos/scincl | single-file | { 'I64' => 512, 'F32' => 109918464 } |
Dizex/InstaFoodRoBERTa-NER | single-file | { 'I64' => 514, 'F32' => 124058115 } |
bert-large-uncased-whole-word-masking | single-file | { 'F32' => 336226108 } |
ltg/norbert2 | single-file | { 'I64' => 512, 'F32' => 125164986 } |
shahrukhx01/question-vs-statement-classifier | single-file | { 'I64' => 512, 'F32' => 11171074 } |
facebook/esm2_t6_8M_UR50D | single-file | { 'I64' => 1026, 'F32' => 7840842 } |
pszemraj/flan-t5-large-grammar-synthesis | single-file | { 'F32' => 783150080 } |
bigscience/bloomz-560m | single-file | { 'F16' => 559214592 } |
roberta-large-mnli | single-file | { 'F32' => 356412419 } |
Gustavosta/MagicPrompt-Stable-Diffusion | single-file | { 'F32' => 124439808, 'U8' => 12582912 } |
human-centered-summarization/financial-summarization-pegasus | single-file | { 'F32' => 568796007 } |
finiteautomata/beto-emotion-analysis | single-file | { 'I64' => 512, 'F32' => 109859335 } |
voidful/albert_chinese_small | single-file | { 'F32' => 4812936 } |
mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis | single-file | { 'I64' => 514, 'F32' => 82120707 } |
mrm8488/t5-base-finetuned-question-generation-ap | single-file | { 'F32' => 296926848 } |
nbroad/ESG-BERT | single-file | { 'I64' => 512, 'F32' => 109502234 } |
impira/layoutlm-document-qa | single-file | { 'I64' => 514, 'F32' => 127792898 } |
bert-base-german-cased | single-file | { 'F32' => 109705010 } |
aubmindlab/bert-base-arabert | single-file | { 'F32' => 135851010 } |
deepset/tinyroberta-squad2 | single-file | { 'I64' => 514, 'F32' => 81529346 } |
albert-base-v1 | single-file | { 'F32' => 11842272 } |
beomi/kcbert-base | single-file | { 'F32' => 109542194 } |
Babelscape/wikineural-multilingual-ner | single-file | { 'I64' => 512, 'F32' => 177269769 } |
rinna/japanese-gpt-1b | single-file | { 'F16' => 1327878144 } |
setu4993/LaBSE | single-file | { 'I64' => 512, 'F32' => 470926848 } |
bigscience/bloom-1b1 | single-file | { 'F16' => 1065314304 } |
sagorsarker/bangla-bert-base | single-file | { 'F32' => 165092235 } |
pszemraj/grammar-synthesis-small | single-file | { 'F32' => 76961152 } |
vicgalle/xlm-roberta-large-xnli-anli | single-file | { 'I64' => 514, 'F32' => 559893507 } |
typeform/distilbert-base-uncased-mnli | single-file | { 'F32' => 66955779 } |
distilbert-base-german-cased | single-file | { 'F32' => 67431550 } |
EleutherAI/gpt-neox-20b | sharded | { 'F16' => 20554568208, 'U8' => 184549376 } |
bigscience/bloom | sharded | { 'BF16' => 176247271424 } |
bigscience/bloom-3b | single-file | { 'F16' => 3002557440 } |
wavymulder/Analog-Diffusion | error | model id does not contain safetensors weights |
FredZhang7/distilgpt2-stable-diffusion-v2 | single-file | { 'F32' => 81912576, 'U8' => 6291456 } |
albert-xxlarge-v2 | single-file | { 'F32' => 223180256 } |
cointegrated/rubert-tiny2 | single-file | { 'I64' => 2048, 'F32' => 29376502 } |
KES/T5-KES | single-file | { 'F32' => 222903552 } |
cointegrated/LaBSE-en-ru | single-file | { 'I64' => 512, 'F32' => 128993837 } |
knkarthick/MEETING_SUMMARY | single-file | { 'F32' => 406340696 } |
rinna/japanese-roberta-base | single-file | { 'I64' => 514, 'F32' => 110652416 } |
xlm-clm-ende-1024 | single-file | { 'F32' => 208673979 } |
oliverguhr/spelling-correction-english-base | single-file | { 'F32' => 139470681 } |
lidiya/bart-large-xsum-samsum | single-file | { 'F32' => 406340696 } |
dominguesm/bert-restore-punctuation-ptbr | single-file | { 'I64' => 512, 'F32' => 108344079 } |
patrickjohncyh/fashion-clip | single-file | { 'I64' => 127, 'F32' => 151277312 } |
mrm8488/bert-spanish-cased-finetuned-pos-16-tags | single-file | { 'F32' => 109863953 } |
MoritzLaurer/mDeBERTa-v3-base-mnli-xnli | single-file | { 'I64' => 512, 'F16' => 278811651 } |
blanchefort/rubert-base-cased-sentiment-rusentiment | single-file | { 'I64' => 512, 'F32' => 177855747 } |
elastic/distilbert-base-cased-finetuned-conll03-english | single-file | { 'F32' => 65197833 } |
cointegrated/rubert-tiny-toxicity | single-file | { 'I64' => 512, 'F32' => 11785733 } |
@julien-c How is the canonical order of tensors reconstructed, as seen via huggingface.co/gpt2?show_tensors=true?
The above example shows the first two tensor names don't follow lexicographical order (as intended), whereas the response returns the safetensors layout, which is not in that order... so does this mean the information can be retrieved / exists somewhere programmatically!? 🙏🏼
@sparverius that's a question for @mishig25 who implemented it, but yeah we have a few heuristics we use to order the layers on the frontend side – while the API exposes the logical on-disk order of the safetensors file (we had a lot of debate about this 🤣)
We can share some pseudo-code to demonstrate what we're doing on the frontend side maybe.
> but yeah we have a few heuristics we use to order the layers on the frontend side – while the API exposes the logical on-disk order of the safetensors file (we had a lot of debate about this 🤣)

Interesting, what were the main takeaways?

> We can share some pseudo-code to demonstrate what we're doing on the frontend side maybe.

That would be awesome, thank you!
Thanks to the safetensors format, I have been working on a little side project building on this vision of summarizing/representing a given concrete model architecture textually and visually... hoping the results will be useful for comparing/diffing models side by side, or for getting insight at a glance into models for a given task 🎨
@sparverius here is the heuristic to order the layers (a sketch implementing these rules follows below):
1. Split a layer name. The separators are [".", "-", "_"]. Example: h.0.attn.c_proj.bias -> ["h", 0, "attn", "c_proj", "bias"]
2. Compare the parsed layer names element-wise. If the current elements are strings, compare lexicographically; if they are numbers, compare numerically. Example: ["h", 0, "attn", "c_proj", "bias"] will be ordered before ["h", 1, "attn", "c_proj", "bias"] because 0 < 1 in their second elements.
3. Use the heuristic names/regexes below (copied mostly from the transformers naming convention) to "overwrite" the lexicographical order:
```js
const REGEX_FIRST_LAYERS = /(embed|wte|wpe|shared)/i;
const REGEX_LAST_LAYERS = /(head|classifier)/i;
/*
Rules for comparing ParsedTensorInfo objects.
Examples:
* h.2.attn.c_proj.bias should order lower than h.11.attn.c_proj.bias because h.2 < h.11
* embedding.layer should order lower than h.2.attn.c_proj.bias because there is the special substring "embedding"
*/
```
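Putting the three rules together, a minimal comparator sketch could look like this (the function names are assumptions for this sketch; the actual frontend code may differ):

```js
const REGEX_FIRST_LAYERS = /(embed|wte|wpe|shared)/i; // rule 3, from above
const REGEX_LAST_LAYERS = /(head|classifier)/i;

// Rule 1: split a layer name on ".", "-", "_", turning numeric parts into numbers.
function parseLayerName(name) {
  return name
    .split(/[.\-_]/)
    .map((part) => (/^\d+$/.test(part) ? Number(part) : part));
}

function compareLayerNames(a, b) {
  // Rule 3: embeddings sort first, heads/classifiers sort last.
  const rank = (name) =>
    REGEX_FIRST_LAYERS.test(name) ? -1 : REGEX_LAST_LAYERS.test(name) ? 1 : 0;
  if (rank(a) !== rank(b)) return rank(a) - rank(b);

  // Rule 2: numbers compare numerically, strings lexicographically.
  const pa = parseLayerName(a);
  const pb = parseLayerName(b);
  for (let i = 0; i < Math.min(pa.length, pb.length); i++) {
    if (pa[i] === pb[i]) continue;
    if (typeof pa[i] === "number" && typeof pb[i] === "number") {
      return pa[i] - pb[i];
    }
    return String(pa[i]) < String(pb[i]) ? -1 : 1;
  }
  return pa.length - pb.length;
}

// Usage: tensorNames.sort(compareLayerNames) orders h.2.attn.c_proj.bias
// before h.11.attn.c_proj.bias, and embeddings before everything else.
```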
> summarizing/representing a given concrete model architecture textually and visually

This sounds super interesting, I'm sure many people would be interested in this!
Thanks @mishig25! Interesting, are there any other existing efforts to catalog different architectures?
@julien-c thanks, I hope it will be useful!
Being able to retrieve model information from a safetensors checkpoint shows the beauty & transparency of the format, and was thankfully the inspiration for this side project 🤗 ...
I'm cautious about depending on this at a more widespread level, since it costs precious requests on the HF server side (a few requests for larger sharded models, think HuggingFaceM4/idefics-80b or even tiiuae/falcon-180B), and there's the whole ordering issue, even though it might be good advertisement for safetensors...
Some thoughts from what I've been running into: for one, utilizing config.json seems helpful for ordering.
Perhaps discussion of this side project is better suited elsewhere, though. I am wondering if a community effort in cataloging model summaries might be the best way... all thoughts welcome 🤗 !
How can one know the concrete architecture of a model at a glance without grokking the paper or the source code, or ultimately loading it into memory?
```
{
  "deberta": {
    "class": "DebertaV2Model",
    "embeddings": {
      "class": "DebertaV2Embeddings",
      "position_ids": "[1, 512]",
      "word_embeddings": {
        "class": "Embedding",
        "weight": "[128100, 768]",
      },
      ...
    },
    "encoder": {
      "class": "DebertaV2Encoder",
      "layer": {
        "class": "ModuleList",
        "N": {
          "class": "DebertaV2Layer",
          "attention": {
            "class": "DebertaV2Attention",
            "self": {
              "class": "DisentangledSelfAttention",
              "query_proj": {
                "class": "Linear",
                "weight": "[768, 768]",  # deberta.encoder.layer.N.attention.self.query_proj.weight
                "bias": "[768]",  # deberta.encoder.layer.N.attention.self.query_proj.bias
              },
              "key_proj": { "class": "Linear", ... },
              ...
              "pos_dropout": { "class": "StableDropout" },
              "dropout": { "class": "StableDropout" }
            },
            "output": {
              "class": "DebertaV2SelfOutput",
              "dense": {
                "class": "Linear",
                "weight": "[768, 768]",  # deberta.encoder.layer.N.attention.output.dense.weight
                ...
              }
              ...
            }
          },
          "intermediate": {
            "class": "DebertaV2Intermediate",
            "dense": { ... },
            "intermediate_act_fn": { "class": "GELUActivation" }
          },
          ...
        },
        ...
      },
      "rel_embeddings": {
        "class": "Embedding",
        "weight": "[512, 768]",
      },
      "LayerNorm": { ... }
    }
  },
  "pooler": { "class": "ContextPooler", ... },
  "classifier": { ... },
  "dropout": { "class": "StableDropout" }
}
```
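Part of such a summary is recoverable from the checkpoint alone. Here's a hedged sketch (illustrative, not an existing API) that nests the flat tensor names of a parsed safetensors header into a tree like the one above; note that module class names are not stored in the checkpoint, so recovering those still requires the config or the modeling source:

```js
// Nest flat safetensors tensor names ("deberta.encoder.layer.0.....weight")
// into a tree; only shapes appear at the leaves.
function nestTensorNames(header) {
  const root = {};
  for (const [name, info] of Object.entries(header)) {
    if (name === "__metadata__") continue; // not a tensor entry
    const parts = name.split(".");
    let node = root;
    for (const part of parts.slice(0, -1)) node = node[part] ??= {};
    node[parts[parts.length - 1]] = `[${info.shape.join(", ")}]`;
  }
  return root;
}
```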
Thinking of creating a repo & associated pip package:
- allowing users to pip install the dependency
- something like: it runs after saving a checkpoint / before push_to_hub (or in-process before converting to safetensors)
- runs a souped-up model summary
- sends a pull request to the repo OR pushes the file to the Hub?
- 🤩 just for fun: show lovely CLI graphics (think starship.rs) because huggingface is fun!

Outcomes:
- A catalog / repo / central place hosting model summaries
Great summary, @sparverius!!
> A catalog / repo / central place hosting model summaries
IMO the best would be to place each model summary into its model repo on the HF Hub (through a Pull request or Discussion)
Also makes me think a bit of https://huggingface.co/spaces/hf-accelerate/model-memory-usage by @muellerzr
@mishig25 do you remember if we had noted somewhere public some of our thoughts about how to encode model architecture into an easy to use file format?
> Great summary, @sparverius!!

Thanks!

> IMO the best would be to place each model summary into its model repo on the HF Hub (through a Pull request or Discussion)

Good point, that would be the most accessible.

> @mishig25 do you remember if we had noted somewhere public some of our thoughts about how to encode model architecture into an easy to use file format?

If you have that to share, that would be awesome!
I was thinking of something simple, a format similar to the safetensors metadata, encoding intermediate modules as their class names, for example:
```
{
"model": "Model",
"model.embed_tokens": "Embedding",
"model.embed_tokens.weight": {"shape": [32000, 4096], "dtype": "float16"},
"model.layers": "ModuleList",
"model.layers.0": "DecoderLayer",
"model.layers.0.self_attn": "Attention",
"model.layers.0.self_attn.rotary_emb": "RotaryEmbedding",
"model.layers.0.self_attn.rotary_emb.inv_freq": {"shape": [64], "dtype": "float32"},
"model.layers.0.self_attn.rotary_emb.cos_cached": {"shape": [1, 1, 4096, 128], "dtype": "float16"},
"model.layers.0.self_attn.rotary_emb.sin_cached": {"shape": [1, 1, 4096, 128], "dtype": "float16"},
"model.layers.0.self_attn.k_proj": "QuantLinear",
"model.layers.0.self_attn.k_proj.qweight": {"shape": [512, 4096], "dtype": "int32"},
"model.layers.0.self_attn.k_proj.qzeros": {"shape": [32, 512], "dtype": "int32"},
"model.layers.0.self_attn.k_proj.scales": {"shape": [32, 4096], "dtype": "float16"},
"model.layers.0.self_attn.k_proj.g_idx": {"shape": [4096], "dtype": "int32"},
"model.layers.0.self_attn.k_proj.bias": {"shape": [4096], "dtype": "float16"},
...
"model.layers.0.mlp": "MLP",
"model.layers.0.mlp.act_fn": "SiLUActivation",
...
"model.layers.0.mlp.up_proj": "QuantLinear",
"model.layers.0.mlp.up_proj.qweight": {"shape": [512, 11008], "dtype": "int32"},
"model.layers.0.mlp.up_proj.qzeros": {"shape": [32, 1376], "dtype": "int32"},
"model.layers.0.mlp.up_proj.scales": {"shape": [32, 11008], "dtype": "float16"},
"model.layers.0.mlp.up_proj.g_idx": {"shape": [4096], "dtype": "int32"},
"model.layers.0.mlp.up_proj.bias": {"shape": [11008], "dtype": "float16"},
"model.layers.0.input_layernorm": "RMSNorm",
"model.layers.0.input_layernorm.weight": {"shape": [4096], "dtype": "float16"},
"model.layers.0.post_attention_layernorm": "RMSNorm",
"model.layers.0.post_attention_layernorm.weight": {"shape": [4096], "dtype": "float16"},
...
}
```
This has the advantage of being cross-checkable against safetensors, complements the safetensors metadata with a bit of added info, and can easily be JSON-ized...
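To illustrate the cross-check, here's a minimal sketch (all names are hypothetical, not an existing API) comparing such a summary against the tensor entries of a parsed safetensors header:

```js
// safetensors headers use short dtype tags ("F16"); the summary above uses
// torch-style names ("float16"). Map between the two (assumed subset).
const DTYPE_ALIASES = {
  float32: "F32", float16: "F16", bfloat16: "BF16",
  int64: "I64", int32: "I32", uint8: "U8",
};

// Compare leaf entries ({shape, dtype}) of the summary against the header.
// Plain-string entries are module class names and are skipped; buffers that
// are not serialized (e.g. cos_cached) will show up as missing.
function crossCheck(summary, header) {
  const mismatches = [];
  for (const [key, value] of Object.entries(summary)) {
    if (typeof value === "string") continue; // class name, no tensor to check
    const tensor = header[key];
    if (!tensor) {
      mismatches.push(`${key}: in summary but not in checkpoint`);
    } else if (
      tensor.dtype !== DTYPE_ALIASES[value.dtype] ||
      tensor.shape.length !== value.shape.length ||
      tensor.shape.some((dim, i) => dim !== value.shape[i])
    ) {
      mismatches.push(`${key}: shape/dtype differ from checkpoint`);
    }
  }
  return mismatches;
}
```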
> @mishig25 do you remember if we had noted somewhere public some of our thoughts about how to encode model architecture into an easy to use file format?

There was no public discussion. Internally, you posted:

> Can ONNX refer to external weights, i.e. for instance could an ONNX file only represent the computation graph, but point to a safetensors file for the actual weights? (maybe through an extension)
btw @sparverius I assume you've seen this doc page https://huggingface.co/docs/safetensors/metadata_parsing ?
@julien-c if I may ask, what is the method to calculate the parameter count of a model? I am thinking of maybe creating a script to detect mismatches between the parameter count and the model name. I know regex well, so the model name is no problem, but I am currently stuck on calculating the parameter size. Maybe I can find a way to clean up the mess on the huggingface open llm leaderboard.
@ThiloteE we just sum the number of parameters in all the tensors
There's also a python implementation in case it is more readable: https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/utils/_safetensors.py
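Concretely, given a parsed safetensors header, that sum boils down to something like this minimal sketch (for sharded models, you'd sum over every shard's header):

```js
// Sum, per dtype, the product of each tensor's shape.
function paramsPerDtype(header) {
  const counts = {};
  for (const [name, info] of Object.entries(header)) {
    if (name === "__metadata__") continue; // not a tensor entry
    const numel = info.shape.reduce((acc, dim) => acc * dim, 1);
    counts[info.dtype] = (counts[info.dtype] ?? 0) + numel;
  }
  return counts; // e.g. { F32: 110106428 } for bert-base-uncased
}
```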
In this branch: https://github.com/huggingface/safetensors/compare/julien-c/js I pushed a proof-of-concept of how, given the simplicity of the format, one can fetch metadata about the weights over small (Range) HTTP requests.
The code is JS (can run in browsers or in Node) but it would be similar in any language.
Here's an example of how to fetch the header of a single file, where a `FileHeader` type is defined as sketched below.
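The original snippets from the branch aren't inlined here, but based on the documented safetensors layout (an 8-byte little-endian length prefix followed by a JSON header table), a hedged reconstruction could look like this; the exact definitions in the branch may differ:

```js
// Plausible FileHeader shape (hypothetical; see the branch for the real one):
// tensor name -> { dtype, shape, data_offsets }, plus optional __metadata__.
// type FileHeader = Record<string, {
//   dtype: "F64" | "F32" | "F16" | "BF16" | "I64" | "I32" | "I16" | "I8" | "U8" | "BOOL";
//   shape: number[];
//   data_offsets: [number, number]; // [begin, end) within the byte buffer
// }> & { __metadata__?: Record<string, string> };

async function fetchFileHeader(url) {
  // Request 1: the first 8 bytes give the header length (little-endian u64).
  const prefix = await (
    await fetch(url, { headers: { Range: "bytes=0-7" } })
  ).arrayBuffer();
  const headerLength = Number(new DataView(prefix).getBigUint64(0, true));

  // Request 2: the JSON header itself.
  const headerBytes = await (
    await fetch(url, { headers: { Range: `bytes=8-${7 + headerLength}` } })
  ).arrayBuffer();
  return JSON.parse(new TextDecoder().decode(headerBytes));
}
```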
Results

As a fun first experiment, I computed the number of params per dtype for all models currently with a safetensors version on the HuggingFace Hub.
Here are the results:
Thought it'd be fun to share! cc @mishig25 @osanseviero too