huggingface / safetensors

Simple, safe way to store and distribute tensors
https://huggingface.co/docs/safetensors
Apache License 2.0

Ability to remotely parse metadata over small HTTP requests #44

Closed julien-c closed 1 year ago

julien-c commented 1 year ago

In this branch: https://github.com/huggingface/safetensors/compare/julien-c/js I pushed a proof-of-concept of how, given the simplicity of the format, one can fetch metadata about the weights over small (Range) HTTP requests.

The code is TypeScript (it can run in browsers or in Node), but it would be similar in any language.

Here's an example of how to fetch the header of a single file:

async function parseSingleFile(url: URL): Promise<FileHeader> {
    // The first 8 bytes of a safetensors file are a little-endian u64
    // holding the byte length of the JSON header.
    const bufLengthOfHeaderLE = await (
        await fetch(url, {
            headers: {
                Range: "bytes=0-7",
            },
        })
    ).arrayBuffer();
    const lengthOfHeader = new DataView(bufLengthOfHeaderLE).getBigUint64(
        0,
        true // little-endian
    );
    // The JSON header immediately follows those 8 bytes.
    const header: FileHeader = await (
        await fetch(url, {
            headers: {
                Range: `bytes=8-${7 + Number(lengthOfHeader)}`,
            },
        })
    ).json();
    // No validation for now; we assume it's a valid FileHeader.
    return header;
}
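
For example (assuming the gpt2 repo still serves a single model.safetensors file and the server honors Range requests):

const header = await parseSingleFile(
    new URL("https://huggingface.co/gpt2/resolve/main/model.safetensors")
);
console.log(Object.keys(header)); // tensor names (plus __metadata__ if present)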

where a FileHeader type is defined as:

type TensorName = string;
type Dtype =
    | "F64"
    | "F32"
    | "F16"
    | "BF16"
    | "I64"
    | "I32"
    | "I16"
    | "I8"
    | "U8"
    | "BOOL";

interface TensorInfo {
    dtype: Dtype;
    shape: number[];
    data_offsets: [number, number];
}

type FileHeader = Record<TensorName, TensorInfo> & {
    __metadata__: Record<string, string>;
};

Results

As a fun first experiment, I computed the number of params per dtype for all models that currently have a safetensors version on the HuggingFace Hub.
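
Each tensor's element count is just the product of its shape, so the per-dtype counts fall out of the header directly. A minimal sketch of that computation (assuming the FileHeader type above; not the exact script used):

function paramsPerDtype(header: FileHeader): Map<Dtype, number> {
    const counts = new Map<Dtype, number>();
    for (const [name, info] of Object.entries(header)) {
        if (name === "__metadata__") continue; // not a tensor entry
        const tensor = info as TensorInfo;
        // Number of elements = product of the shape (1 for scalars).
        const numel = tensor.shape.reduce((a, b) => a * b, 1);
        counts.set(tensor.dtype, (counts.get(tensor.dtype) ?? 0) + numel);
    }
    return counts;
}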

Here are the results:

| model | safetensors | params |
| --- | --- | --- |
| gpt2 | single-file | { 'F32' => 137022720 } |
| roberta-base | single-file | { 'F32' => 124697433, 'I64' => 514 } |
| Jean-Baptiste/camembert-ner | single-file | { 'F32' => 110035205, 'I64' => 514 } |
| roberta-large | single-file | { 'F32' => 355412057, 'I64' => 514 } |
| bigscience/bloom-560m | single-file | { 'F16' => 559214592 } |
| hf-internal-testing/tiny-random-bert-safetensors | single-file | { 'F32' => 127463, 'I64' => 512 } |
| hf-internal-testing/tiny-random-bert-sharded-safetensors | index-file | { 'F32' => 87929, 'I64' => 512 } |
| Narsil/small3 | index-file | { 'F32' => 59159, 'I64' => 512 } |
| Narsil/small2 | single-file | { 'F32' => 59159, 'I64' => 512 } |
| hf-internal-testing/tiny-random-bert-safetensors-tf | single-file | { 'F32' => 87929 } |

Thought it'd be fun to share! cc @mishig25 @osanseviero too

Narsil commented 1 year ago

Super nice!

Actually, I just thought: for the initial read, you could probably issue the first request directly for the first 100kB or so, and refetch only if needed. This would avoid making two network calls in most settings (the 100kB is adjustable).

Just an optimization that might be worthwhile in production.

julien-c commented 1 year ago

Yes! I thought of that optimization too @Narsil. I'll probably implement it in a v2.
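
A minimal sketch of what that v2 could look like (the 100kB initial guess is an adjustable assumption, not shipped code):

async function parseSingleFileEager(
    url: URL,
    guess = 100_000
): Promise<FileHeader> {
    // Speculatively fetch the first `guess` bytes in a single request.
    const buf = await (
        await fetch(url, { headers: { Range: `bytes=0-${guess - 1}` } })
    ).arrayBuffer();
    const lengthOfHeader = Number(new DataView(buf).getBigUint64(0, true));
    if (8 + lengthOfHeader <= buf.byteLength) {
        // The whole header fit inside the first request: no second call.
        return JSON.parse(
            new TextDecoder().decode(new Uint8Array(buf, 8, lengthOfHeader))
        );
    }
    // Otherwise fetch only the missing tail and stitch both parts together.
    const rest = await (
        await fetch(url, {
            headers: { Range: `bytes=${buf.byteLength}-${7 + lengthOfHeader}` },
        })
    ).arrayBuffer();
    const headerBytes = new Uint8Array(lengthOfHeader);
    headerBytes.set(new Uint8Array(buf, 8));
    headerBytes.set(new Uint8Array(rest), buf.byteLength - 8);
    return JSON.parse(new TextDecoder().decode(headerBytes));
}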

julien-c commented 1 year ago

Update for the top 100 most downloaded models (currently 2486 models have the safetensors tag):

| model | safetensors | params |
| --- | --- | --- |
| bert-base-uncased | single-file | { 'F32' => 110106428 } |
| jonatasgrosman/wav2vec2-large-xlsr-53-english | single-file | { 'F32' => 315472545 } |
| gpt2 | single-file | { 'F32' => 137022720 } |
| xlm-roberta-base | single-file | { 'F32' => 278885778 } |
| roberta-base | single-file | { 'F32' => 124697433, 'I64' => 514 } |
| distilbert-base-uncased | single-file | { 'F32' => 66985530 } |
| t5-base | single-file | { 'F32' => 222903936 } |
| xlm-roberta-large | single-file | { 'F32' => 561192082 } |
| bert-base-multilingual-cased | single-file | { 'F32' => 178566653 } |
| bert-base-cased | single-file | { 'F32' => 108932934 } |
| distilroberta-base | single-file | { 'F32' => 82760793 } |
| albert-base-v2 | single-file | { 'F32' => 11842272 } |
| roberta-large | single-file | { 'F32' => 355412057, 'I64' => 514 } |
| distilbert-base-uncased-finetuned-sst-2-english | single-file | { 'F32' => 66955010 } |
| facebook/bart-large-mnli | single-file | { 'F32' => 407344133 } |
| t5-small | single-file | { 'F32' => 60506880 } |
| deepset/roberta-base-squad2 | single-file | { 'F32' => 124056578, 'I64' => 514 } |
| distilbert-base-multilingual-cased | single-file | { 'F32' => 135445755 } |
| bigscience/bloom-560m | single-file | { 'F16' => 559214592 } |
| bert-base-chinese | single-file | { 'F32' => 102882442 } |
| distilgpt2 | single-file | { 'F32' => 88204032 } |
| camembert-base | single-file | { 'F32' => 111246085 } |
| Jean-Baptiste/camembert-ner | single-file | { 'F32' => 110035205, 'I64' => 514 } |
| bert-large-uncased | single-file | { 'F32' => 336226108 } |
| gpt2-medium | single-file | { 'F32' => 379988992 } |
| cambridgeltl/SapBERT-from-PubMedBERT-fulltext | single-file | { 'I64' => 512, 'F32' => 109482240 } |
| facebook/bart-base | single-file | { 'F32' => 139420416 } |
| bert-large-uncased-whole-word-masking-finetuned-squad | single-file | { 'F32' => 335143938 } |
| distilbert-base-uncased-distilled-squad | single-file | { 'F32' => 66364418 } |
| gpt2-large | single-file | { 'F32' => 811778816 } |
| mrm8488/t5-base-finetuned-common_gen | single-file | { 'F32' => 296926848 } |
| openai-gpt | single-file | { 'F32' => 119680512 } |
| t5-large | single-file | { 'F32' => 737668608 } |
| d4data/biomedical-ner-all | single-file | { 'F32' => 66427476 } |
| distilbert-base-cased-distilled-squad | single-file | { 'F32' => 65192450 } |
| Jean-Baptiste/roberta-large-ner-english | single-file | { 'I64' => 514, 'F32' => 354315269 } |
| prompthero/openjourney | single-file | { 'F32' => 123060480, 'I64' => 77 } |
| GanjinZero/UMLSBert_ENG | single-file | { 'I64' => 512, 'F32' => 109482240 } |
| google/flan-t5-base | single-file | { 'F32' => 247577856 } |
| google/flan-t5-large | single-file | { 'F32' => 783150080 } |
| roberta-base-openai-detector | single-file | { 'F32' => 125237762 } |
| mrm8488/t5-base-finetuned-summarize-news | single-file | { 'F32' => 222903936 } |
| google/flan-t5-xxl | sharded | { 'F32' => 11266928640 } |
| bert-base-multilingual-uncased | single-file | { 'F32' => 168055961 } |
| bert-large-cased | single-file | { 'F32' => 334661958 } |
| mrm8488/bert-multi-cased-finetuned-xquadv1 | single-file | { 'F32' => 177854978 } |
| facebook/wav2vec2-base-960h | single-file | { 'F32' => 94395552 } |
| oliverguhr/german-sentiment-bert | single-file | { 'F32' => 109083651 } |
| malteos/scincl | single-file | { 'I64' => 512, 'F32' => 109918464 } |
| Dizex/InstaFoodRoBERTa-NER | single-file | { 'I64' => 514, 'F32' => 124058115 } |
| bert-large-uncased-whole-word-masking | single-file | { 'F32' => 336226108 } |
| ltg/norbert2 | single-file | { 'I64' => 512, 'F32' => 125164986 } |
| shahrukhx01/question-vs-statement-classifier | single-file | { 'I64' => 512, 'F32' => 11171074 } |
| facebook/esm2_t6_8M_UR50D | single-file | { 'I64' => 1026, 'F32' => 7840842 } |
| pszemraj/flan-t5-large-grammar-synthesis | single-file | { 'F32' => 783150080 } |
| bigscience/bloomz-560m | single-file | { 'F16' => 559214592 } |
| roberta-large-mnli | single-file | { 'F32' => 356412419 } |
| Gustavosta/MagicPrompt-Stable-Diffusion | single-file | { 'F32' => 124439808, 'U8' => 12582912 } |
| human-centered-summarization/financial-summarization-pegasus | single-file | { 'F32' => 568796007 } |
| finiteautomata/beto-emotion-analysis | single-file | { 'I64' => 512, 'F32' => 109859335 } |
| voidful/albert_chinese_small | single-file | { 'F32' => 4812936 } |
| mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis | single-file | { 'I64' => 514, 'F32' => 82120707 } |
| mrm8488/t5-base-finetuned-question-generation-ap | single-file | { 'F32' => 296926848 } |
| nbroad/ESG-BERT | single-file | { 'I64' => 512, 'F32' => 109502234 } |
| impira/layoutlm-document-qa | single-file | { 'I64' => 514, 'F32' => 127792898 } |
| bert-base-german-cased | single-file | { 'F32' => 109705010 } |
| aubmindlab/bert-base-arabert | single-file | { 'F32' => 135851010 } |
| deepset/tinyroberta-squad2 | single-file | { 'I64' => 514, 'F32' => 81529346 } |
| albert-base-v1 | single-file | { 'F32' => 11842272 } |
| beomi/kcbert-base | single-file | { 'F32' => 109542194 } |
| Babelscape/wikineural-multilingual-ner | single-file | { 'I64' => 512, 'F32' => 177269769 } |
| rinna/japanese-gpt-1b | single-file | { 'F16' => 1327878144 } |
| setu4993/LaBSE | single-file | { 'I64' => 512, 'F32' => 470926848 } |
| bigscience/bloom-1b1 | single-file | { 'F16' => 1065314304 } |
| sagorsarker/bangla-bert-base | single-file | { 'F32' => 165092235 } |
| pszemraj/grammar-synthesis-small | single-file | { 'F32' => 76961152 } |
| vicgalle/xlm-roberta-large-xnli-anli | single-file | { 'I64' => 514, 'F32' => 559893507 } |
| typeform/distilbert-base-uncased-mnli | single-file | { 'F32' => 66955779 } |
| distilbert-base-german-cased | single-file | { 'F32' => 67431550 } |
| EleutherAI/gpt-neox-20b | sharded | { 'F16' => 20554568208, 'U8' => 184549376 } |
| bigscience/bloom | sharded | { 'BF16' => 176247271424 } |
| bigscience/bloom-3b | single-file | { 'F16' => 3002557440 } |
| wavymulder/Analog-Diffusion | error | model id does not contain safetensors weights |
| FredZhang7/distilgpt2-stable-diffusion-v2 | single-file | { 'F32' => 81912576, 'U8' => 6291456 } |
| albert-xxlarge-v2 | single-file | { 'F32' => 223180256 } |
| cointegrated/rubert-tiny2 | single-file | { 'I64' => 2048, 'F32' => 29376502 } |
| KES/T5-KES | single-file | { 'F32' => 222903552 } |
| cointegrated/LaBSE-en-ru | single-file | { 'I64' => 512, 'F32' => 128993837 } |
| knkarthick/MEETING_SUMMARY | single-file | { 'F32' => 406340696 } |
| rinna/japanese-roberta-base | single-file | { 'I64' => 514, 'F32' => 110652416 } |
| xlm-clm-ende-1024 | single-file | { 'F32' => 208673979 } |
| oliverguhr/spelling-correction-english-base | single-file | { 'F32' => 139470681 } |
| lidiya/bart-large-xsum-samsum | single-file | { 'F32' => 406340696 } |
| dominguesm/bert-restore-punctuation-ptbr | single-file | { 'I64' => 512, 'F32' => 108344079 } |
| patrickjohncyh/fashion-clip | single-file | { 'I64' => 127, 'F32' => 151277312 } |
| mrm8488/bert-spanish-cased-finetuned-pos-16-tags | single-file | { 'F32' => 109863953 } |
| MoritzLaurer/mDeBERTa-v3-base-mnli-xnli | single-file | { 'I64' => 512, 'F16' => 278811651 } |
| blanchefort/rubert-base-cased-sentiment-rusentiment | single-file | { 'I64' => 512, 'F32' => 177855747 } |
| elastic/distilbert-base-cased-finetuned-conll03-english | single-file | { 'F32' => 65197833 } |
| cointegrated/rubert-tiny-toxicity | single-file | { 'I64' => 512, 'F32' => 11785733 } |
sparverius commented 9 months ago

@julien-c How is the canonical order of tensors reconstructed, as seen here via huggingface.co/gpt2?show_tensors=true?

[Screenshot: the ordered tensor list shown for gpt2 on the Hub]

The above example shows the first two tensor names not following lexicographical order (as intended), whereas the API response returns the safetensors layout, which is not in that order... so does this mean the ordering information exists somewhere and can be retrieved programmatically!? 🙏🏼

julien-c commented 9 months ago

@sparverius that's a question for @mishig25 who implemented it, but yeah we have a few heuristics we use to order the layers on the frontend side – while the API exposes the logical on-disk order of the safetensors file (we had a lot of debate about this 🤣)

We can share some pseudo-code to demonstrate what we're doing on the frontend side maybe.

sparverius commented 9 months ago

> but yeah we have a few heuristics we use to order the layers on the frontend side – while the API exposes the logical on-disk order of the safetensors file (we had a lot of debate about this 🤣)

Interesting, what were the main takeaways?

> We can share some pseudo-code to demonstrate what we're doing on the frontend side maybe.

That would be awesome, thank you!

Thanks to the safetensors format, I have been working on a little side project building on this vision of summarizing/representing a given concrete model architecture textually and visually... hoping the results will make it possible to compare/diff models side by side, or to gain insight at a glance into models for a given task 🎨

mishig25 commented 9 months ago

@sparverius here is the heuristic to order the layers:

1. Split the layer name. The splitters/separators are [".", "-", "_"]. Example: h.0.attn.c_proj.bias -> ["h", 0, "attn", "c_proj", "bias"]
2. Compare the split layer names element by element. If the current elements are strings, compare them lexicographically; if they are numbers, compare them numerically. Ex: ["h", 0, "attn", "c_proj", "bias"] will order before ["h", 1, "attn", "c_proj", "bias"] because 0 < 1 in their second elements.
3. Use the heuristic names/regexes below (copied mostly from the transformers naming convention) to "overwrite" the lexicographical order; a sketch of the full comparator follows the code block.

const REGEX_FIRST_LAYERS = /(embed|wte|wpe|shared)/i;
const REGEX_LAST_LAYERS = /(head|classifier)/i;
/*
Rules for comparing ParsedTensorInfo objects.
Examples:
* h.2.attn.c_proj.bias should order lower than h.11.attn.c_proj.bias because 2 < 11
* embedding.layer should order lower than h.2.attn.c_proj.bias because of the special substring "embedding"
*/
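
Put together, a minimal sketch of the comparator these steps describe (hypothetical code, not the actual frontend implementation):

type Part = string | number;

function splitName(name: string): Part[] {
    // Step 1: split on ".", "-", "_" and parse numeric segments.
    return name
        .split(/[.\-_]/)
        .map((p) => (/^\d+$/.test(p) ? Number(p) : p));
}

function compareTensorNames(a: string, b: string): number {
    // Step 3: special first/last layers overwrite the generic order.
    const rank = (n: string) =>
        REGEX_FIRST_LAYERS.test(n) ? -1 : REGEX_LAST_LAYERS.test(n) ? 1 : 0;
    if (rank(a) !== rank(b)) return rank(a) - rank(b);

    // Steps 1-2: compare the split names element by element.
    const pa = splitName(a);
    const pb = splitName(b);
    for (let i = 0; i < Math.min(pa.length, pb.length); i++) {
        const x = pa[i];
        const y = pb[i];
        if (x === y) continue;
        if (typeof x === "number" && typeof y === "number") return x - y;
        return String(x) < String(y) ? -1 : 1;
    }
    return pa.length - pb.length; // shorter names first on a tie
}

// usage: tensorNames.sort(compareTensorNames)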
julien-c commented 9 months ago

> summarizing/representing a given concrete model architecture textually and visually

This sounds super interesting, I'm sure many people would be interested in this.

sparverius commented 9 months ago

Thanks @mishig25! Interesting. Are there any other existing efforts to catalog different architectures?

sparverius commented 9 months ago

@julien-c thanks, I hope it will be useful!

Showing that one can retrieve model information from a safetensors checkpoint demonstrates the beauty & transparency of the format, and it was the inspiration for this side project 🤗 ...

I'm cautious about depending on this at a more widespread level, since it costs precious requests on the HF server side (several for larger sharded models, think HuggingFaceM4/idefics-80b or even tiiuae/falcon-180B), and because of the whole ordering issue, even though it might be good advertisement for safetensors...

Some thoughts from what I've been running into: for one, utilizing config.json seems helpful for ordering.

<details>
<summary>EXPAND for details on how that might be useful for transformers, diffusers, timm models</summary>

### for transformers

Certain heuristics for encoder-only, encoder-decoder, decoder-only ...

Llama-2-7b-hf config.json gives us some insights on shapes, # layers, etc.:

```json
{
  "architectures": ["LlamaForCausalLM"],
  ...
  "hidden_size": 4096,
  ...
  "max_position_embeddings": 4096,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  ...
  "vocab_size": 32000
}
```

What about an architecture that hasn't been pulled into the transformers lib yet? microsoft/phi-1_5 gives an estimate of tensor shapes etc., but doesn't tell us what each layer is composed of unless one looks at the custom code...

```json
{
  ...
  "architecture": {
    ...
    "block_cls": "parallel",
    ...
  },
  "architectures": ["MixFormerSequentialForCausalLM"],
  "auto_map": {
    ...
    "AutoModelForCausalLM": ...
  },
  ...
  "model_type": "mixformer-sequential",
  "n_embd": 2048,
  "n_head": 32,
  "n_inner": null,
  "n_layer": 24,
  "n_positions": 2048,
  ...
  "rotary_dim": 32,
  ...
  "vocab_size": 51200
}
```

Perhaps https://huggingface.co/microsoft/phi-1_5/blob/main/tokenizer_config.json gives us a hint?

```json
{
  ...
  "tokenizer_class": "CodeGenTokenizer",
  ...
}
```

Other interesting cases: [facebook/maskformer-swin-large-coco](https://huggingface.co/facebook/maskformer-swin-large-coco/tree/main)

### for timm

https://huggingface.co/timm/resnet50.a1_in1k/blob/main/config.json

```json
{
  "architecture": "resnet50",
  ...
  "first_conv": "conv1",
  "classifier": "fc",
  ...
}
```

### for diffusers

It gets a bit more complicated: https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/raw/main/model_index.json involves some pointer chasing ☞ ✦ ✧ ❂ 🏆

```json
{
  "_class_name": "StableDiffusionXLPipeline",
  "text_encoder": ["transformers", "CLIPTextModel"],
  ...
  "unet": ["diffusers", "UNet2DConditionModel"],
  "vae": ["diffusers", "AutoencoderKL"]
}
```

### Other Errata

Ambiguities arise: What is Big🐤? 🤔

https://github.com/huggingface/transformers/blob/dcbfd93d7aeb14f8ff08a48866d2a68950d4c69a/templates/adding_a_new_model/open_model_proposals/ADD_BIG_BIRD.md?plain=1#L217-L228

https://github.com/huggingface/transformers/blob/dcbfd93d7aeb14f8ff08a48866d2a68950d4c69a/templates/adding_a_new_model/open_model_proposals/ADD_BIG_BIRD.md?plain=1#L256-L257

</details>
sparverius commented 9 months ago

TL;DR 🤖

Perhaps this discussion of the side project is better suited elsewhere, though. I am wondering if a community effort to catalog model summaries might be the best way forward... all thoughts welcome 🤗!

Problem Statement

How can one know the concrete architecture of a model at a glance, without grokking the paper, the source code, or ultimately loading the model into memory?

Related existing tools:

### Text-based
- model.summary()
- torchinfo/torch-summary: https://pypi.org/project/torchinfo/

### Visual
- tensorboard graphs [example](https://tensorboard.dev/experiment/EDZb7XgKSBKo6Gznh3i8hg/#graphs&run=lr_1E-04%2Cconv%3D1%2Cfc%3D2)
- https://github.com/paulgavrikov/visualkeras
- Torchlens: https://github.com/johnmarktaylor91/torchlens
- pytorchviz: https://github.com/szagoruyko/pytorchviz

Ideas

{
    "deberta": {
        "class": "DebertaV2Model",
        "embeddings": {
            "class": "DebertaV2Embeddings",
            "position_ids": "[1, 512]",
            "word_embeddings": {
                "class": "Embedding",
                "weight": "[128100, 768]",
            },
            ...
        },
        "encoder": {
            "class": "DebertaV2Encoder",
            "layer": {
                "class": "ModuleList",
                "N": {
                    "class": "DebertaV2Layer",
                    "attention": {
                        "class": "DebertaV2Attention",
                        "self": {
                            "class": "DisentangledSelfAttention",
                            "query_proj": {
                                "class": "Linear",
                                "weight": "[768, 768]", # deberta.encoder.layer.N.attention.self.query_proj.weight
                                "bias": "[768]", # deberta.encoder.layer.N.attention.self.query_proj.bias
                            },
                            "key_proj": { "class": "Linear", ... },
                            ... 
                            "pos_dropout": { "class": "StableDropout" },
                            "dropout": { "class": "StableDropout" }
                        },
                        "output": {
                            "class": "DebertaV2SelfOutput",
                            "dense": {
                                "class": "Linear",
                                "weight": "[768, 768]", # deberta.encoder.layer.N.attention.output.dense.weight
                                ...
                            }
                            ...
                        }
                    },
                    "intermediate": {
                        "class": "DebertaV2Intermediate",
                        "dense": { ... },
                        "intermediate_act_fn": { "class": "GELUActivation" }
                    },
                    ...
                },
                ...
            },
           "rel_embeddings": {
                "class": "Embedding",
                "weight": "[512, 768]",
            },
            "LayerNorm": { ... }
        }
    },
    "pooler": { "class": "ContextPooler", ... },
    "classifier": { ... },
    "dropout": { "class": "StableDropout" }
}

Outcomes:

A catalog / repo / central place hosting model summaries

julien-c commented 9 months ago

Great summary, @sparverius!!

> A catalog / repo / central place hosting model summaries

IMO the best would be to place each model summary into its model repo on the HF Hub (through a Pull request or Discussion)

Also makes me think a bit of https://huggingface.co/spaces/hf-accelerate/model-memory-usage by @muellerzr

@mishig25 do you remember if we had noted somewhere public some of our thoughts about how to encode model architecture into an easy-to-use file format?

sparverius commented 9 months ago

> Great summary, @sparverius!!

Thanks!

> IMO the best would be to place each model summary into its model repo on the HF Hub (through a Pull request or Discussion)

Good point, that would be the most accessible.

> @mishig25 do you remember if we had noted somewhere public some of our thoughts about how to encode model architecture into an easy-to-use file format?

If you have that to share, that would be awesome!

I was thinking of something simple, in a similar format to the safetensors metadata, encoding intermediate labels as class names, for example:

{
 "model": "Model",
 "model.embed_tokens": "Embedding",
 "model.embed_tokens.weight": {"shape": [32000, 4096], "dtype": "float16"},
 "model.layers": "ModuleList",
 "model.layers.0": "DecoderLayer",
 "model.layers.0.self_attn": "Attention",
 "model.layers.0.self_attn.rotary_emb": "RotaryEmbedding",
 "model.layers.0.self_attn.rotary_emb.inv_freq": {"shape": [64], "dtype": "float32"},
 "model.layers.0.self_attn.rotary_emb.cos_cached": {"shape": [1, 1, 4096, 128], "dtype": "float16"},
 "model.layers.0.self_attn.rotary_emb.sin_cached": {"shape": [1, 1, 4096, 128], "dtype": "float16"},
 "model.layers.0.self_attn.k_proj": "QuantLinear",
 "model.layers.0.self_attn.k_proj.qweight": {"shape": [512, 4096], "dtype": "int32"},
 "model.layers.0.self_attn.k_proj.qzeros": {"shape": [32, 512], "dtype": "int32"},
 "model.layers.0.self_attn.k_proj.scales": {"shape": [32, 4096], "dtype": "float16"},
 "model.layers.0.self_attn.k_proj.g_idx": {"shape": [4096], "dtype": "int32"},
 "model.layers.0.self_attn.k_proj.bias": {"shape": [4096], "dtype": "float16"},
 ...
 "model.layers.0.mlp": "MLP",
 "model.layers.0.mlp.act_fn": "SiLUActivation",
 ...
 "model.layers.0.mlp.up_proj": "QuantLinear",
 "model.layers.0.mlp.up_proj.qweight": {"shape": [512, 11008], "dtype": "int32"},
 "model.layers.0.mlp.up_proj.qzeros": {"shape": [32, 1376], "dtype": "int32"},
 "model.layers.0.mlp.up_proj.scales": {"shape": [32, 11008], "dtype": "float16"},
 "model.layers.0.mlp.up_proj.g_idx": {"shape": [4096], "dtype": "int32"},
 "model.layers.0.mlp.up_proj.bias": {"shape": [11008], "dtype": "float16"},
 "model.layers.0.input_layernorm": "RMSNorm",
 "model.layers.0.input_layernorm.weight": {"shape": [4096], "dtype": "float16"},
 "model.layers.0.post_attention_layernorm": "RMSNorm",
 "model.layers.0.post_attention_layernorm.weight": {"shape": [4096], "dtype": "float16"},
 ...
}

This has the advantage of being cross-checkable against the safetensors header, it complements the safetensors metadata with a bit of added info, and it can easily be serialized to JSON...
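
A minimal sketch of that cross-check (the Summary type here is the hypothetical format above; it compares shapes only, since dtype names would also need mapping, e.g. "float16" vs "F16"):

type Summary = Record<string, string | { shape: number[]; dtype: string }>;

function crossCheck(summary: Summary, header: FileHeader): string[] {
    const mismatches: string[] = [];
    for (const [name, entry] of Object.entries(summary)) {
        if (typeof entry === "string") continue; // class labels carry no tensor
        const info = header[name] as TensorInfo | undefined;
        if (!info) {
            mismatches.push(`${name}: missing from safetensors header`);
        } else if (info.shape.join() !== entry.shape.join()) {
            mismatches.push(`${name}: shape [${info.shape}] != [${entry.shape}]`);
        }
    }
    return mismatches;
}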

mishig25 commented 9 months ago

> @mishig25 do you remember if we had noted somewhere public some of our thoughts about how to encode model architecture into an easy-to-use file format?

There was no public discussion. Internally, you've posted:

> Can ONNX refer to external weights, i.e. for instance could an ONNX file only represent the computation graph, but point to a safetensors file for the actual weights? (maybe through an extension)

mishig25 commented 9 months ago

btw @sparverius, I assume you've seen this doc page? https://huggingface.co/docs/safetensors/metadata_parsing

ThiloteE commented 1 month ago

@julien-c if I may ask, what is the method to calculate the parameter count of a model? I am thinking of maybe creating a script to detect mismatches between the actual parameter count and the model name. I know regex well, so the model name is no problem, but I am currently stuck on calculating the parameter count. Maybe I can find a way to clean up the mess on the Hugging Face Open LLM Leaderboard.

julien-c commented 1 month ago

@ThiloteE we just sum the number of parameters in all the tensors.
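
Building on the FileHeader type from the top of this thread, a minimal sketch (not the Hub's actual code):

// Total parameter count = sum of each tensor's element count (shape product).
function totalParams(header: FileHeader): number {
    let total = 0;
    for (const [name, info] of Object.entries(header)) {
        if (name === "__metadata__") continue;
        total += (info as TensorInfo).shape.reduce((a, b) => a * b, 1);
    }
    return total;
}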

There's also a Python implementation, in case it is more readable: https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/utils/_safetensors.py