huggingface / safetensors

Simple, safe way to store and distribute tensors
https://huggingface.co/docs/safetensors
Apache License 2.0
2.89k stars · 199 forks

Support complex JSON metadata #157

Closed PlasmaPower closed 9 months ago

PlasmaPower commented 1 year ago

Currently, metadata is limited to a string->string mapping. It would be much more powerful if metadata could be an arbitrary JSON value. This would allow, for instance, integers and nested dictionaries in metadata. This could be safely implemented by having the metadata field's type in Rust be a serde_json::Value (to be clear, I'm talking about the metadata field of the Metadata struct, not the struct itself). I'm unsure exactly how the Python bindings would work, but likely it'd be serialized as a string on the Python side and then temporarily deserialized on the Rust side.

This wouldn't present a DoS vector because there's already a metadata size limit, and serde_json has a deserialization depth limit to avoid stack overflows. The alternative is letting users implement a more advanced metadata format themselves, e.g. by serializing their data to JSON and then storing that as a string in metadata, which seems more error-prone than providing support in safetensors directly.
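The workaround described above (json-encoding complex values into the existing string->string metadata) can be sketched like this; the key name "training" and the sample values are purely illustrative, not part of any safetensors convention:

```python
import json

# Complex training metadata we'd like to attach to a safetensors file.
training_info = {
    "step": 12000,
    "layer_structure": [1.0, 2.0, 1.0],
    "is_layer_norm": False,
}

# safetensors metadata must be a string -> string mapping, so the whole
# structure is serialized into a single string value.
metadata = {"training": json.dumps(training_info)}
assert isinstance(metadata["training"], str)

# On load, every consumer has to remember to decode that one field again.
restored = json.loads(metadata["training"])
assert restored == training_info
```

This is the "JSON in a string in JSON" pattern discussed later in the thread: it works, but the decode step becomes an ad-hoc convention each consumer must know about.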

I'm looking at using this to store training metadata of stable diffusion textual inversion embeddings while supporting the safetensors format (see https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/6625). I'd be happy to implement this but I wanted to make sure such a PR would be accepted first.

Narsil commented 1 year ago

Hi @PlasmaPower ,

Thanks for the suggestion. Do you mind sharing what kind of metadata that is? And how is it relevant to how you load the tensors themselves? That is, why does this data need to live within the file itself?

I'm reluctant to add new features because they add more complexity, and the current triviality of the format is one of its strengths (imo). That being said, the goal is also to be helpful, so...

For instance, the __metadata__ field with its string->string mapping was added because within transformers we would like to use only 1 file for every model across TF/Jax/Pytorch. And since those formats have different default layouts (NCHW vs NHWC, for instance), we need some kind of metadata to be able to load the tensors correctly if a file was saved in PT and needs to be loaded in Jax, for instance. Since it also allows saving things like version numbers, maybe dates and other such information, I agreed to it. It enabled a clear use case which would have been very tricky to do without it (loaders could always have looked at the tensors themselves, but that requires knowing where a given tensor is supposed to be used, which is painful and bug-prone).

I would like to have the same rationale here. So since those files seem to load correctly, why can't they be saved the same way? And then, what is the exact extent of the data to be stored? (A serde_json::Value is not convenient to store, especially since it has to be sent to Python afterwards; nothing undoable, but inconvenient.)

thomasw21 commented 1 year ago

Hey @Narsil!

I would also like to have slightly more complex JSON; typically I want to store int and possibly list[int]. I'm currently manually serializing it, but it's not the most satisfying.

Narsil commented 1 year ago

What are these int and List[int]?

How many are there?

thomasw21 commented 1 year ago

If the confusion is about the fact that I used Python type hinting, my bad. Basically I want to store the following object:

{ "concat_dim": 1, "concat_view_as": [2,3,5]}

It's usually quite small, and so I don't think it would hit any size issue.
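The manual serialization mentioned above might look like the following sketch (the metadata key "concat" is illustrative). It also shows why json-encoding is used instead of naive str() coercion, which silently loses types:

```python
import json

view = {"concat_dim": 1, "concat_view_as": [2, 3, 5]}

# Naively stuffing values through str() fits the string -> string mapping
# but loses the types: the list comes back as an opaque string.
naive = {k: str(v) for k, v in view.items()}
assert naive["concat_view_as"] == "[2, 3, 5]"  # a string, not a list

# json round-tripping the whole object keeps ints and lists intact.
stored = json.dumps(view)
assert json.loads(stored)["concat_view_as"] == [2, 3, 5]
```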

Narsil commented 1 year ago

I don't understand what this is supposed to do.

Are you storing a tensor and wanting to concatenate on load or something ?

PlasmaPower commented 1 year ago

One part of the use case for me is keeping compatibility with the normal pickle torch loading. Stable diffusion textual inversion embeddings look like this:

{'string_to_token': {'*': 265}, 'string_to_param': {'*': tensor([...], requires_grad=True)}, 'name': 'bad_prompt_version2', 'step': 12000, 'sd_checkpoint': '74906b9a', 'sd_checkpoint_name': 'Novel50WD50'}

Hypernetworks have other complex metadata such as 'layer_structure': [1.0, 2.0, 1.0], 'activation_func': 'linear', 'is_layer_norm': False

While this information would be encodable with the current metadata scheme, adapting it would mean essentially writing a new serializer and deserializer, or just json-encoding it and putting it in a string field of the metadata. The ideal solution seems to me to allow full JSON directly in the metadata.
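One way the adaptation described above could go, as a hedged sketch: json-encode each top-level value of the checkpoint's non-tensor data so the result fits the string -> string mapping (the tensor itself would live in the tensors section of the file, not in metadata). This per-key convention is just one possible choice, not an established safetensors scheme:

```python
import json

# Non-tensor part of the textual-inversion checkpoint from the example above.
embedding_info = {
    "string_to_token": {"*": 265},
    "name": "bad_prompt_version2",
    "step": 12000,
    "sd_checkpoint": "74906b9a",
    "sd_checkpoint_name": "Novel50WD50",
}

# json-encode each top-level value so every key and value is a string.
flat = {k: json.dumps(v) for k, v in embedding_info.items()}
assert all(isinstance(v, str) for v in flat.values())

# Loading requires applying the inverse convention everywhere the file
# is consumed -- exactly the ad-hoc deserializer the comment objects to.
assert json.loads(flat["step"]) == 12000
assert json.loads(flat["string_to_token"]) == {"*": 265}
```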

Narsil commented 1 year ago

Thanks for sharing the textual inversion use case, I'll take a closer look to see how to implement it in SD/diffusers.

I am under the impression we can do better without supporting these things in the file itself (just like the entire pytorch_lightning thing is not necessary in SD). But I could be mistaken.

Narsil commented 1 year ago

It seems it's already supported by SD, no? https://github.com/AUTOMATIC1111/stable-diffusion-webui/blob/master/modules/textual_inversion/textual_inversion.py#L159

And as I expected, it seems all the training info is just ditched.

PlasmaPower commented 1 year ago

That's actually my patch to support it haha https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/6625

But I'd like to support training with them, and hypernetworks as well.

Narsil commented 1 year ago

haha

Didn't even realize, nice ! and thanks (I forgot the start of this thread in the meantime).

Supporting arbitrary JSON necessarily opens up DoS attacks: https://edward-huang.com/programming/2021/03/09/5-json-denial-attack-that-every-hacker-take-advantage-of/ That is why I'm hesitant to add it.

JSON in string in JSON seems very questionable at best, but safetensors is kind of pushing you into it.

There is going to be an external audit of the lib and the format. I think we can wait for its conclusions. If DoS is unavoidable by any means, I think we could add it. But I'd like to keep that property as much as possible; it's a really nice property for webservices/serverless etc.

PlasmaPower commented 1 year ago

In combination with the size limit you've already implemented, serde_json already protects you against those attacks by default :) https://docs.rs/serde_json/latest/serde_json/de/struct.Deserializer.html#method.disable_recursion_limit

I also don't think there are hashmap-based DoS issues, because Rust randomizes its hashmap seeds to prevent exactly that.
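Beyond what the parser enforces, a loader accepting arbitrary JSON metadata could impose its own structural limits after parsing. A minimal sketch of such a depth check, in Python rather than the library's Rust (the limit of 32 is an arbitrary illustrative value, not anything safetensors or serde_json specifies):

```python
import json

MAX_DEPTH = 32  # arbitrary illustrative limit


def json_depth(value, depth=1):
    """Return the nesting depth of a parsed JSON value."""
    if isinstance(value, dict):
        return max((json_depth(v, depth + 1) for v in value.values()), default=depth)
    if isinstance(value, list):
        return max((json_depth(v, depth + 1) for v in value), default=depth)
    return depth  # scalars (str, int, float, bool, None) add no nesting


shallow = json.loads('{"concat_dim": 1, "concat_view_as": [2, 3, 5]}')
assert json_depth(shallow) <= MAX_DEPTH

deep = json.loads("[" * 50 + "]" * 50)  # 50 levels of nested lists
assert json_depth(deep) > MAX_DEPTH     # would be rejected by this policy
```

A size limit plus a depth limit like this addresses the stack-exhaustion class of attacks; as noted below, that is only one class of issue.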

Narsil commented 1 year ago

I need to check those claims. I don't blindly trust docs. :)

Stack recursion is only one type of issue.

williamberman commented 1 year ago

Chiming in from diffusers :) We do not store metadata in the safetensors file. We store metadata in separate JSON files, from which all model configuration is loaded separately, before the tensors themselves are actually loaded. This is flexible, allows re-using the same metadata with pickle serialization, and keeps metadata human-editable (you don't need to load all the tensors just to manually tweak some values).

I'd generally recommend others do the same and avoid storing metadata with tensors if possible. (If the safetensors library needs to store its own metadata, that's separate and makes sense to me.)
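The sidecar approach described above can be sketched as follows; the filenames and the `.json`-next-to-`.safetensors` convention here are illustrative, not the actual diffusers layout:

```python
import json
import tempfile
from pathlib import Path

config = {
    "layer_structure": [1.0, 2.0, 1.0],
    "activation_func": "linear",
    "is_layer_norm": False,
}

with tempfile.TemporaryDirectory() as d:
    weights = Path(d) / "model.safetensors"  # tensors would be written here
    sidecar = weights.with_suffix(".json")   # human-editable metadata beside it

    sidecar.write_text(json.dumps(config, indent=2))

    # Configuration can be read (and hand-edited) without touching the
    # tensors file at all.
    loaded = json.loads(sidecar.read_text())
    assert loaded == config
```

Because the sidecar is plain JSON, the same metadata works unchanged whether the weights are stored as safetensors or pickle.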

iacore commented 1 year ago

I don't care about DoS attacks when loading a file from disk. At worst it will not load, and I will know it's broken.

github-actions[bot] commented 10 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] commented 9 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.