huggingface / safetensors

Simple, safe way to store and distribute tensors
https://huggingface.co/docs/safetensors
Apache License 2.0

Some clarification about PyTorch format #18

Closed dzhulgakov closed 9 months ago

dzhulgakov commented 2 years ago

The table in the README.md is not entirely correct. PyTorch's default format is not just pickle, but a Zip file with raw tensor data and a small-ish pickle file that just refers to the tensor data. This has been the case since version 1.6; see https://github.com/pytorch/pytorch/blob/master/torch/serialization.py#L405 and _use_new_zipfile_serialization.

This means that this format:

- can be made "almost" zero-copy as the tensor data can be mmapped and transferred to the desired location one by one. We could consider "delayed" reading

The added benefit of the ZIP file is that it's ubiquitous: it can be easily inspected (or modified!) with regular tools.
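For instance, a quick sketch of inspecting such a checkpoint with just the standard zipfile module (the exact entry names depend on the archive name and the PyTorch version):

```python
import io
import zipfile
import torch

# Write a toy checkpoint with the default (post-1.6) zip serialization,
# then inspect it with plain zipfile: no torch.load / pickle involved.
buf = io.BytesIO()
torch.save({"weight": torch.zeros(2, 3)}, buf)
buf.seek(0)

with zipfile.ZipFile(buf) as zf:
    for info in zf.infolist():
        # Typical entries: <archive>/data.pkl (the small pickle) and
        # <archive>/data/0, <archive>/data/1, ... (raw tensor storages).
        print(info.filename, info.file_size, info.compress_type)
```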

So the only outstanding problem is a safe pickle implementation. Luckily, we already have one implemented in torch.jit; it just needs a minor polish.
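As a rough illustration of what "safe pickle" means (a pure-Python sketch, not the torch.jit C++ implementation mentioned above): only an explicit allow-list of globals may be reconstructed, everything else is rejected.

```python
import io
import pickle
from collections import OrderedDict

# Illustrative allow-list; a real checkpoint loader would need a larger one.
ALLOWED = {("collections", "OrderedDict")}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        if (module, name) in ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"blocked global: {module}.{name}")

data = pickle.dumps(OrderedDict(a=1))
print(RestrictedUnpickler(io.BytesIO(data)).load())  # succeeds; a blocked global would raise
```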

Here's a colab demonstrating internals of the format: https://colab.research.google.com/drive/1DUoqwbmBTiDsTUggZsBJL8UIKv_AjKSa?usp=sharing

cc @Narsil @sgugger

Narsil commented 2 years ago

Hi @dzhulgakov ,

Thanks for the clarification. I was personally aware of all of this but didn't want to dive too deep into the internals; the biggest flaw we want to get away from is pickle. Reimplementing pickle in a safe subset is really nice (although I don't really understand why going that route is necessary, since a breaking change IS necessary anyway and we could have a cleaner format).

Also, just to mention: as long as PyTorch enables loading old formats (https://github.com/pytorch/pytorch/blob/936e93058b2781d6cee2da59cccba051726dd46f/torch/serialization.py#L764), the format will be unsafe regardless of the claims of the new format. This is the big breaking change necessary for PyTorch.

Zip files themselves are not really safe despite being ubiquitous (see Zip bomb). Since PT is not even using compression (if I understand correctly), it's opening a door without any benefit I can see.

can be made "almost" zero-copy as the tensor data can be mmapped and transferred to the desired location one by one. We could consider "delayed" reading

The PT format is super fast, for sure! It's actually faster than this lib by a small but real margin most of the time. I haven't really dug into it, but I'm guessing the issue lies in the bindings (or maybe because this lib doesn't try hard enough to serialize its reads, or something like that).

"zero-copy" is lying anyways since the tensor has to own the buffer to do anything useful with it, so for every possible tensor there needs to be at least one copy. (Since it was a selling factor for other serialization format, I felt compelled to put it here but it's not really providing any value, the real value is loading times IMO).

The "delayed" reading would be super helpful for distributed loading. @thomasw21 might be interested. The biggest culprit afaik, is that there is no way to load a slice of a tensor (which is the default way to load things in a distributed setting when using tensor parallelism)

So the only outstanding problem is a safe pickle implementation.

That's one way to go about it. I really cannot wrap my head around why reimplementing pickle seems better than using a dead-simple format like the one proposed here (it's probably not perfect, but it's trivial to implement in any language).

dzhulgakov commented 2 years ago

Thanks for the expanded comments. I primarily want to see whether we can address the issue without changing the default format as there are a lot of existing checkpoints in the wild.

Reimplementing pickle in a safe subset is really nice (although I don't really understand why going that route is necessary, since a breaking change IS necessary anyway and we could have a cleaner format).

I wouldn't go that route from scratch, but such an implementation already exists in PyTorch because of jit.save. My point is that we can just reuse it. And since it's C++, it should be embeddable in other languages if needed.

Also, just to mention: as long as PyTorch enables loading old formats, the format will be unsafe regardless of the claims of the new format. This is the big breaking change necessary for PyTorch.

PyTorch 1.6 is more than 2 years old, so most checkpoints in the wild probably use the new format. Perhaps you can run an analysis on the HF Hub to check. I think making a breaking change to not load the old format by default (with one release for a deprecation warning) is viable.
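Such an analysis could be as simple as checking whether each checkpoint file is a zip archive (a small sketch; fetching and iterating over Hub files is omitted):

```python
import zipfile

def uses_new_format(path: str) -> bool:
    # Checkpoints written by the post-1.6 serializer are zip archives;
    # legacy ones are bare pickle streams, so this distinguishes them.
    return zipfile.is_zipfile(path)
```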

Zip files themselves are not really safe despite being ubiquitous (see Zip bomb).

TIL that there are non-recursive zip bombs :) We could enforce no compression at read time in "safe" mode; that's just an assert on the central directory header.
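Something along these lines, say (a sketch; the real check would live inside the reader):

```python
import zipfile

def assert_uncompressed(path: str) -> None:
    # "Safe mode" sketch: reject any compressed entry, which rules out the
    # classic zip bombs that rely on extreme compression ratios.
    with zipfile.ZipFile(path) as zf:
        for info in zf.infolist():  # entries come from the central directory
            if info.compress_type != zipfile.ZIP_STORED:
                raise ValueError(f"{info.filename} is compressed; refusing to read")
```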

That's one way to go about it. I really cannot wrap my head around why reimplementing pickle seems better than using a dead-simple format like the one proposed here (it's probably not perfect, but it's trivial to implement in any language).

As @soumith pointed out on another thread, we'd need to add view/storage sharing at least. And then you might have checkpoints with nested structs, e.g. serializing a whole dict like {'model1': model1.state_dict(), 'model2': model2.state_dict()} (I've seen it in the wild), so your format would likely need to add dicts/tuples/arrays. It quickly becomes a bigger undertaking.
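For reference, a small example of the view/storage sharing that would need to survive a round-trip (plain torch.save/torch.load shown here):

```python
import io
import torch

base = torch.arange(10.0)
view = base[2:6]  # shares base's storage, just with an offset

buf = io.BytesIO()
torch.save({"base": base, "view": view}, buf)
buf.seek(0)
loaded = torch.load(buf)

# The storage is written once and the aliasing survives the round-trip:
# mutating the reloaded base is visible through the reloaded view.
loaded["base"][2] = 42.0
print(loaded["view"][0])  # tensor(42.)
```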

Narsil commented 2 years ago

PyTorch 1.6 is more than 2 years old, so most checkpoints in the wild probably use the new format. Perhaps you can run an analysis on the HF Hub to check. I think making a breaking change to not load the old format by default (with one release for a deprecation warning) is viable.

Most of it, yes, for sure. But most of the tensors on the Hub were created from transformers and (as far as I know) need neither nested tensors nor other structs (dict, tuple, or anything).

TIL that there are non-recursive zip bombs :) We could enforce no compression at read time in "safe" mode; that's just an assert on the central directory header.

To be fair, this is a far lower security issue than arbitrary code execution. (It's just that since we're talking about breaking stuff, we might as well make sure we don't leave easy-to-fix holes like this one.)

As @soumith pointed out on another thread, we'd need to add view/storage sharing at least. And then you might have checkpoints with nested structs, e.g. serializing a whole dict like {'model1': model1.state_dict(), 'model2': model2.state_dict()} (I've seen it in the wild), so your format would likely need to add dicts/tuples/arrays. It quickly becomes a bigger undertaking.

Self-references are not the hardest thing to add, to be honest (it's a slicing operation into an existing tensor), but they do add a bit of bloat to the spec and code. You could now create slices of slices of slices, slice loops and such, and checking that each nested slice is valid might be cumbersome. Not to mention loading all of these dynamically, which is an important feature I think is lacking from the current PT load and really necessary for distributed work. (If you load only the slice, you only need to create the slice buffer, but if you load both the slice and the parent, you want to recreate the shared structure.)

Serializing Python objects is something that definitely makes porting and sharing tensors harder. transformers uses config.json (so a pure JSON file) for most of the use cases I can think of. I can definitely understand why PyTorch, being a Python-first library, would want to keep that feature, though. Do you support all those cases in your C++ implementation? Are tuples cast to C++ tuples at runtime?

dzhulgakov commented 2 years ago

You could now create slices of slices of slices, slice loops and such...

Hm, it's single-level: i.e. you have tensors (with strides + offset) and storages. So there are no extra indirections or loops.
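A quick illustration of that single level of indirection:

```python
import torch

base = torch.arange(12.0).reshape(3, 4)
v1 = base[1:, ::2]   # shape (2, 2)
v2 = v1[:, 1:]       # shape (2, 1)
print(v1.storage_offset(), v1.stride())  # 4 (4, 2)
print(v2.storage_offset(), v2.stride())  # 6 (4, 2)
# Both views point straight into base's buffer: data_ptr is base's pointer
# plus offset * element_size, with no chain of slice objects in between.
print(v2.data_ptr() == base.data_ptr() + v2.storage_offset() * v2.element_size())  # True
```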

Not to mention loading all of these dynamically, which is an important feature I think is lacking from the current PT load and really necessary for distributed work.

I think the right solution for it would be "lazy loading". I.e. load the Python data structure, filter it as needed, and then have a "materialize data" step that can load parameter chunks as needed. Yes, it's extra complexity in the implementation, but it would work smoothly from the user's perspective.
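A rough sketch of that flow (the names here are purely illustrative, not an actual PyTorch API):

```python
from typing import Callable, Dict, Tuple

Entry = Tuple[int, Tuple[int, ...], str]  # (byte offset, shape, dtype)

class LazyCheckpoint:
    """Hypothetical lazy-loading wrapper: parse the header eagerly, let the
    caller filter keys, and only materialize the tensors actually requested."""

    def __init__(self, entries: Dict[str, Entry], read_fn: Callable[..., object]):
        self._entries = entries    # from the metadata/header, cheap to load
        self._read_fn = read_fn    # does the actual disk read for one entry

    def keys(self):
        return self._entries.keys()

    def materialize(self, names):
        # Only now are parameter chunks read from disk.
        return {n: self._read_fn(*self._entries[n]) for n in names}
```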

Serializing Python objects is something that definitely makes porting and sharing tensors harder. transformers uses config.json (so a pure JSON file) for most of the use cases I can think of.

Curious, what are the use cases for porting/sharing? Like transferring weights to a different runtime?

I can definitely understand why PyTorch, being a Python-first library, would want to keep that feature, though. Do you support all those cases in your C++ implementation? Are tuples cast to C++ tuples at runtime?

We can't cast to C++ tuples because that'd require static type knowledge. But there is an equivalent mini-implementation in C++ of a subset of the CPython object model: https://pytorch.org/cppdocs/api/structc10_1_1_i_value.html (lists/tuples/dicts). It could be separated from the rest of PyTorch too if needed.

Narsil commented 2 years ago

So there are no extra indirections or loops.

There is, if you don't want to force a single buffer of data in the case of lazy loads. AFAIK, currently when loading from pickle, Python creates the single byte buffer and then tensors hold references to that single buffer with strides and offsets. If you need only the sliced tensor, you're still going to allocate the whole byte buffer and use strides and offsets into it, meaning you're allocating way too much.

So if you want to be able to load only part of a tensor, then you cannot load that whole buffer (that would defeat the purpose of lazy loading, which is to allocate LESS, not more). That means you're potentially creating a new byte buffer with only the sliced data, and then other tensors could be referencing that data. And you might later create a new tensor sharing part of the buffer, but since you already created the other buffer, you have lost the opportunity to share it.

Maybe there's a way to make an API that delays the loading until all desired tensors are declared, but then there would be a lot of code to figure out the "smallest" possible allocation. And I'm not even sure it's a well-posed definition.

That's why, IMO, lazy loading is incompatible with tensors referencing each other.

Curious what are the use cases for porting/sharing? Like a transfer of weights to a different runtime?

I've used https://github.com/LaurentMazare/tch-rs/ in the past to avoid hitting the GIL, for instance (and we don't have good C++ tokenizers bindings, and I just enjoy Rust). We've also been contacted by users loading/interacting with weights in at least Elixir, Ruby, and JS. I'm not sure if those were actually using torch/PyTorch wrappers or different libraries, tbh, but exporting models from PyTorch to a common format readable from another language is something we advocate quite regularly.

And that comes with its own share of issues, where scripting/tracing/ONNX exports don't necessarily support all the ops, and if you care only about the weights and not the graph, it's just cumbersome.

We can't cast to C++ tuples because that'd require static type knowledge. But there are equivalent mini-implementation in C++ of a subset of CPython object model: https://pytorch.org/cppdocs/api/structc10_1_1_i_value.html (lists/tuples/dicts). It could be separated from the rest of PyTorch too if needed.

Thanks for sharing!

github-actions[bot] commented 9 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.