huggingface / safetensors

Simple, safe way to store and distribute tensors
https://huggingface.co/docs/safetensors
Apache License 2.0

What's the benefit compared to npy? #103

Closed rom1504 closed 10 months ago

rom1504 commented 1 year ago

The answer is yes for zero-copy and lazy loading for numpy in the table.

Narsil commented 1 year ago

Exactly.

Also numpy doesn't support bfloat16 unfortunately.
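
For context, bfloat16 is the upper 16 bits of an IEEE 754 float32 (1 sign bit, 8 exponent bits, 7 mantissa bits), so a file format only needs two bytes per value plus a dtype tag to carry it. A minimal sketch using only the Python standard library; the helper names are hypothetical, and the conversion truncates instead of rounding:

```python
import struct

def float_to_bf16_bits(x: float) -> int:
    """Return the 16-bit bfloat16 pattern for x by truncating its float32 form."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16  # keeps the sign, 8 exponent bits, top 7 mantissa bits

def bf16_bits_to_float(bits16: int) -> float:
    """Widen a bfloat16 bit pattern back to a Python float via float32."""
    (x,) = struct.unpack("<f", struct.pack("<I", bits16 << 16))
    return x
```

Values such as 1.0 or -2.5 survive the round trip exactly; most others lose their low mantissa bits.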

rom1504 commented 1 year ago

So the main reason is bfloat16 support in numpy, then? I figure they would be open to a pull request.

Narsil commented 1 year ago

And lazy loading and zero-copy.

All 3 are reasons.

Actually, a very minor one is the use of zip, which can also be abused by zip bombs. (And zipping practically never compresses ML tensors, so it brings no value; it's actually quite detrimental to load speeds if things are actually compressed.)
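
Both halves of that point can be checked with a small experiment. The sketch below uses `zlib` (the same DEFLATE algorithm that compressed zip members use) on two assumed payloads: Gaussian-distributed float32 values standing in for trained weights, and a long run of zeros standing in for zip-bomb material. The sizes and the standard deviation are arbitrary choices, not measurements from a real model:

```python
import random
import struct
import zlib

random.seed(0)

# Stand-in for trained weights: 100k float32 values drawn from a Gaussian.
n = 100_000
weights = struct.pack(f"<{n}f", *(random.gauss(0.0, 0.02) for _ in range(n)))
weights_ratio = len(zlib.compress(weights, 6)) / len(weights)

# Highly redundant payload: the kind of content a zip bomb is built from.
zeros = bytes(10_000_000)
zeros_ratio = len(zlib.compress(zeros, 6)) / len(zeros)

print(f"weights: compressed to {weights_ratio:.2f} of original size")
print(f"zeros:   compressed to {zeros_ratio:.5f} of original size")
```

The mantissa bits of the weights are close to random, so DEFLATE saves very little on them, while the zero buffer shrinks by orders of magnitude. That asymmetry is what lets a tiny archive expand into a huge allocation.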

julien-c commented 1 year ago

also trivial format spec (which leads to trivial parser implementation in any language) including random-access reading

npy format spec doesn't seem too complex: https://stackoverflow.com/a/4090115/593036 but still more than this one :)
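
To make the "trivial parser" claim concrete, here is a minimal sketch of the layout as described in the safetensors README: an 8-byte little-endian unsigned header size, a UTF-8 JSON header mapping each tensor name to its `dtype`, `shape`, and `data_offsets`, then one contiguous byte buffer. The function names `write_tensors` and `read_tensor` are hypothetical. This is not the library's implementation: it skips validation such as capping the header size or checking for overlapping offsets.

```python
import json
import os
import struct
import tempfile

def write_tensors(path, tensors):
    """Write {name: (dtype, shape, raw_bytes)} in the safetensors layout."""
    header, offset = {}, 0
    for name, (dtype, shape, raw) in tensors.items():
        header[name] = {
            "dtype": dtype,
            "shape": shape,
            "data_offsets": [offset, offset + len(raw)],  # relative to the data section
        }
        offset += len(raw)
    header_bytes = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))  # 8-byte little-endian header size
        f.write(header_bytes)
        for _dtype, _shape, raw in tensors.values():
            f.write(raw)

def read_tensor(path, name):
    """Read one tensor's metadata and bytes without reading the others."""
    with open(path, "rb") as f:
        (header_size,) = struct.unpack("<Q", f.read(8))
        entry = json.loads(f.read(header_size))[name]
        begin, end = entry["data_offsets"]
        f.seek(8 + header_size + begin)  # random access into the data section
        return entry, f.read(end - begin)

# Usage: store two small float32 tensors, then fetch only the second one.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.safetensors")
write_tensors(demo_path, {
    "a": ("F32", [2], struct.pack("<2f", 1.0, 2.0)),
    "b": ("F32", [3], struct.pack("<3f", 3.0, 4.0, 5.0)),
})
meta, raw = read_tensor(demo_path, "b")
b_values = list(struct.unpack("<3f", raw))
```

Because the header records absolute byte ranges, fetching tensor `b` only requires reading the header and one `seek`.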

rom1504 commented 1 year ago

Numpy supports lazy loading via memmap. And it supports zero-copy by default, since it's the same format in memory and on disk.

Narsil commented 1 year ago

Not the npz files as far as I know: https://stackoverflow.com/questions/29080556/how-does-numpy-handle-mmaps-over-npz-files

rom1504 commented 1 year ago

Yes, definitely, npz files do not support those since they're compressed on disk. I meant the npy files.

Narsil commented 1 year ago

Well, having a single file for entire ML models is usually more practical than handling every array separately.

So .npy doesn't really fit the bill to be useful in ML; only .npz is a contender (and it lacks the 3 aforementioned features).

keturn commented 1 year ago

see also #20

As for mmap: numpy.savez uses a non-compressed zip archive by default. I haven't looked into the details of non-compressed zips, but if that just leaves the content in a contiguous blob, you should be able to mmap it, right?
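
The idea can be tested with the standard library alone. A stored (uncompressed) zip member is a contiguous byte range that starts right after its local file header: 30 fixed bytes followed by the file name and an extra field, whose lengths sit at byte offsets 26 and 28 of that header. In the sketch below the payload is arbitrary bytes standing in for one `.npy` member; a real array would also need its own `.npy` header skipped, and the resulting offset is not guaranteed to be aligned for the dtype.

```python
import mmap
import os
import struct
import tempfile
import zipfile

payload = bytes(range(256)) * 16  # stand-in for the bytes of one .npy member

path = os.path.join(tempfile.mkdtemp(), "demo.zip")
with zipfile.ZipFile(path, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr("arr_0.npy", payload)

with zipfile.ZipFile(path) as zf:
    info = zf.getinfo("arr_0.npy")
    is_stored = info.compress_type == zipfile.ZIP_STORED  # mapping needs this

with open(path, "rb") as f:
    # Read the variable-length parts of the local file header to find the data.
    f.seek(info.header_offset + 26)
    name_len, extra_len = struct.unpack("<HH", f.read(4))
    data_start = info.header_offset + 30 + name_len + extra_len
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # memoryview slices reference the mapping directly, without copying.
        with memoryview(mm) as whole, whole[data_start:data_start + info.file_size] as view:
            matches = view == payload
```

This only holds while the member is stored uncompressed; a member written with DEFLATE has no such byte-for-byte range to map.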

Narsil commented 1 year ago

You are more than welcome to implement memory mapping for npz when it's uncompressed.

But you would be breaking the format (which can and should accept compressed values, regardless of the defaults for savez). Still, it would probably be a nice addition over at numpy if possible.

Another thing about memory mapping and zero-copy: you cannot memory map files that are saved in a different endianness and/or row order than the host machine/target tensor.
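
The endianness half of that constraint is easy to reproduce. The sketch below serializes three floats in whichever byte order the host does not use, then compares a zero-copy reinterpretation of those bytes against a byte-swapped copy. The values are arbitrary, and the row-order case is not covered:

```python
import array
import struct
import sys

values = [1.0, 2.0, 3.5]

# Serialize in the byte order the host does NOT use.
foreign = ">" if sys.byteorder == "little" else "<"
data = struct.pack(f"{foreign}3f", *values)

# Zero-copy view: reinterprets the bytes in native order, so the values are wrong.
zero_copy = memoryview(data).cast("f").tolist()

# Getting the right values requires a byte swap, which materializes a copy.
swapped = array.array("f", data)
swapped.byteswap()
copied = swapped.tolist()
```

Here `copied` matches `values` while `zero_copy` does not, which is why a format that fixes one byte order can promise zero-copy loads on hosts that share it.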

rom1504 commented 1 year ago

I think highlighting that safetensors is a better format for the use case of storing a collection of tensors will make the point stronger.

For a single tensor, npy seems to fit the bill, but indeed for many it doesn't.

So, for example, for the use case of storing a lot of embeddings coming from an encoder model (say, because you want to build a kNN index from them or use them for building a classifier), npy can work well, as it's a single tensor (with a batch dimension).

For the use case of storing many different tensors, as is the case for ML models, having a new format sounds great!

While we're on this topic: did you know there is no good format to store both embeddings and metadata? If that might inspire some ideas...

kklingeman commented 1 year ago

Numpy supports lazy loading via memmap. And it supports zero-copy by default, since it's the same format in memory and on disk.

Assuming everything is memmapped... that is quite the constraint there.

anyhow.... Fastor C++ is a great way to go.... underneath the covers to your Fastor.py

https://romanpoya.medium.com/a-look-at-the-performance-of-expression-templates-in-c-eigen-vs-blaze-vs-fastor-vs-armadillo-vs-2474ed38d982
