Exactly.
Also, numpy doesn't support bfloat16, unfortunately.
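A quick way to check this (minimal sketch; the `ml_dtypes` package below is a third-party workaround, not part of numpy itself):

```python
import numpy as np

# numpy has no built-in bfloat16 dtype; this raises TypeError:
try:
    np.dtype("bfloat16")
except TypeError as err:
    print("unsupported:", err)

# Possible workaround (third-party, not numpy): the ml_dtypes package
# registers a bfloat16 scalar type that numpy arrays can then use.
# import ml_dtypes
# x = np.zeros(4, dtype=ml_dtypes.bfloat16)
```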
So the main reason is bfloat16 support in numpy? Then I figure they would be open to a pull request.
And lazy loading and zero-copy.
All 3 are reasons.
Actually, a very minor one is the use of zip, which can also be abused by zip bombs. (And zip compression essentially never shrinks ML tensors, so it brings no value; it's actually very detrimental to load speeds if things do get compressed.)
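A rough sketch of why compression buys nothing here (trained float32 weights look close to random bytes, so the two sizes come out nearly equal):

```python
import io
import numpy as np

# Random-looking float32 data barely compresses under DEFLATE,
# while decompression still costs time at every load.
w = np.random.randn(1024, 1024).astype(np.float32)

raw, packed = io.BytesIO(), io.BytesIO()
np.save(raw, w)
np.savez_compressed(packed, w=w)
print(raw.getbuffer().nbytes, packed.getbuffer().nbytes)  # sizes nearly equal
```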
Also: a trivial format spec (which leads to trivial parser implementations in any language), including random-access reading.
npy format spec doesn't seem too complex: https://stackoverflow.com/a/4090115/593036 but still more than this one :)
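For reference, a minimal sketch of a header-only safetensors reader; the whole layout is an 8-byte little-endian length, a JSON header, and a flat byte buffer, which is what makes random access trivial:

```python
import json
import struct

def read_safetensors_header(path):
    """Read just the header of a safetensors file.

    Layout: an 8-byte little-endian u64 N, then N bytes of JSON mapping
    tensor names to {"dtype", "shape", "data_offsets"}, where the
    offsets index into the byte buffer that follows the header.
    """
    with open(path, "rb") as f:
        (n,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(n))

# With the header in hand, any single tensor can be read (or mmapped)
# by seeking to 8 + n + data_offsets[0], without touching the rest.
```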
Numpy supports lazy loading via memmap. And it supports zero-copy by default, since it's the same format in memory and on disk.
Not the npz files as far as I know: https://stackoverflow.com/questions/29080556/how-does-numpy-handle-mmaps-over-npz-files
Yes, definitely, npz does not support those since it's compressed on disk. I meant the npy files.
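For completeness, the npy path looks like this (a minimal sketch; `a.npy` is a throwaway file):

```python
import numpy as np

np.save("a.npy", np.arange(1_000_000, dtype=np.float32))

# Lazy loading: the data is mapped, not read, until a slice is touched;
# slicing yields a zero-copy view backed by the file.
m = np.load("a.npy", mmap_mode="r")
chunk = m[10:20]
print(type(m).__name__, chunk.base is not None)  # memmap True
```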
Well, having a single file for an entire ML model is usually more practical than handling every array separately.
So .npy doesn't really fit the bill to be useful in ML; only .npz is a contender (and it is missing the 3 aforementioned features).
see also #20
As for mmap: numpy.savez uses a non-compressed zip archive by default. I haven't looked into the details of non-compressed zips, but if that just leaves the content in a contiguous blob, you should be able to mmap it, right?
You are more than welcome to implement memory mapping for npz when it's uncompressed.
But you would be breaking the format (which can and should accept compressed values, regardless of the defaults for savez). Still, probably a nice addition over at numpy if possible.
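In case it helps whoever picks this up, here is a rough sketch of what that could look like (`mmap_npz_member` is hypothetical, not numpy API; it leans on the zip local-header layout and numpy's public npy-header readers, and only works for uncompressed members):

```python
import struct
import zipfile
import numpy as np

def mmap_npz_member(path, name):
    """Hypothetical sketch: memory-map one array stored *uncompressed*
    inside an .npz archive."""
    with zipfile.ZipFile(path) as zf:
        info = zf.getinfo(name + ".npy")  # savez stores members as <name>.npy
        if info.compress_type != zipfile.ZIP_STORED:
            raise ValueError("member is compressed; cannot be memory-mapped")
        header_offset = info.header_offset
    with open(path, "rb") as f:
        # Skip the zip local file header: 30 fixed bytes + filename + extra field.
        f.seek(header_offset)
        local = f.read(30)
        name_len, extra_len = struct.unpack("<HH", local[26:30])
        f.seek(header_offset + 30 + name_len + extra_len)
        # Parse the embedded .npy header to learn dtype/shape/order.
        version = np.lib.format.read_magic(f)
        assert version == (1, 0)
        shape, fortran, dtype = np.lib.format.read_array_header_1_0(f)
        data_offset = f.tell()
    return np.memmap(path, dtype=dtype, mode="r", shape=shape,
                     order="F" if fortran else "C", offset=data_offset)
```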
Another thing regarding memory mapping and zero-copy: you cannot memory-map files which are saved in a different endianness and/or row order than the host machine / target tensor.
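A minimal illustration of the endianness point (assumes a little-endian host; `be.bin` is a throwaway file):

```python
import numpy as np

np.arange(4, dtype=">f4").tofile("be.bin")  # big-endian float32 on disk

m = np.memmap("be.bin", dtype=">f4", mode="r")  # mapping itself works...
swapped = m.byteswap()          # ...but native-order use forces a byteswap
print(swapped.flags.owndata)    # True: a fresh copy, no longer file-backed
native = swapped.view(m.dtype.newbyteorder("="))  # reinterpret as native order
```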
I think highlighting that safetensors is a better format for the use case of storing a collection of tensors would make the point stronger.
For a single tensor, npy seems to fit the bill, but indeed for many it doesn't.
So, for example, for the use case of storing a lot of embeddings coming from an encoder model (say because you want to build a knn index from them, or use them for building a classifier), npy can work well, since that's a single tensor (with a batch dimension).
For the use case of storing many different tensors, as is the case for ML models, having a new format sounds great!
While we're on this topic: did you know there is no good format to store both embeddings and metadata? If that might inspire some ideas...
> Numpy supports lazy loading via memmap. And it supports zero-copy by default, since it's the same format in memory and on disk.
Assuming everything is memmapped... that is quite the constraint there.
Anyhow, Fastor (C++) is a great way to go underneath the covers for your Fastor.py.
The answer is yes for zero-copy and lazy loading for numpy in the table.