jozu-ai / kitops

Tools for easing the handoff between AI/ML and App/SRE teams.
https://KitOps.ml
Apache License 2.0

Improve pack and unpack speed #257

Closed · amisevsk closed 1 month ago

amisevsk commented 2 months ago

**Describe the problem you're trying to solve**

Modelkits with large files can take a long time to pack/unpack because gzip is slow. We can speed this up, but we need to weigh the options carefully, since changing the compression format will change model digests.

**Describe the solution you'd like**

Choose a faster option for storing modelkit layers.

**Describe alternatives you've considered**

We could also make the storage type configurable (gzip, no compression, zstd), e.g. in the Kitfile. This would mean the same modelkit data could potentially pack into modelkits with different digests, though.


bmicklea commented 2 months ago

We'll start with an initial 1-day spike to see what's possible / fruitful and then go from there.

amisevsk commented 1 month ago

I've pushed a branch that can be used for testing: https://github.com/jozu-ai/kitops/tree/compression-opts

In this branch, kit supports a few compression options, specified via the `--compression` flag for `kit pack`:

- `none`
- `gzip`
- `gzip-fastest`
- `zstd`

Each format is reflected in the layer's mediatype as expected and is automatically handled by unpack.
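For anyone following along, here's a minimal sketch of how that kind of mediatype-based dispatch can work on unpack. The mediatype strings and the `decompressor` helper below are illustrative assumptions, not kit's actual values or API:

```go
package unpack

import (
	"compress/gzip"
	"fmt"
	"io"

	"github.com/klauspost/compress/zstd"
)

// Hypothetical mediatypes for illustration; the real values kit records
// in the manifest may differ.
const (
	mtTar     = "application/vnd.kitops.modelkit.layer.v1.tar"
	mtTarGzip = "application/vnd.kitops.modelkit.layer.v1.tar+gzip"
	mtTarZstd = "application/vnd.kitops.modelkit.layer.v1.tar+zstd"
)

// decompressor picks a reader based solely on the layer's mediatype, so
// unpack needs no out-of-band information about how a layer was packed.
func decompressor(mediaType string, r io.Reader) (io.ReadCloser, error) {
	switch mediaType {
	case mtTar:
		return io.NopCloser(r), nil
	case mtTarGzip:
		return gzip.NewReader(r)
	case mtTarZstd:
		zr, err := zstd.NewReader(r)
		if err != nil {
			return nil, err
		}
		return zr.IOReadCloser(), nil
	default:
		return nil, fmt.Errorf("unsupported layer mediatype %q", mediaType)
	}
}
```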

From testing this briefly with ghcr.io/jozu-ai/llama-2, my two main takeaways are:

  1. For many models, the weights are pretty much incompressible; this is format dependent and largely comes down to how much metadata the file includes (e.g. f16 has significant room for compression, q4_0 does not).
  2. Gzip is incredibly slow on incompressible data; switching to the fastest gzip level barely changes layer size but is generally 10x faster.
Full data (pack + unpack of llama-2 7B quantizations with different compression options):

| quantization | compression | time (pack) | time (unpack) | size |
| --- | --- | --- | --- | --- |
| q4_0 | none | 4.44s | 1.94s | 3.5 GiB |
| q4_0 | zstd | 9.88s | 3.23s | 3.5 GiB |
| q4_0 | gzip | 81.35s | 30.46s | 3.3 GiB |
| q4_0 | gzip-fastest | 8.75s | 3.70s | 3.5 GiB |
| q5_0 | none | 4.27s | 3.86s | 4.3 GiB |
| q5_0 | zstd | 11.91s | 3.84s | 4.3 GiB |
| q5_0 | gzip | 96.99s | 32.27s | 4.3 GiB |
| q5_0 | gzip-fastest | 9.60s | 3.85s | 4.3 GiB |
| q8_0 | none | 6.98s | 6.03s | 6.6 GiB |
| q8_0 | zstd | 19.67s | 6.16s | 6.6 GiB |
| q8_0 | gzip | 152.09s | 57.54s | 6.4 GiB |
| q8_0 | gzip-fastest | 14.69s | 6.28s | 6.6 GiB |
| f16 | none | 12.56s | 11.01s | 12.5 GiB |
| f16 | zstd | 33.37s | 21.31s | 9.6 GiB |
| f16 | gzip | 247.29s | 99.93s | 9.6 GiB |
| f16 | gzip-fastest | 111.36s | 98.14s | 9.5 GiB |

The above tests the llama-2 (7B) model with weights stored in GGUF format; it's possible that other formats will compress differently, though I doubt it, since the actual weights in a model generally look like random numbers.
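If we ever want to decide compression per-layer instead of globally, one cheap heuristic is to compress a small sample of the file at gzip's fastest level and check the ratio. A rough sketch under that assumption (`probeRatio` and the 64 MiB sample size are hypothetical, not anything in kit):

```go
package main

import (
	"compress/gzip"
	"fmt"
	"io"
	"os"
)

// countingWriter tallies bytes written so we can measure compressed size
// without buffering the whole output.
type countingWriter struct{ n int64 }

func (c *countingWriter) Write(p []byte) (int, error) {
	c.n += int64(len(p))
	return len(p), nil
}

// probeRatio compresses the first sampleSize bytes of a file with gzip's
// fastest level and returns compressed size / original size. Ratios near
// 1.0 suggest the data is effectively incompressible.
func probeRatio(path string, sampleSize int64) (float64, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	cw := &countingWriter{}
	zw, err := gzip.NewWriterLevel(cw, gzip.BestSpeed)
	if err != nil {
		return 0, err
	}
	written, err := io.Copy(zw, io.LimitReader(f, sampleSize))
	if err != nil {
		return 0, err
	}
	if err := zw.Close(); err != nil {
		return 0, err
	}
	if written == 0 {
		return 1, nil // empty sample: treat as incompressible
	}
	return float64(cw.n) / float64(written), nil
}

func main() {
	ratio, err := probeRatio(os.Args[1], 64<<20) // sample the first 64 MiB
	if err != nil {
		panic(err)
	}
	fmt.Printf("estimated compression ratio: %.2f\n", ratio)
}
```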

bmicklea commented 1 month ago

Nice work. It doesn't look like there's any point in using gzip - the compressed sizes are nearly identical to zstd's, and zstd is far faster. Should we simplify things to just a no-compression option and a zstd option?

amisevsk commented 1 month ago

The main concern we had with zstd was that it's newer and the implementation isn't standardized, so a future update could change behavior/digests. Even in the current state, I've had to use the "Better Compression" option to replicate what you get from the official binary at its default compression level (which is a little strange).
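For reference, assuming a Go zstd implementation along the lines of klauspost/compress (an assumption on my part, not something confirmed in this thread), matching the reference binary's default output means selecting the better-compression level explicitly, roughly:

```go
package main

import (
	"io"
	"os"

	"github.com/klauspost/compress/zstd"
)

// Compress stdin to stdout. With klauspost/compress, the library's default
// SpeedDefault level does not reproduce the reference zstd binary's default
// output; SpeedBetterCompression was needed to match it.
func main() {
	enc, err := zstd.NewWriter(os.Stdout,
		zstd.WithEncoderLevel(zstd.SpeedBetterCompression))
	if err != nil {
		panic(err)
	}
	if _, err := io.Copy(enc, os.Stdin); err != nil {
		panic(err)
	}
	if err := enc.Close(); err != nil {
		panic(err)
	}
}
```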

This would mean that modelkit digests are reproducible (under zstd) only with the same version of kit used to do the original pack -- I'm not sure that's a huge issue, though, since different versions of kit may already produce different digests.

The discussion we were having around allowing different options was similar: if we allow e.g. none and zstd, then packing the exact same data could lead to two different digests. Again, I'm not sure this is a huge issue (we mostly care about retrieving the expected modelkit), but it's worth considering.
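To make that concrete, here's a self-contained snippet showing that identical input bytes yield different layer digests under different compression options (purely illustrative; the data is a stand-in for real weights):

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"

	"github.com/klauspost/compress/zstd"
)

func main() {
	// Stand-in for layer content; real model weights would be far larger.
	data := bytes.Repeat([]byte("model weights "), 1<<16)

	// Digest of the uncompressed layer.
	fmt.Printf("none: sha256:%x\n", sha256.Sum256(data))

	// Digest of the same content compressed with zstd: a different digest
	// for identical input data.
	var buf bytes.Buffer
	enc, err := zstd.NewWriter(&buf)
	if err != nil {
		panic(err)
	}
	if _, err := enc.Write(data); err != nil {
		panic(err)
	}
	if err := enc.Close(); err != nil {
		panic(err)
	}
	fmt.Printf("zstd: sha256:%x\n", sha256.Sum256(buf.Bytes()))
}
```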

gorkem commented 1 month ago

Also, unpack needs to be aware of the compression format of the packed blob so it can use the correct method for decompression.

amisevsk commented 1 month ago

That part is working on the branch: the mediatype identifies the compression format, so unpack chooses the correct decompression automatically.

amisevsk commented 1 month ago

Opened a PR based on a stripped-down version of the branch mentioned above: