microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Feature Request] ONNX lossy+lossless compression. #15456

Open elephantpanda opened 1 year ago

elephantpanda commented 1 year ago

Describe the feature request

Here is the proposal:

Suppose we have an ONNX file of float16 weights, say 10GB. Let's say we "quantize" these into int8s, so the file shrinks to 5GB. But we don't run it as int8: when we load the InferenceSession, the weights get converted back into float16.

So basically this is just a float16 ONNX file that is compressed on disk (much like a GIF or JPEG is a compressed image file).

It would be cool to have some built-in lossy and lossless compression functions specifically designed for neural networks, which would create smaller ONNX files.

Just ZIPping it with standard compressors is not optimal and wastes resources.

If this is not a good fit for ONNX Runtime, perhaps you could suggest some good compression algorithms that work well with neural networks?

Describe scenario use case

When sharing an application, people would only need to download smaller files, saving hard disk space. The files would be decompressed sequentially at runtime and become full size on the GPU before inference (using an efficient loader such as this).

float16 or float32 often runs faster on GPUs than int8, so it would be nice to store the values on disk as int8 but run them on the GPU as float16.

Personally, I am having a problem: my 256GB SSD is filling up with very large (>5GB) ONNX files, so some sort of compression would be appreciated.

Basically I am proposing a "Compressed-ONNX" format (both lossy and lossless) together with the functions to decompress it at runtime. I will probably end up implementing this myself, but it is just a suggestion for you.

Ideal Solution

The ideal solution would be to have an ONNX model compressed in int8 format, and then you could choose whether to run it on the GPU as int8, float16 or float32. I think this is possible since you could read in the "DeQuantize layers" and use them to convert the int8 back to float before inference. In other words, there should be a DeQuantize(model) function.
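Such a DeQuantize(model) helper does not exist in ONNX Runtime today; below is a minimal sketch of what it could look like in Python, assuming per-tensor quantization and an assumed "<name>_scale" / "<name>_zero_point" naming convention for the companion initializers (real quantizers may name things differently).

```python
# Minimal sketch of a hypothetical DeQuantize(model) step.
# Assumes per-tensor quantization and an assumed "<name>_scale" / "<name>_zero_point"
# initializer naming convention.
import numpy as np
import onnx
from onnx import numpy_helper

def dequantize_initializers(model: onnx.ModelProto) -> dict[str, np.ndarray]:
    """Recover float16 arrays from the int8 initializers of a quantized model."""
    inits = {t.name: numpy_helper.to_array(t) for t in model.graph.initializer}
    recovered = {}
    for name, q in inits.items():
        scale = inits.get(name + "_scale")            # assumed naming convention
        zero_point = inits.get(name + "_zero_point")  # assumed naming convention
        if q.dtype == np.int8 and scale is not None and zero_point is not None:
            # DequantizeLinear semantics: x = (q - zero_point) * scale
            x = (q.astype(np.float32) - zero_point.astype(np.float32)) * scale
            recovered[name] = x.astype(np.float16)
    return recovered

model = onnx.load("model_int8.onnx")  # hypothetical file name
fp16_weights = dequantize_initializers(model)
```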

elephantpanda commented 1 year ago

Actually I have thought of a workflow for this (not sure if it will work):

  1. Use quantize_dynamic(..) or quantize_static(..) on an onnx file.
  2. Make sure both the original onnx and the quantized onnx have external weights
  3. The weight files should have the same names in both cases except for some .weight renamed as .weight_quantized
  4. Load in the .weight files and the .weight_quantized files to work out the scale and offset values (there is probably an easier way; see the sketch after this list)
  5. Store the scale/offsets in an XML file.
  6. Create weight-less versions of the original onnx and the quantized onnx files (see this script for example)
  7. At run time the user can either run the model on the GPU as int8, or dequantize all the weights before inference using our XML file and run on the GPU as float16 at optimum speed.
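For step 4, one way to work out the scale and zero point (assuming per-tensor, uniform quantization) is to fit a straight line between the original float weights and their quantized counterparts; a rough sketch:

```python
# Rough sketch for step 4, assuming per-tensor uniform quantization:
# the original weights satisfy w ≈ scale * (q - zero_point), so a 1-D
# least-squares line fit of w against q recovers both values.
import numpy as np

def recover_scale_zero_point(w_float: np.ndarray, w_quant: np.ndarray):
    slope, intercept = np.polyfit(w_quant.astype(np.float64).ravel(),
                                  w_float.astype(np.float64).ravel(), deg=1)
    scale = slope
    zero_point = int(round(-intercept / slope)) if slope != 0 else 0
    return scale, zero_point
```

In practice the quantized model usually already stores these values as scale/zero-point initializers, which is probably the "easier way" alluded to in step 4.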

Advantages Now instead of a 10GB float16 onnx file you have 5GB of files. Depending on the user's GPU, if they have lots of VRAM they could choose to run this fast as float16, or slower as int8, or faster on the CPU as int8, all from that single set of 5GB files. Thus this single app could run on a range of GPUs.

Further Improvements This is an example of lossy compression. It would still be nice to include some kind of lossless compression. I think you can get a 20% size improvement just by zipping the quantized weight files, because they are not entirely random but cluster around zero. (Mind you, the fact that you can do this compression probably means the quantization method is not optimal; a non-uniform quantization may be better.)

Questions Is there an easier way to get the scale and offset of the quantized weights? Is there a fast function, something like Float16[] Dequantize(uint8[] data, float offset, float scale)? I guess this could be done in a shader.
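For reference, a vectorized NumPy version of such a Dequantize function could look like the sketch below (a GPU/shader version would apply the same formula); this is just an illustration, not an existing ONNX Runtime API.

```python
# Sketch of the proposed Dequantize function in NumPy; "offset" plays the role
# of the zero point in x = (q - offset) * scale.
import numpy as np

def dequantize(data: np.ndarray, offset: float, scale: float) -> np.ndarray:
    """uint8/int8 -> float16, vectorized over the whole array."""
    return ((data.astype(np.float32) - offset) * scale).astype(np.float16)
```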

wschin commented 1 year ago

Lossless compression for neural networks is almost impossible and a very difficult research direction. If we print out the digits of those model weights, we will see they don't have obvious patterns, so a bad compression ratio is expected. On the other hand, the magnitudes of the weights are much more concentrated. That's why we can do int8 or float16 compression without losing anything meaningful.

elephantpanda commented 1 year ago

> Lossless compression for neural networks is almost impossible and a very difficult research direction. If we print out the digits of those model weights, we will see they don't have obvious patterns, so a bad compression ratio is expected. On the other hand, the magnitudes of the weights are much more concentrated. That's why we can do int8 or float16 compression without losing anything meaningful.

That is true, but I'm talking more about compression for quantized models. The int8 values aren't random: if the weights are uniformly quantized they follow a roughly normal distribution around zero, and they also contain a lot of zeros. A simple experiment shows that you can losslessly compress a quantized onnx file. What we can do is this pipeline:

float16 model --> (lossy compress) --> int8 model --> (lossless compress) --> zipped model

This is my experiment with the Cerebras-111 large language model:

For example, if I take my int8-quantized ONNX file (147MB) and zip it, it becomes 111MB, which is roughly a 25% space saving. Not so shabby. I'm not sure if there is a better or faster compression than ZIP that would work for this case. There is probably a better algorithm that could take advantage of reordering rows in the weight matrices.

When I zip up a float16 onnx file of 290MB I get a file of 262MB, which is only a 10% space saving, so in this case not much is gained.
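This kind of comparison can be reproduced with Python's built-in zlib (DEFLATE, roughly what a ZIP archive uses); the file names below are placeholders.

```python
# Measure the lossless compression ratio of external weight files with zlib.
# File names are placeholders.
import zlib
from pathlib import Path

for path in ["model_int8.weight_quantized", "model_fp16.weight"]:
    raw = Path(path).read_bytes()
    packed = zlib.compress(raw, level=6)
    saving = 100.0 * (1.0 - len(packed) / len(raw))
    print(f"{path}: {len(raw)/1e6:.0f} MB -> {len(packed)/1e6:.0f} MB ({saving:.0f}% saved)")
```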

For lossy compression, I have already implemented a method which simply takes the quantized weights (from a cloned and quantized model) and decompresses them back to float16 to run on the GPU. This gives a 50% space saving while keeping the speed of float16 on the GPU. If I also zip the quantized weights, this gives a total space saving of around 60%, leaving files of about 1/3 the size sitting on the hard drive. Very convenient if dealing with a lot of neural networks. 😀 (In fact, int8 shouldn't run slower than float16, but that's another issue!)

My current bottlenecks are, first, finding a fast float[] --> float16[] conversion method, which I might have to delegate to the GPU, and secondly finding a good lossless-compression algorithm (but I might skip this part, as 50% is still quite good).
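For what it's worth, in Python the float[] --> float16[] step is just a bulk NumPy cast, which is already quite fast on the CPU (the array below is only a stand-in for real dequantized weights):

```python
import numpy as np

w32 = np.random.rand(10_000_000).astype(np.float32)  # stand-in for dequantized weights
w16 = w32.astype(np.float16)                          # bulk float32 -> float16 cast
```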

The only problem with my method is that, depending on how you save the onnx file, the weights can end up with different names... ☹

laclouis5 commented 8 months ago

I was quite surprised not to find weight compression utilities in the ONNX framework (not to be confused with quantization, which can affect the model activations). Weight compression techniques seem quite mature IMO, yet I don't see this capability supported in many frameworks.

In my experience, compressing weights to 8 bits using palettization or linear quantization yields almost no accuracy loss for nearly all the models I have tried so far (vision models, LLMs, multimodal, etc.). Even 6-bit compression doesn't incur a large accuracy loss for most models I tried. There are also new methods appearing, such as mixed-bit weight compression, that allow even higher compression ratios, still without fine-tuning or Quantization-Aware Training (QAT).
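For illustration, 8-bit palettization can be sketched as k-means clustering of the weights into a 256-entry codebook plus uint8 indices; this is a toy sketch of the idea, not an ONNX or ONNX Runtime API.

```python
# Toy sketch of 8-bit weight palettization: cluster the float weights into a
# 256-entry codebook (lookup table) and store uint8 indices plus the codebook.
import numpy as np
from sklearn.cluster import KMeans

def palettize(weights: np.ndarray, n_colors: int = 256):
    flat = weights.reshape(-1, 1).astype(np.float32)
    km = KMeans(n_clusters=n_colors, n_init=4, random_state=0).fit(flat)
    indices = km.predict(flat).astype(np.uint8).reshape(weights.shape)
    codebook = km.cluster_centers_.astype(np.float16).ravel()
    return indices, codebook

def depalettize(indices: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    return codebook[indices]  # lookup-table decode back to float16
```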

Is there somewhere to discuss such methods for ONNX?