huggingface / swift-transformers

Swift Package to implement a transformers-like API in Swift

Optimization: cache past key-values #9

Open pcuenca opened 1 year ago

pcuenca commented 1 year ago

This needs to go hand in hand with changes to the conversion process in exporters and transformers-to-coreml.
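
As a rough sketch of the kind of change this implies on the conversion side (not the actual exporters code; the tiny module, names, and shapes below are all illustrative), the traced graph would need the past key-values as explicit inputs and outputs so the exported Core ML model can be handed the cache at every decoding step:

```python
# Minimal sketch: make past key-values explicit I/O of the traced graph so the
# converted Core ML model can receive and return a cache on each step.
# All names and shapes here are assumptions, not the exporters implementation.
import torch
import coremltools as ct


class TinyAttentionWithCache(torch.nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = torch.nn.Linear(d_model, 3 * d_model)
        self.out = torch.nn.Linear(d_model, d_model)

    def forward(self, x, past_k, past_v):
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (b, t, d) -> (b, heads, t, d_head)
        split = lambda z: z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        # Append the new keys/values to the cache passed in from the previous step.
        k = torch.cat([past_k, k], dim=2)
        v = torch.cat([past_v, v], dim=2)
        attn = torch.softmax(q @ k.transpose(-1, -2) / self.d_head ** 0.5, dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        # Return the updated cache so the caller can feed it back next step.
        return self.out(y), k, v


model = TinyAttentionWithCache().eval()
x = torch.randn(1, 1, 64)            # one new token
past_k = torch.randn(1, 4, 8, 16)    # cache of 8 previous positions
past_v = torch.randn(1, 4, 8, 16)

traced = torch.jit.trace(model, (x, past_k, past_v))
mlmodel = ct.convert(
    traced,
    inputs=[
        ct.TensorType(name="x", shape=x.shape),
        ct.TensorType(name="past_k", shape=past_k.shape),
        ct.TensorType(name="past_v", shape=past_v.shape),
    ],
    minimum_deployment_target=ct.target.iOS16,
)
```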

antmikinka commented 1 year ago

Caching, and the latency/memory behavior of these models, is key. I found this in the coremltools docs:

"Impact on Latency and Compute Unit Considerations

For a model that is primarily running on the Neural Engine, sparsity typically helps in improving latency. Firstly, it reduces the amount of weight memory to be loaded at inference time, which is beneficial for networks that are weight memory bound (note that starting from iOS17/macOS14, for ops running on the Neural Engine, sparse weights are decompressed at prediction time). In addition to that, when a relatively long string of consecutive 0s are encountered, the Neural Engine may also be able to skip computations, thereby reducing the amount of computation as well. This means choosing higher levels of sparsity (e.g. 75% or higher) can lead to more latency gains than lower levels. This also means that choosing a block structured kind of sparsity with larger block sizes may be more beneficial. However, note that it's also relatively harder to preserve accuracy with stricter constraints like larger block size and higher level of sparsity."

For a model that has a lot of linear ops and uses a specific kind of sparsity; that is, n:m such that m is a factor of 16 (such as 3:4, 7:8, 14:16, and so on), it can benefit from the CPU compute unit performance in newer hardware generations, thereby resulting in faster inference (https://coremltools.readme.io/docs/pruning-overview)
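
For reference, here is a hedged sketch of what such an n:m post-training pruning pass might look like with the coremltools optimize API (coremltools 7+); the 3:4 ratio and the model path are placeholders, not a recommendation:

```python
# Sketch only: apply n:m magnitude pruning to an existing .mlpackage.
# "Llama.mlpackage" is a placeholder path; 3:4 is one of the m-divides-16 patterns
# mentioned in the docs quoted above.
import coremltools as ct
from coremltools.optimize.coreml import (
    OpMagnitudePrunerConfig,
    OptimizationConfig,
    prune_weights,
)

mlmodel = ct.models.MLModel("Llama.mlpackage")  # placeholder path

config = OptimizationConfig(
    global_config=OpMagnitudePrunerConfig(n_m_ratio=(3, 4))
)
sparse_mlmodel = prune_weights(mlmodel, config=config)
sparse_mlmodel.save("Llama-sparse.mlpackage")
```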

GitHub repo: ml-stable-diffusion. If I am not mistaken, it uses pruning and palettization of the model for better inference on Apple devices. I believe for Llama v2 we'd use post-training pruning and post-training palettization. I will be working on this, but some people are faster than me, so I figured I would share the information. Will follow up soon with another post.
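
A similarly hedged sketch of post-training palettization, along the lines of what ml-stable-diffusion ships; `nbits=4` and the path are assumptions, not tuned values:

```python
# Sketch only: post-training palettization of an existing Core ML model.
# "Llama.mlpackage" is a placeholder path; nbits=4 with k-means is just an example.
import coremltools as ct
from coremltools.optimize.coreml import (
    OpPalettizerConfig,
    OptimizationConfig,
    palettize_weights,
)

mlmodel = ct.models.MLModel("Llama.mlpackage")  # placeholder path

config = OptimizationConfig(
    global_config=OpPalettizerConfig(mode="kmeans", nbits=4)
)
palettized_mlmodel = palettize_weights(mlmodel, config=config)
palettized_mlmodel.save("Llama-4bit.mlpackage")
```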

antmikinka commented 1 year ago

This issue with caching past key-values also relates to PyTorch's DataLoader issue https://github.com/pytorch/pytorch/issues/13246. It happens because of how CPython, Python multiprocessing, and the PyTorch DataLoader work together.

Here is a podcast discussing the issue, how it works, and several ways to fix it: https://pytorch-dev-podcast.simplecast.com/episodes/dataloader-with-multiple-workers-leaks-memory

While trying to convert a Llama model or any other big model, I believe numerous Python objects are being written to and read from during the PyTorch-to-Core ML conversion. I have tried Google Cloud, Google Colab, buying more RAM, and different LLM sizes; none of them have worked.

The past key-values are stored as a Python list that PyTorch has to load from, which causes multiple reads/writes, and possibly more depending on how coremltools reads each layer from PyTorch during conversion.

I am going to try converting a couple of the Python lists/dicts into NumPy arrays in a few places and see how that goes. Hopefully I will not have to convert all the lists/dicts in coremltools.
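
Something like the following is what I mean by turning the list into arrays (a sketch only; the helper names and shapes are made up, not coremltools internals):

```python
# Sketch: collapse the per-layer Python list of (key, value) tuples into two
# stacked tensors, so the tracer/converter handles a couple of arrays instead
# of many small Python objects. Names and shapes are illustrative.
import torch

def stack_past_key_values(past_key_values):
    """past_key_values: list of (k, v) tensors, one tuple per layer."""
    keys = torch.stack([k for k, _ in past_key_values])    # (layers, b, heads, seq, d_head)
    values = torch.stack([v for _, v in past_key_values])  # (layers, b, heads, seq, d_head)
    return keys, values

def unstack_past_key_values(keys, values):
    """Rebuild the per-layer list the model expects from the two stacked tensors."""
    return [(keys[i], values[i]) for i in range(keys.shape[0])]

# Example with fake shapes: 2 layers, batch 1, 4 heads, 8 cached tokens, head dim 16.
pkv = [(torch.zeros(1, 4, 8, 16), torch.zeros(1, 4, 8, 16)) for _ in range(2)]
keys, values = stack_past_key_values(pkv)
assert keys.shape == (2, 1, 4, 8, 16)
assert unstack_past_key_values(keys, values)[0][0].shape == (1, 4, 8, 16)
```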

antmikinka commented 4 months ago

This may help us open the door to implementing similar code in swift-transformers:

apple/ml-recurrent-drafter

Their modeling_llama.py even implements the split attention layers, redefined states, the four NHWC channels, the KV cache, and the repeat_interleave op.

I wonder if we would be able to take some of this logic and apply it to OpenELM float16 (because of the repeat_interleave op in modeling_llama.py).
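
For context, this is the repeat_interleave pattern in question, as used in grouped-query attention where the cached K/V tensors have fewer heads than the queries and get expanded before the attention matmul (shapes below are made up):

```python
# Sketch of the repeat_interleave expansion used in grouped-query attention.
# Head counts, sequence length, and head dim are arbitrary example values.
import torch

n_query_heads, n_kv_heads = 8, 2
b, seq, d_head = 1, 16, 64

q = torch.randn(b, n_query_heads, seq, d_head)
k = torch.randn(b, n_kv_heads, seq, d_head)

# Repeat each KV head (n_query_heads // n_kv_heads) times along the head dimension.
k_expanded = k.repeat_interleave(n_query_heads // n_kv_heads, dim=1)
assert k_expanded.shape == (b, n_query_heads, seq, d_head)

scores = q @ k_expanded.transpose(-1, -2) / d_head ** 0.5
```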

Check out this issue/comment for some more information: Convert OpenELM to float16 Core ML.