irthomasthomas / undecidability


S-LoRA: Serving Thousands of Models From One GPU for Fun and Profit - OpenPipe #636

Open irthomasthomas opened 9 months ago

irthomasthomas commented 9 months ago


DESCRIPTION:
S-LoRA describes a set of optimizations for running thousands of separate LLMs simultaneously on the same GPU. At OpenPipe we’ve been running S-LoRA in production since January 4th, which critically allowed us to eliminate the cold-start problem for infrequently-used models. I wanted to share some of our learnings from the implementation process here!
But first, here’s the average cold-start response time we’re seeing after enabling the S-LoRA based pipeline:

The Problem of Weights
Modern LLMs require a lot of GPU RAM. A “small” model like Mistral 7B requires 14GB of RAM just to hold the weights at 16-bit precision (roughly 7 billion parameters at 2 bytes each), in addition to the working memory required for the KV cache, which can be multiple GB for long sequences. This means that even a very beefy GPU like an A100-40GB only has room to load one or maybe two 7B LLMs in RAM at once. Quantization can reduce the required memory, but it can also decrease throughput, and often hurts response quality as well.
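The arithmetic behind those numbers can be sketched in a few lines. This is a back-of-the-envelope estimate only: it assumes a nominal 7e9 parameters, 2 bytes per parameter, and 1 GB = 1e9 bytes, and the 8 GB KV-cache reservation is an illustrative value, not a measurement.

```python
# Rough estimate of how much GPU memory model weights need.
# Assumptions (not from the post): nominal 7e9 parameters, 1 GB = 1e9 bytes.

def weight_memory_gb(n_params: float, bytes_per_param: float = 2.0) -> float:
    """Memory needed just to hold the weights, in GB."""
    return n_params * bytes_per_param / 1e9


def models_that_fit(gpu_gb: float, model_gb: float, kv_cache_gb: float = 0.0) -> int:
    """Whole model copies that fit if each copy also reserves kv_cache_gb."""
    return int(gpu_gb // (model_gb + kv_cache_gb))


mistral_7b_gb = weight_memory_gb(7e9)             # 16-bit weights
print(mistral_7b_gb)                              # 14.0
print(models_that_fit(40, mistral_7b_gb))         # 2 (no room left for KV cache)
print(models_that_fit(40, mistral_7b_gb, 8.0))    # 1 (8 GB of KV cache per copy)
```

With any realistic KV-cache budget, the second copy no longer fits, which is where the "one or maybe two" figure comes from.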
This is not really a problem if you’re using one general-purpose model for everything, and just steering its behavior via prompting. In that case you can just load up your model on one GPU and call it a day. But fine-tuning is a far more reliable way of directing model behavior than prompting. Concretely, we’ve found that 7B models fine-tuned on a good dataset consistently outperform prompted GPT-3.5 (reportedly around 20B parameters), and even come within striking distance of GPT-4 (rumored to be around 1.7T parameters; OpenAI has not published either figure)!

The downside, of course, is that now you have to figure out how to serve all those task-specific fine-tuned models efficiently. Spinning up a dedicated GPU for each model is a non-starter because it leads to low GPU utilization, which is an existential issue because of how expensive GPU time is ($2+/hr for an A100). How do we square the circle?

Serving all the models everywhere all at once
First, a bit of background: in 2021 a new fine-tuning method called LoRA was published. The key insight is that training only a small number of extra parameters, while leaving the base model’s weights frozen, can give you similar results to fine-tuning all of the weights, since you want your fine-tuned model to keep most of the world understanding and reasoning ability of its base. The LoRA technique involves inserting small low-rank adapter matrices alongside a few carefully-selected weight matrices and only training those. These adapters are analogous to a “git diff” that encodes only the difference in weights between the base model and your fine-tune.
These adapters can be tiny. In OpenPipe’s case, our Mistral adapters are 80MB each, only about 0.6% the size of the 14GB base model. This immediately points to the shape of the solution: is it possible to load many adapters from the same base model onto one GPU and use them simultaneously, efficiently?
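To make the "git diff" analogy concrete, here is a minimal plain-Python sketch of a LoRA-style update on a single weight matrix. The frozen matrix `W` is left untouched and the adapter contributes a low-rank product `B @ A`. The function names, the `scale` parameter, and the 4096 x 4096 / rank-8 shapes are illustrative assumptions, not OpenPipe's configuration.

```python
# Sketch of a LoRA update on one weight matrix, using plain Python lists.
# W is d_out x d_in and stays frozen; the adapter is B (d_out x r) times A (r x d_in).
# All names and shapes here are hypothetical, chosen only for illustration.

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_effective_weight(W, A, B, scale=1.0):
    """Return W + scale * (B @ A) without modifying W."""
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


def lora_param_fraction(d_out, d_in, rank):
    """Adapter parameter count as a fraction of the full matrix's parameter count."""
    return rank * (d_out + d_in) / (d_out * d_in)


W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weights
A = [[1.0, 2.0]]               # rank 1, d_in = 2
B = [[0.5], [0.0]]             # d_out = 2, rank 1
print(lora_effective_weight(W, A, B))      # [[1.5, 1.0], [0.0, 1.0]]
print(lora_param_fraction(4096, 4096, 8))  # 0.00390625
```

Because only `A` and `B` need to be stored per fine-tune, the per-model footprint shrinks to a fraction of a percent of the matrices it modifies, which is what makes sharing one base model across many adapters attractive.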
It turns out the answer is “yes”! Two influential papers from late 2023 help define the solution.
Punica implements a clever CUDA kernel that is able to batch-process requests from many LoRA adapters simultaneously. This custom kernel is essential, because the naive approach taken by most libraries pre-Punica required swapping adapters for each request, eliminating the critical throughput increases from serving many requests in parallel.
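The computation such a kernel performs can be sketched as follows: every request in a batch shares the same base weights, and each request additionally gathers its own adapter for the low-rank correction. This is a plain-Python illustration of the math only. The loop stands in for work that a real kernel does in parallel on the GPU, and the function and variable names are hypothetical rather than Punica's API.

```python
# Illustration of multi-adapter batching: y_i = W x_i + B_i (A_i x_i).
# The Python loop is a stand-in for a fused GPU kernel; names are hypothetical.

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def batched_multi_lora(W, adapters, batch):
    """batch is a list of (adapter_id, x); adapters maps adapter_id -> (A, B)."""
    outputs = []
    for adapter_id, x in batch:
        base = matvec(W, x)                 # shared base weights for every request
        A, B = adapters[adapter_id]         # gather this request's adapter
        delta = matvec(B, matvec(A, x))     # two small low-rank products
        outputs.append([b + d for b, d in zip(base, delta)])
    return outputs


W = [[1, 0], [0, 1]]
adapters = {
    "a": ([[1, 0]], [[1], [0]]),
    "b": ([[0, 1]], [[0], [2]]),
}
batch = [("a", [1, 1]), ("b", [1, 1]), ("a", [2, 0])]
print(batched_multi_lora(W, adapters, batch))  # [[2, 1], [1, 3], [4, 0]]
```

The point of the design is that no adapter is ever merged into or swapped out of `W`, so requests for different fine-tunes can sit in the same batch.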
S-LoRA builds on Punica and adds a tiered caching architecture. It dynamically stores the most-recently-used adapters in GPU RAM, less-recently-used adapters in system RAM, and the least-recently-used adapters on disk. For a typical setup with 10GB of available GPU RAM and 1TB of system RAM, S-LoRA might store 125 adapters on the GPU and over 10K in system RAM. The overhead of restoring an adapter from system RAM to the GPU is negligible in practice; an A100 has 31GB/s of interconnect bandwidth so an 80MB adapter can be transferred in about 2.6ms. This can happen in parallel with serving other requests.
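The tiering idea can be illustrated with a toy least-recently-used cache. This is a sketch of the general technique, not S-LoRA's actual implementation: the class, its method names, and the slot-based capacity model are all hypothetical, and the two helper functions simply redo the capacity and transfer-time arithmetic using the figures quoted in the post.

```python
from collections import OrderedDict

# Toy three-tier LRU cache (GPU -> host RAM -> disk). Hypothetical design for
# illustration only; it tracks adapter ids, not real tensors.

class TieredAdapterCache:
    def __init__(self, gpu_slots: int, ram_slots: int):
        self.gpu_slots = gpu_slots
        self.ram_slots = ram_slots
        self.gpu = OrderedDict()   # most recently used entries are last
        self.ram = OrderedDict()

    def get(self, adapter_id: str) -> str:
        """Mark an adapter as used and return the tier it was found in."""
        if adapter_id in self.gpu:
            self.gpu.move_to_end(adapter_id)
            return "gpu"
        found = "ram" if adapter_id in self.ram else "disk"
        self.ram.pop(adapter_id, None)
        self._promote(adapter_id)
        return found

    def _promote(self, adapter_id: str) -> None:
        """Load an adapter onto the GPU, demoting the LRU entries as needed."""
        self.gpu[adapter_id] = None
        if len(self.gpu) > self.gpu_slots:
            evicted, _ = self.gpu.popitem(last=False)   # GPU LRU drops to RAM
            self.ram[evicted] = None
            if len(self.ram) > self.ram_slots:
                self.ram.popitem(last=False)            # RAM LRU lives on disk only


def slots(tier_bytes: int, adapter_bytes: int) -> int:
    """How many adapters of a given size fit in a tier."""
    return tier_bytes // adapter_bytes


def transfer_ms(adapter_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to copy one adapter at the given bandwidth, in milliseconds."""
    return adapter_bytes / bandwidth_bytes_per_s * 1000


print(slots(10 * 10**9, 80 * 10**6))      # 125 adapters in 10GB of GPU RAM
print(slots(10**12, 80 * 10**6))          # 12500 adapters in 1TB of system RAM
print(round(transfer_ms(80e6, 31e9), 2))  # milliseconds per 80MB adapter at 31GB/s

cache = TieredAdapterCache(gpu_slots=2, ram_slots=2)
for adapter in ["a", "b", "a", "c", "b"]:
    print(adapter, cache.get(adapter))
```

In the short trace at the end, adapter `b` is pushed out to system RAM when `c` arrives and is then served from RAM on its next request, which is the cheap path the post describes.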
This actually works!
On January 4th we deployed an experimental inference pipeline based on a vLLM fork that implements the relevant optimizations. After manually moving a few models over and closely monitoring performance, we enabled the pipeline for all new models on January 10th, and began porting over old models as well.
Over the course of this transition, the average number of GPUs in use has dropped by over 70%, even as the number of requests we serve has continued increasing! Our average response time for models coming up from a cold start (i.e., weights not already loaded onto a GPU) decreased from 45 seconds to 1 second, giving customers a lot more flexibility to deploy many small specialist models. And ultimately, that’s exactly what we’re here to do. 🙂

URL: https://openpipe.ai/blog/s-lora

Suggested labels

{'label-name': 'GPU-Optimization', 'label-description': 'Optimizing GPU resource utilization for running multiple models efficiently on a single GPU.', 'gh-repo': 'openpipe/openpipe-ai', 'confidence': 54.2}

irthomasthomas commented 9 months ago

Related issues

505: LoRAX: Dynamic loading and optimized inference of LoRA adapter models.

### Details

Similarity score: 0.92

- [ ] [LoRAX Docs](https://predibase.github.io/lorax/?h=cpu#features)

# LoRAX Docs

###### Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

📖 **What is LoRAX?**

LoRAX (LoRA eXchange) is a framework that allows users to serve thousands of fine-tuned models on a single GPU, dramatically reducing the cost of serving without compromising on throughput or latency.

🌳 **Features**

- 🚅 Dynamic Adapter Loading: include any fine-tuned LoRA adapter in your request, it will be loaded just-in-time without blocking concurrent requests.
- 🏋️‍♀️ Heterogeneous Continuous Batching: packs requests for different adapters together into the same batch, keeping latency and throughput nearly constant with the number of concurrent adapters.
- 🧁 Adapter Exchange Scheduling: asynchronously prefetches and offloads adapters between GPU and CPU memory, schedules request batching to optimize the aggregate throughput of the system.
- 👬 Optimized Inference: high throughput and low latency optimizations including tensor parallelism, pre-compiled CUDA kernels (flash-attention, paged attention, SGMV), quantization, token streaming.
- 🚢 Ready for Production: prebuilt Docker images, Helm charts for Kubernetes, Prometheus metrics, and distributed tracing with Open Telemetry. OpenAI compatible API supporting multi-turn chat conversations. Private adapters through per-request tenant isolation.
- 🤯 Free for Commercial Use: Apache 2.0 License. Enough said 😎.

URL: [https://predibase.github.io/lorax/?h=cpu#features](https://predibase.github.io/lorax/?h=cpu#features)

#### Suggested labels

{ "label-name": "LoRA Framework", "description": "A powerful framework for serving fine-tuned models on a single GPU efficiently.", "repo": "llm-inference-engines", "confidence": 98.7 }
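As an illustration of what "include any fine-tuned LoRA adapter in your request" might look like, the sketch below builds a JSON request body that names an adapter per request. The field names (`inputs`, `parameters`, `adapter_id`, `max_new_tokens`) reflect the request shape as I understand it from the LoRAX documentation and should be verified against the version you run; the helper function and the adapter id are hypothetical, and nothing is actually sent over the network.

```python
import json

# Sketch of a per-request adapter selection payload.
# Field names are an assumption to check against the LoRAX docs;
# build_generate_request and "my-org/my-adapter" are hypothetical.

def build_generate_request(prompt, adapter_id=None, max_new_tokens=64):
    """Return a request body; omit adapter_id to target the base model."""
    parameters = {"max_new_tokens": max_new_tokens}
    if adapter_id is not None:
        parameters["adapter_id"] = adapter_id
    return {"inputs": prompt, "parameters": parameters}


body = build_generate_request("Summarize this ticket.", adapter_id="my-org/my-adapter")
print(json.dumps(body, indent=2))
```

Two requests built this way with different `adapter_id` values could then be sent to the same server, which is the scenario the batching and scheduling features above are designed for.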

408: llama.cpp/examples/llama-bench/README.md at master · ggerganov/llama.cpp

### Details

Similarity score: 0.85

- [ ] [llama.cpp/examples/llama-bench/README.md at master · ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp/blob/master/examples/llama-bench/README.md)

### Llama Benchmarking Tool

This is a performance testing tool for llama.cpp. It allows you to test the performance of the library with different models, prompt processing batch sizes, number of threads, number of layers offloaded to the GPU, and output formats.

#### Table of Contents

- [Syntax](#syntax)
- [Examples](#examples)
  - [Text generation with different models](#text-generation-with-different-models)
  - [Prompt processing with different batch sizes](#prompt-processing-with-different-batch-sizes)
  - [Different numbers of threads](#different-numbers-of-threads)
  - [Different numbers of layers offloaded to the GPU](#different-numbers-of-layers-offloaded-to-the-gpu)
- [Output formats](#output-formats)

#### Syntax

```
usage: ./llama-bench [options]

options:
  -h, --help                 Show this help message and exit
  -m, --model                (default: models/7B/ggml-model-q4_0.gguf)
  -p, --n-prompt             (default: 512)
  -n, --n-gen                (default: 128)
  -b, --batch-size           (default: 512)
  --memory-f32 <0|1>         (default: 0)
  -t, --threads              (default: 16)
  -ngl N, --n-gpu-layers     (default: 99)
  -mg i, --main-gpu          (default: 0)
  -mmq, --mul-mat-q <0|1>    (default: 1)
  -ts, --tensor_split
  -r, --repetitions          (default: 5)
  -o, --output               (default: md)
  -v, --verbose              (default: 0)
```

Multiple values can be given for each parameter by separating them with `,` or by specifying the parameter multiple times.

#### Examples

* Testing the performance of the model with default settings:

  ```
  ./llama-bench
  ```

* Testing the performance of the model with a specific batch size:

  ```
  ./llama-bench -b 1024
  ```

* Testing the performance of the model with a specific model file:

  ```
  ./llama-bench -m models/7B/ggml-model-q4_1.gguf
  ```

* Testing the performance of the model with a specific number of prompt and generated tokens:

  ```
  ./llama-bench -p 1024 -n 2048
  ```

* Testing the performance of the model with a specific number of threads:

  ```
  ./llama-bench -t 8
  ```

* Testing the performance of the model with a specific number of layers offloaded to the GPU:

  ```
  ./llama-bench -ngl 64
  ```

* Testing the performance of the model with a specific output format:

  ```
  ./llama-bench -o json
  ```

#### Text generation with different models

You can test the performance of the library with different models by specifying the model file using the `-m` or `--model` option.

#### Prompt processing with different batch sizes

You can test the performance of the library with different batch sizes by specifying the batch size using the `-b` or `--batch-size` option.

#### Different numbers of threads

You can test the performance of the library with different number of threads by specifying the number of threads using the `-t` or `--threads` option.

#### Different numbers of layers offloaded to the GPU

You can test the performance of the library with different number of layers offloaded to the GPU by specifying the number of GPU layers using the `-ngl` or `--n-gpu-layers` option.

#### Output formats

The benchmarking tool supports the following output formats:

- Markdown (`md`)
- CSV (`csv`)
- JSON (`json`)
- SQL (`sql`)

You can specify the output format using the `-o` or `--output` option.

#### Suggested labels
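Since the README notes that several values can be passed per parameter as a comma-separated list, a sweep can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of llama.cpp: it only builds the argument list from the short flags documented above and does not run the benchmark.

```python
import shlex

# Hypothetical helper that assembles a llama-bench command line.
# It relies only on the documented short flags and the comma-separated
# multi-value syntax; it does not execute anything.

def build_llama_bench_cmd(binary="./llama-bench", **params):
    """Map each flag (e.g. t, ngl, o) to one value or a list of values."""
    argv = [binary]
    for flag, value in params.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        argv += [f"-{flag}", ",".join(str(v) for v in values)]
    return argv


cmd = build_llama_bench_cmd(t=[4, 8, 16], ngl=[0, 99], o="json")
print(shlex.join(cmd))  # ./llama-bench -t 4,8,16 -ngl 0,99 -o json
```

The resulting list can be handed to `subprocess.run` unchanged if you want to launch the sweep from a script.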

457: I keep running out of memory. What's the biggest model, and most context, I can run on 3060 12gb? With decent speed? : r/LocalLLaMA

### Details

Similarity score: 0.85

- [ ] [I keep running out of memory. What's the biggest model, and most context, I can run on 3060 12gb? With decent speed? : r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/1abihou/i_keep_running_out_of_memory_whats_the_biggest/)

```markdown
# GPU and Model Recommendations

**GPU Only:**
- You can use 7B Models at 8 bpw with 8K context, or maybe up to 12k context.
- If you wish to use 13B models, then you have to use 4bpw and limit yourself to 2K Context.

**GPU + CPU:**
- Use `.gguf` files to offload part of the model to VRAM.
- Check the disk usage when inferencing in the activity monitor app (or whatever it is called in your OS). If the disk usage is 100% (disk is swapping), then it is impossible to fit the model in RAM + VRAM and tokens per second will be very low.
- In that case, reduce context size and reduce bpw.

The best models you can probably run now are:
- OpenChat 3.5 7B at 8bpw (Use the latest version)
- at 4bpw and 4K context.
- If you want to run Nous-Capybara-34b, switch to the 3bpw version and try to offload 35 layers to GPU.

If you want to run bigger models, upgrade RAM to 64GB.

**Tip from /u/Working-Flatworm-531:**
- Just do not load kv in VRAM, you can use `ooba` to disable it.
- Also try lower quants, for example Q4_K_S is still good. You still wouldn't be able to run 34B models with good speed, but at least it's something.
- You can also check your BIOS and maybe increase RAM frequency. After that, you'd be able to run ~20B models at ~2t/s at 8k ~ 12k context.

**Recommended List from /u/Working-Flatworm-531:**
- Use Linux.
- Overclock RAM (if possible).
- Overclock CPU (if possible).
- Overclock GPU.
- Don't load kv cache in VRAM, instead load more layers to the VRAM.
- Use smaller quants.
- Use fast interface (didn't try Kobold, use Ooba).
- Check RAM (should be dual channel).
```

#### Suggested labels

{ "label-name": "memory-optimization", "description": "Strategies for optimizing memory usage when running models on 3060 12gb GPU.", "confidence": 94.88 }
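The "7B at 8 bpw with 8K context" recommendation can be sanity-checked with a rough budget of weights plus KV cache. The sketch below assumes a Llama-2-7B-style shape (32 layers, 32 KV heads, head dimension 128), a 16-bit KV cache, and 1 GB = 1e9 bytes; it ignores activations and framework overhead, so treat the result as a lower bound rather than a guarantee.

```python
# Rough VRAM budget: quantized weights plus KV cache.
# Assumptions (not from the thread): Llama-2-7B-style shape, fp16 KV cache,
# no activation or framework overhead, 1 GB = 1e9 bytes.

def weights_gb(n_params, bits_per_weight):
    """Weight footprint at a given number of bits per weight (bpw)."""
    return n_params * bits_per_weight / 8 / 1e9


def kv_cache_gb(context_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """KV cache size; the leading 2 accounts for storing both keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_len / 1e9


def fits(vram_gb, n_params, bits_per_weight, context_len, **arch):
    """True if weights plus KV cache fit within vram_gb."""
    total = weights_gb(n_params, bits_per_weight) + kv_cache_gb(context_len, **arch)
    return total <= vram_gb


llama_7b = dict(n_layers=32, n_kv_heads=32, head_dim=128)  # assumed model shape

print(weights_gb(7e9, 8))                     # 7.0
print(kv_cache_gb(8192, **llama_7b))          # about 4.29
print(fits(12, 7e9, 8, 8192, **llama_7b))     # True
print(fits(12, 7e9, 8, 12288, **llama_7b))    # False
```

Under these assumptions 8K context fits in 12GB but 12K does not, which is consistent with the "maybe" in the advice; models with fewer KV heads, or a KV cache kept out of VRAM as suggested in the tips, shift that boundary.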

153: Mastering LLM Techniques: Inference Optimization | NVIDIA Technical Blog

### Details

Similarity score: 0.85

- [ ] [Mastering LLM Techniques: Inference Optimization | NVIDIA Technical Blog](https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/)

Stacking transformer layers to create large models results in better accuracies, few-shot learning capabilities, and even near-human emergent abilities on a wide range of language tasks. These foundation models are expensive to train, and they can be memory- and compute-intensive during inference (a recurring cost). The most popular large language models (LLMs) today can reach tens to hundreds of billions of parameters in size and, depending on the use case, may require ingesting long inputs (or contexts), which can also add expense.

This post discusses the most pressing challenges in LLM inference, along with some practical solutions. Readers should have a basic understanding of transformer architecture and the attention mechanism in general. It is essential to have a grasp of the intricacies of LLM inference, which we will address in the next section.

628: LLaVA/README.md at main · haotian-liu/LLaVA

### Details

Similarity score: 0.84

- [ ] [LLaVA/README.md at main · haotian-liu/LLaVA](https://github.com/haotian-liu/LLaVA/blob/main/README.md?plain=1)

# LLaVA/README.md at main · haotian-liu/LLaVA

## 🌋 LLaVA: Large Language and Vision Assistant

*Visual instruction tuning towards large language and vision models with GPT-4 level capabilities.*

[📢 LLaVA-NeXT Blog](https://llava-vl.github.io/blog/2024-01-30-llava-next/) [Project Page](https://llava-vl.github.io/) [Demo](https://llava.hliu.cc/) [Data](https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md) [Model Zoo](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md)

🤝Community Contributions: [llama.cpp](https://github.com/ggerganov/llama.cpp/pull/3436) [Colab](https://github.com/camenduru/LLaVA-colab) [🤗Space](https://huggingface.co/spaces/badayvedat/LLaVA) [Replicate](https://replicate.com/yorickvp/llava-13b) [AutoGen](https://github.com/microsoft/autogen/blob/main/notebook/agentchat_lmm_llava.ipynb) [BakLLaVA](https://github.com/SkunkworksAI/BakLLaVA)

**Improved Baselines with Visual Instruction Tuning** [Paper](https://arxiv.org/abs/2310.03744) [HF](https://huggingface.co/papers/2310.03744)
Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee

**Visual Instruction Tuning** (NeurIPS 2023, Oral) [Paper](https://arxiv.org/abs/2304.08485) [HF](https://huggingface.co/papers/2304.08485)
Haotian Liu*, Chunyuan Li*, Qingyang Wu, Yong Jae Lee (*Equal Contribution)

## Release

- [1/30] 🔥 LLaVA-NeXT (LLaVA-1.6) is out! With additional scaling to LLaVA-1.5, LLaVA-NeXT-34B outperforms Gemini Pro on some benchmarks. It can now process 4x more pixels and perform more tasks/applications than before. Check out the [blog post](https://llava-vl.github.io/blog/2024-01-30-llava-next/), and explore the [demo](https://llava.hliu.cc/)! Models are available in [Model Zoo](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md). Training/eval data and scripts coming soon.
- [11/10] [LLaVA-Plus](https://llava-vl.github.io/llava-plus/) is released: Learning to Use Tools for Creating Multimodal Agents, with LLaVA-Plus (LLaVA that Plug and Learn to Use Skills). [Project Page](https://llava-vl.github.io/llava-plus/) [Demo](https://llavaplus.ngrok.io/) [Code](https://github.com/LLaVA-VL/LLaVA-Plus-Codebase) [Paper](https://arxiv.org/abs/2311.05437)
- [11/2] [LLaVA-Interactive](https://llava-vl.github.io/llava-interactive/) is released: Experience the future of human-AI multimodal interaction with an all-in-one demo for Image Chat, Segmentation, Generation and Editing. [Project Page](https://llava-vl.github.io/llava-interactive/) [Demo](https://llavainteractive.ngrok.io/) [Code](https://github.com/LLaVA-VL/LLaVA-Interactive-Demo) [Paper](https://arxiv.org/abs/2311.00571)
- [10/26] 🔥 LLaVA-1.5 with LoRA achieves comparable performance as full-model finetuning, with a reduced GPU RAM requirement (ckpts) (script). We also provide a doc on how to finetune LLaVA-1.5 on your own dataset with LoRA.
- [10/12] Check out the Korean LLaVA (Ko-LLaVA), created by ETRI, who has generously supported our research! [🤗 Demo](https://huggingface.co/spaces/etri-vilab/Ko-LLaVA)
- [10/5] 🔥 LLaVA-1.5 is out! Achieving SoTA on 11 benchmarks, with just simple modifications to the original LLaVA, utilizes all public data, completes training in ~1 day on a single 8-A100 node, and surpasses methods like Qwen-VL-Chat that use billion-scale data. Check out the technical report, and explore the demo! Models are available in Model Zoo. The training data and scripts of LLaVA-1.5 are released here, and evaluation scripts are released here.
- [9/26] LLaVA is improved with reinforcement learning from human feedback (RLHF) to improve fact grounding and reduce hallucination. Check out the new SFT and RLHF checkpoints at project LLavA-RLHF.
- [9/22] LLaVA is accepted by NeurIPS 2023 as oral presentation, and LLaVA-Med is accepted by NeurIPS 2023 Datasets and Benchmarks Track as spotlight presentation.

More

- [11/6] Support Intel dGPU and CPU platforms. More details here.
- [10/12] LLaVA is now supported in llama.cpp with 4-bit / 5-bit quantization support!
- [10/11] The training data and scripts of LLaVA-1.5 are released here, and evaluation scripts are released here!
- [10/10] Roboflow Deep Dive: First Impressions with LLaVA-1.5.
- [9/20] We summarize our empirical study of training 33B and 65B LLaVA models in a note. Further, if you are interested in the comprehensive review, evolution and trend of multimodal foundation models, please check out our recent survey paper "Multimodal Foundation Models: From Specialists to General-Purpose Assistants".
- [7/19] We release a major upgrade, including support for LLaMA-2, LoRA training, 4-/8-bit inference, higher resolution (336x336), and a lot more. We release LLaVA Bench for benchmarking open-ended visual chat with results from Bard and Bing-Chat. We also support and verify training with RTX 3090 and RTX A6000. Check out LLaVA-from-LLaMA-2, and our model zoo!
- [6/26] CVPR 2023 Tutorial on Large Multimodal Models: Towards Building and Surpassing Multimodal GPT-4! Please check out Slides Notes YouTube Bilibli.
- [6/11] We released the preview for the most requested feature: DeepSpeed and LoRA support! Please see documentations here.
- [6/1] We released LLaVA-Med: Large Language and Vision Assistant for Biomedicine, a step towards building biomedical domain large language and vision models with GPT-4 level capabilities. Checkout the paper and page.
- [5/6] We are releasing LLaVA-Lighting-MPT-7B-preview, based on MPT-7B-Chat! See here for more details.
- [5/2] We are releasing LLaVA-Lighting! Train a lite, multimodal GPT-4 with just $40 in 3 hours! See here for more details.
- [4/27] Thanks to the community effort, LLaVA-13B with 4-bit quantization allows you to run on a GPU with as few as 12GB VRAM! Try it out here.
- [4/17] We released LLaVA: Large Language and Vision Assistant. We propose visual instruction tuning, towards building large language and vision models with GPT-4 level capabilities. Checkout the paper and demo.

[Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-green.svg)

**Usage and License Notices**: This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to the OpenAI Terms of Use for the dataset and the specific licenses for base language models for checkpoints trained using the dataset (e.g. Llama community license for LLaMA-2 and Vicuna-v1.5). This project does not impose any additional constraints beyond those stipulated in the original licenses. Furthermore, users are reminded to ensure that their use of the dataset and checkpoints is in compliance with all applicable laws and regulations.

## Contents

- [Install](#install)
- [LLaVA Weights](#llava-weights)
- [Demo](#Demo)
- [Model Zoo](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md)
- [Dataset](https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md)
- [Train](#train)
- [Evaluation](#evaluation)

#### Suggested labels

332: streaming-llm: Efficient Streaming Language Models with Attention Sinks

### Details

Similarity score: 0.84

> **Note: Efficient Streaming Language Models with Attention Sinks**
>
> [mit-han-lab/streaming-llm: Efficient Streaming Language Models with Attention Sinks](https://github.com/mit-han-lab/streaming-llm)
>
> **TL;DR**
>
> We deploy LLMs for infinite-length inputs without sacrificing efficiency and performance.
>
> **News**
>
> - [2023/10] StreamingLLM is integrated into Intel Extension for Transformers.
> - [2023/10] Check out Attention Sinks, a third-party implementation to enable StreamingLLM on more Huggingface LLMs.
>
> **Abstract**
>
> Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly, during the decoding stage, caching previous tokens' Key and Value states (KV) consumes extensive memory. Secondly, popular LLMs cannot generalize to longer texts than the training sequence length. Window attention, where only the most recent KVs are cached, is a natural approach --- but we show that it fails when the text length surpasses the cache size. We observe an interesting phenomenon, namely attention sink, that keeping the KV of initial tokens will largely recover the performance of window attention. In this paper, we first demonstrate that the emergence of attention sink is due to the strong attention scores towards initial tokens as a "sink" even if they are not semantically important. Based on the above analysis, we introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite length attention window to generalize to infinite sequence length without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more. In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment. In streaming settings, StreamingLLM outperforms the sliding window recomputation baseline by up to 22.2x speedup.
>
> **Usage**
>
> **Environment Setup**
>
> ```
> conda create -yn streaming python=3.8
> conda activate streaming
>
> pip install torch torchvision torchaudio
> pip install transformers==4.33.0 accelerate datasets evaluate wandb scikit-learn scipy sentencepiece
>
> python setup.py develop
> ```
>
> **Run Streaming Llama Chatbot**
>
> ```
> CUDA_VISIBLE_DEVICES=0 python examples/run_streaming_llama.py --enable_streaming
> ```
>
> **FAQ**
>
> **What does "working on infinite-length inputs" imply for LLMs?**
>
> Handling infinite-length text with LLMs presents challenges. Notably, storing all previous Key and Value (KV) states demands significant memory, and models might struggle to generate text beyond their training sequence length. StreamingLLM addresses this by retaining only the most recent tokens and attention sinks, discarding intermediate tokens. This enables the model to generate coherent text from recent tokens without a cache reset — a capability not seen in earlier methods.
>
> **Is the context window of LLMs expanded?**
>
> No. The context window remains unchanged. Only the most recent tokens and attention sinks are retained, discarding middle tokens. This means the model can only process the latest tokens. The context window remains constrained by its initial pre-training. For instance, if Llama-2 is pre-trained with a context window of 4096 tokens, then the maximum cache size for StreamingLLM on Llama-2 remains 4096.
>
> **Can I input an extensive text, like a book, into StreamingLLM for summarization?**
>
> While you can input a lengthy text, the model will only recognize the latest tokens.
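The retention rule described in the FAQ (keep the attention-sink tokens at the start plus the most recent tokens, drop everything in between) can be sketched as a small index-selection function. This is an illustration of the idea only, not the repository's code; the function names and the `n_sink` and `window` parameters are hypothetical.

```python
# Which token positions keep their KV entries under each eviction policy.
# Hypothetical helper names and parameters, for illustration only.

def window_only_indices(seq_len, window):
    """Plain window attention: keep only the most recent `window` tokens."""
    return list(range(max(0, seq_len - window), seq_len))


def streaming_keep_indices(seq_len, n_sink=4, window=8):
    """Attention sinks: keep the first n_sink tokens plus the most recent `window`."""
    if seq_len <= n_sink + window:
        return list(range(seq_len))
    return list(range(n_sink)) + list(range(seq_len - window, seq_len))


print(window_only_indices(20, window=3))               # [17, 18, 19]
print(streaming_keep_indices(20, n_sink=2, window=3))  # [0, 1, 17, 18, 19]
print(streaming_keep_indices(4, n_sink=2, window=3))   # [0, 1, 2, 3]
```

The cache size is bounded by `n_sink + window` regardless of how long the stream runs, which is why the middle of a long document is no longer visible to the model.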