S-LoRA: Serving Thousands of Models From One GPU for Fun and Profit - OpenPipe

Related issues

505: LoRAX: Dynamic loading and optimized inference of LoRA adapter models.

### Details

Similarity score: 0.92 - [ ] [LoRAX Docs](https://predibase.github.io/lorax/?h=cpu#features) # LoRAX Docs ###### Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs 📖 **What is LoRAX?** LoRAX (LoRA eXchange) is a framework that allows users to serve thousands of fine-tuned models on a single GPU, dramatically reducing the cost of serving without compromising on throughput or latency. 🌳 **Features** - 🚅 Dynamic Adapter Loading: include any fine-tuned LoRA adapter in your request, it will be loaded just-in-time without blocking concurrent requests. - 🏋️‍♀️ Heterogeneous Continuous Batching: packs requests for different adapters together into the same batch, keeping latency and throughput nearly constant with the number of concurrent adapters. - 🧁 Adapter Exchange Scheduling: asynchronously prefetches and offloads adapters between GPU and CPU memory, schedules request batching to optimize the aggregate throughput of the system. - 👬 Optimized Inference: high throughput and low latency optimizations including tensor parallelism, pre-compiled CUDA kernels (flash-attention, paged attention, SGMV), quantization, token streaming. - 🚢 Ready for Production prebuilt Docker images, Helm charts for Kubernetes, Prometheus metrics, and distributed tracing with Open Telemetry. OpenAI compatible API supporting multi-turn chat conversations. Private adapters through per-request tenant isolation. - 🤯 Free for Commercial Use: Apache 2.0 License. Enough said 😎. URL: [https://predibase.github.io/lorax/?h=cpu#features](https://predibase.github.io/lorax/?h=cpu#features) #### Suggested labels #### { "label-name": "LoRA Framework", "description": "A powerful framework for serving fine-tuned models on a single GPU efficiently.", "repo": "llm-inference-engines", "confidence": 98.7 }

408: llama.cpp/examples/llama-bench/README.md at master · ggerganov/llama.cpp

### Details

Similarity score: 0.85 - [ ] [llama.cpp/examples/llama-bench/README.md at master · ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp/blob/master/examples/llama-bench/README.md) ### Llama Benchmarking Tool This is a performance testing tool for llama.cpp. It allows you to test the performance of the library with different models, prompt processing batch sizes, number of threads, number of layers offloaded to the GPU, and output formats. #### Table of Contents - [Syntax](#syntax) - [Examples](#examples) - [Text generation with different models](#text-generation-with-different-models) - [Prompt processing with different batch sizes](#prompt-processing-with-different-batch-sizes) - [Different numbers of threads](#different-numbers-of-threads) - [Different numbers of layers offloaded to the GPU](#different-numbers-of-layers-offloaded-to-the-gpu) - [Output formats](#output-formats) #### Syntax ``` usage: ./llama-bench [options] options: -h, --help Show this help message and exit -m, --model (default: models/7B/ggml-model-q4_0.gguf) -p, --n-prompt (default: 512) -n, --n-gen (default: 128) -b, --batch-size (default: 512) --memory-f32 <0|1> (default: 0) -t, --threads (default: 16) -ngl N, --n-gpu-layers (default: 99) -mg i, --main-gpu (default: 0) -mmq, --mul-mat-q <0|1> (default: 1) -ts, --tensor_split -r, --repetitions (default: 5) -o, --output (default: md) -v, --verbose (default: 0) ``` Multiple values can be given for each parameter by separating them with `,` or by specifying the parameter multiple times. #### Examples * Testing the performance of the model with default settings: ``` ./llama-bench ``` * Testing the performance of the model with a specific batch size: ``` ./llama-bench -b 1024 ``` * Testing the performance of the model with a specific model file: ``` ./llama-bench -m models/7B/ggml-model-q4_1.gguf ``` * Testing the performance of the model with a specific number of prompt and generated tokens: ``` ./llama-bench -p 1024 -n 2048 ``` * Testing the performance of the model with a specific number of threads: ``` ./llama-bench -t 8 ``` * Testing the performance of the model with a specific number of layers offloaded to the GPU: ``` ./llama-bench -ngl 64 ``` * Testing the performance of the model with a specific output format: ``` ./llama-bench -o json ``` #### Text generation with different models You can test the performance of the library with different models by specifying the model file using the `-m` or `--model` option. #### Prompt processing with different batch sizes You can test the performance of the library with different batch sizes by specifying the batch size using the `-b` or `--batch-size` option. #### Different numbers of threads You can test the performance of the library with different number of threads by specifying the number of threads using the `-t` or `--threads` option. #### Different numbers of layers offloaded to the GPU You can test the performance of the library with different number of layers offloaded to the GPU by specifying the number of GPU layers using the `-ngl` or `--n-gpu-layers` option. #### Output formats The benchmarking tool supports the following output formats: - Markdown (`md`) - CSV (`csv`) - JSON (`json`) - SQL (`sql`) You can specify the output format using the `-o` or `--output` option. #### Suggested labels ####

457: I keep running out of memory. What's the biggest model, and most context, I can run on 3060 12gb? With decent speed? : r/LocalLLaMA

### Details
Similarity score: 0.85 - [ ] [I keep running out of memory. What's the biggest model, and most context, I can run on 3060 12gb? With decent speed? : r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/1abihou/i_keep_running_out_of_memory_whats_the_biggest/) Here's the reformatted text in Markdown format: ```markdown # GPU and Model Recommendations **GPU Only:** - You can use 7B Models at 8 bpw with 8K context, or maybe up to 12k context. - If you wish to use 13B models, then you have to use 4bpw and limit yourself to 2K Context. **GPU + CPU:** - Use `.gguf` files to offload part of the model to VRAM. - Check the disk usage when inferencing in the activity monitor app (or whatever it is called in your OS). If the disk usage is 100% (disk is swapping), then it is impossible to fit the model in RAM + VRAM and tokens per second will be very low. - In that case, reduce context size and reduce bpw. The best models you can probably run now are: - OpenChat 3.5 7B at 8bpw (Use the latest version) - at 4bpw and 4K context. - If you want to run Nous-Capybara-34b, switch to the 3bpw version and try to offload 35 layers to GPU. If you want to run bigger models, upgrade RAM to 64GB. **Tip from /u/Working-Flatworm-531:** - Just do not load kv in VRAM, you can use `ooba` to disable it. - Also try lower quants, for example Q4_K_S is still good. You still wouldn't be able to run 34B models with good speed, but at least it's something. - You can also check your BIOS and maybe increase RAM frequency. After that, you'd be able to run ~20B models at ~2t/s at 8k ~ 12k context. **Recommended List from /u/Working-Flatworm-531:** - Use Linux. - Overclock RAM (if possible). - Overclock CPU (if possible). - Overclock GPU. - Don't load kv cache in VRAM, instead load more layers to the VRAM. - Use smaller quants. - Use fast interface (didn't try Kobold, use Ooba). - Check RAM (should be dual channel). ``` #### Suggested labels #### { "label-name": "memory-optimization", "description": "Strategies for optimizing memory usage when running models on 3060 12gb GPU.", "confidence": 94.88 }

153: Mastering LLM Techniques: Inference Optimization | NVIDIA Technical Blog

### Details
Similarity score: 0.85 - [ ] [Mastering LLM Techniques: Inference Optimization | NVIDIA Technical Blog](https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/) Stacking transformer layers to create large models results in better accuracies, few-shot learning capabilities, and even near-human emergent abilities on a wide range of language tasks. These foundation models are expensive to train, and they can be memory- and compute-intensive during inference (a recurring cost). The most popular large language models (LLMs) today can reach tens to hundreds of billions of parameters in size and, depending on the use case, may require ingesting long inputs (or contexts), which can also add expense. This post discusses the most pressing challenges in LLM inference, along with some practical solutions. Readers should have a basic understanding of transformer architecture and the attention mechanism in general. It is essential to have a grasp of the intricacies of LLM inference, which we will address in the next section.

628: LLaVA/README.md at main · haotian-liu/LLaVA

### Details
Similarity score: 0.84 - [ ] [LLaVA/README.md at main · haotian-liu/LLaVA](https://github.com/haotian-liu/LLaVA/blob/main/README.md?plain=1) # LLaVA/README.md at main · haotian-liu/LLaVA ## 🌋 LLaVA: Large Language and Vision Assistant *Visual instruction tuning towards large language and vision models with GPT-4 level capabilities.* [📢 LLaVA-NeXT Blog](https://llava-vl.github.io/blog/2024-01-30-llava-next/) [Project Page](https://llava-vl.github.io/) [Demo](https://llava.hliu.cc/) [Data](https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md) [Model Zoo](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md) 🤝Community Contributions: [llama.cpp](https://github.com/ggerganov/llama.cpp/pull/3436) [Colab](https://github.com/camenduru/LLaVA-colab) [🤗Space](https://huggingface.co/spaces/badayvedat/LLaVA) [Replicate](https://replicate.com/yorickvp/llava-13b) [AutoGen](https://github.com/microsoft/autogen/blob/main/notebook/agentchat_lmm_llava.ipynb) [BakLLaVA](https://github.com/SkunkworksAI/BakLLaVA) **Improved Baselines with Visual Instruction Tuning** [Paper](https://arxiv.org/abs/2310.03744) [HF](https://huggingface.co/papers/2310.03744)
Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee **Visual Instruction Tuning** (NeurIPS 2023, Oral) [Paper](https://arxiv.org/abs/2304.08485) [HF](https://huggingface.co/papers/2304.08485)
Haotian Liu*, Chunyuan Li*, Qingyang Wu, Yong Jae Lee (*Equal Contribution) ## Release - [1/30] 🔥 LLaVA-NeXT (LLaVA-1.6) is out! With additional scaling to LLaVA-1.5, LLaVA-NeXT-34B outperforms Gemini Pro on some benchmarks. It can now process 4x more pixels and perform more tasks/applications than before. Check out the [blog post](https://llava-vl.github.io/blog/2024-01-30-llava-next/), and explore the [demo](https://llava.hliu.cc/)! Models are available in [Model Zoo](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md). Training/eval data and scripts coming soon. - [11/10] [LLaVA-Plus](https://llava-vl.github.io/llava-plus/) is released: Learning to Use Tools for Creating Multimodal Agents, with LLaVA-Plus (LLaVA that Plug and Learn to Use Skills). [Project Page](https://llava-vl.github.io/llava-plus/) [Demo](https://llavaplus.ngrok.io/) [Code](https://github.com/LLaVA-VL/LLaVA-Plus-Codebase) [Paper](https://arxiv.org/abs/2311.05437) - [11/2] [LLaVA-Interactive](https://llava-vl.github.io/llava-interactive/) is released: Experience the future of human-AI multimodal interaction with an all-in-one demo for Image Chat, Segmentation, Generation and Editing. [Project Page](https://llava-vl.github.io/llava-interactive/) [Demo](https://llavainteractive.ngrok.io/) [Code](https://github.com/LLaVA-VL/LLaVA-Interactive-Demo) [Paper](https://arxiv.org/abs/2311.00571) - [10/26] 🔥 LLaVA-1.5 with LoRA achieves comparable performance as full-model finetuning, with a reduced GPU RAM requirement (ckpts) (script). We also provide a doc on how to finetune LLaVA-1.5 on your own dataset with LoRA. - [10/12] Check out the Korean LLaVA (Ko-LLaVA), created by ETRI, who has generously supported our research! [🤗 Demo](https://huggingface.co/spaces/etri-vilab/Ko-LLaVA) - [10/5] 🔥 LLaVA-1.5 is out! Achieving SoTA on 11 benchmarks, with just simple modifications to the original LLaVA, utilizes all public data, completes training in ~1 day on a single 8-A100 node, and surpasses methods like Qwen-VL-Chat that use billion-scale data. Check out the technical report, and explore the demo! Models are available in Model Zoo. The training data and scripts of LLaVA-1.5 are released here, and evaluation scripts are released here. - [9/26] LLaVA is improved with reinforcement learning from human feedback (RLHF) to improve fact grounding and reduce hallucination. Check out the new SFT and RLHF checkpoints at project LLavA-RLHF. - [9/22] LLaVA is accepted by NeurIPS 2023 as oral presentation, and LLaVA-Med is accepted by NeurIPS 2023 Datasets and Benchmarks Track as spotlight presentation.

More
- [11/6] Support Intel dGPU and CPU platforms. More details here. - [10/12] LLaVA is now supported in llama.cpp with 4-bit / 5-bit quantization support! - [10/11] The training data and scripts of LLaVA-1.5 are released here, and evaluation scripts are released here! - [10/10] Roboflow Deep Dive: First Impressions with LLaVA-1.5. - [9/20] We summarize our empirical study of training 33B and 65B LLaVA models in a note. Further, if you are interested in the comprehensive review, evolution and trend of multimodal foundation models, please check out our recent survey paper "Multimodal Foundation Models: From Specialists to General-Purpose Assistants".

- [7/19] We release a major upgrade, including support for LLaMA-2, LoRA training, 4-/8-bit inference, higher resolution (336x336), and a lot more. We release LLaVA Bench for benchmarking open-ended visual chat with results from Bard and Bing-Chat. We also support and verify training with RTX 3090 and RTX A6000. Check out LLaVA-from-LLaMA-2, and our model zoo! - [6/26] CVPR 2023 Tutorial on Large Multimodal Models: Towards Building and Surpassing Multimodal GPT-4! Please check out Slides Notes YouTube Bilibli. - [6/11] We released the preview for the most requested feature: DeepSpeed and LoRA support! Please see documentations here. - [6/1] We released LLaVA-Med: Large Language and Vision Assistant for Biomedicine, a step towards building biomedical domain large language and vision models with GPT-4 level capabilities. Checkout the paper and page. - [5/6] We are releasing LLaVA-Lighting-MPT-7B-preview, based on MPT-7B-Chat! See here for more details. - [5/2] We are releasing LLaVA-Lighting! Train a lite, multimodal GPT-4 with just $40 in 3 hours! See here for more details. - [4/27] Thanks to the community effort, LLaVA-13B with 4-bit quantization allows you to run on a GPU with as few as 12GB VRAM! Try it out here. - [4/17] We released LLaVA: Large Language and Vision Assistant. We propose visual instruction tuning, towards building large language and vision models with GPT-4 level capabilities. Checkout the paper and demo.
[Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-green.svg) **Usage and License Notices**: This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to the OpenAI Terms of Use for the dataset and the specific licenses for base language models for checkpoints trained using the dataset (e.g. Llama community license for LLaMA-2 and Vicuna-v1.5). This project does not impose any additional constraints beyond those stipulated in the original licenses. Furthermore, users are reminded to ensure that their use of the dataset and checkpoints is in compliance with all applicable laws and regulations. ## Contents - [Install](#install) - [LLaVA Weights](#llava-weights) - [Demo](#Demo) - [Model Zoo](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md) - [Dataset](https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md) - [Train](#train) - [Evaluation](#evaluation) #### Suggested labels ####

332: streaming-llm: Efficient Streaming Language Models with Attention Sinks

### Details
Similarity score: 0.84 > **Note: Efficient Streaming Language Models with Attention Sinks** > > [mit-han-lab/streaming-llm: Efficient Streaming Language Models with Attention Sinks](https://github.com/mit-han-lab/streaming-llm) > > **TL;DR** > > We deploy LLMs for infinite-length inputs without sacrificing efficiency and performance. > > **News** > > - [2023/10] StreamingLLM is integrated into Intel Extension for Transformers. > - [2023/10] Check out Attention Sinks, a third-party implementation to enable StreamingLLM on more Huggingface LLMs. > > **Abstract** > > Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly, during the decoding stage, caching previous tokens' Key and Value states (KV) consumes extensive memory. Secondly, popular LLMs cannot generalize to longer texts than the training sequence length. Window attention, where only the most recent KVs are cached, is a natural approach --- but we show that it fails when the text length surpasses the cache size. We observe an interesting phenomenon, namely attention sink, that keeping the KV of initial tokens will largely recover the performance of window attention. In this paper, we first demonstrate that the emergence of attention sink is due to the strong attention scores towards initial tokens as a "sink" even if they are not semantically important. Based on the above analysis, we introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite length attention window to generalize to infinite sequence length without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more. In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment. In streaming settings, StreamingLLM outperforms the sliding window recomputation baseline by up to 22.2x speedup. > > **Usage** > > **Environment Setup** > > ``` > conda create -yn streaming python=3.8 > conda activate streaming > > pip install torch torchvision torchaudio > pip install transformers==4.33.0 accelerate datasets evaluate wandb scikit-learn scipy sentencepiece > > python setup.py develop > ``` > > **Run Streaming Llama Chatbot** > > ``` > CUDA_VISIBLE_DEVICES=0 python examples/run_streaming_llama.py --enable_streaming > ``` > > **FAQ** > > **What does "working on infinite-length inputs" imply for LLMs?** > > Handling infinite-length text with LLMs presents challenges. Notably, storing all previous Key and Value (KV) states demands significant memory, and models might struggle to generate text beyond their training sequence length. StreamingLLM addresses this by retaining only the most recent tokens and attention sinks, discarding intermediate tokens. This enables the model to generate coherent text from recent tokens without a cache reset — a capability not seen in earlier methods. > > **Is the context window of LLMs expanded?** > > No. The context window remains unchanged. Only the most recent tokens and attention sinks are retained, discarding middle tokens. This means the model can only process the latest tokens. The context window remains constrained by its initial pre-training. For instance, if Llama-2 is pre-trained with a context window of 4096 tokens, then the maximum cache size for StreamingLLM on Llama-2 remains 4096. > > **Can I input an extensive text, like a book, into StreamingLLM for summarization?** > > While you can input a lengthy text, the model will only recognize the latest tokens.

irthomasthomas / undecidability