irthomasthomas / undecidability


MultiAgentLLM: a faithful recreation of the "Small LLMs Are Weak Tool Learners: A Multi-LLM Agent" research paper #681

Open irthomasthomas opened 2 months ago

irthomasthomas commented 2 months ago

RichardAragon/MultiAgentLLM

DESCRIPTION: "Multi Agent Language Learning Machine (Multi Agent LLM)

(Update) 1/20/2024: Global Fine Tuned Phi Model Located Here: PhiGlobalFineTunedAgent

I can prepare the second fine-tunes for the individual agent actors (Planner, Caller, Observer), but I cannot complete the fine-tuning process and upload the completed models to HuggingFace due to compute limits. I need more GPUs :(

(Update) 1/19/2024: All datasets available at HuggingFace Repo: TuringsSolutions

Introduction

Welcome to the official repository for Multi Agent LLM, a faithful recreation of the framework from the "Small LLMs Are Weak Tool Learners: A Multi-LLM Agent" research paper, released under the MIT Open Source License. Our aim is to fine-tune distinct Planner, Caller, and Summarizer agents capable of completing complex tasks efficiently, using three TinyLlama models. Datasets from the research paper and from the Gorilla release will be used to create further synthetic datasets for training the three distinct models.

Getting Started

Prerequisites

Installation

  1. Clone the repository
    git clone https://github.com/<your-username>/multiagentllm.git
    cd multiagentllm
  2. Set up virtual environment and activate it
    python3 -m venv env
    source env/bin/activate  # On Windows, run .\env\Scripts\activate
  3. Install dependencies
    pip install -r requirements.txt

Multi Agent LLM Methodology Overview

Our project introduces the Multi Agent LLM framework, built around Large Language Models (LLMs) to handle complex tasks involving tool usage and decision-making. The framework draws inspiration from the ReAct framework (Yao et al., 2022) and aims to address the challenges that single-LLM solutions face on tool-learning tasks.

Three main modules constitute the Multi Agent LLM architecture: Planner (M_plan), Caller (M_call), and Summarizer (M_sum). By dividing labor amongst the agents and dedicating a specific LLM for each sub-task, we enhance the overall effectiveness of tackling complex problems involving task planning, tool selection, and result summarization.

The α-UMi framework workflow begins when the user submits a query. The Planner module produces a rationale that guides the upcoming step. Based on that rationale, control passes to the Caller, which interacts with the tools and collects observations. Once enough information has been gathered, the Planner hands control to the Summarizer module, which composes the final response for the user. If, instead, the instruction cannot be resolved, the system gives up.

Each module plays a distinctive role:
  • Planner (M_plan): generates a rationale at each step and decides whether control passes to the Caller, to the Summarizer, or whether to give up.
  • Caller (M_call): interacts with external tools according to the rationale and returns observations.
  • Summarizer (M_sum): composes the final answer for the user once enough information has been gathered.
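To make this division of labor concrete, here is a minimal control-loop sketch in Python. The planner/caller/summarizer objects, their plan(), call(), and summarize() methods, and the "Caller"/"Summarizer"/"Give up" routing labels are illustrative assumptions, not the repository's actual API.

```python
# A minimal sketch of the α-UMi control loop described above. All interfaces
# and routing labels here are assumptions made for illustration.
def run_agent(query: str, planner, caller, summarizer, max_steps: int = 8) -> str:
    trajectory = f"User instruction: {query}\n"
    for _ in range(max_steps):
        # Planner: produce a rationale and decide who acts next.
        rationale, next_role = planner.plan(trajectory)
        trajectory += f"Rationale: {rationale}\n"
        if next_role == "Caller":
            # Caller: pick a tool, invoke it, and record the observation.
            action, observation = caller.call(trajectory)
            trajectory += f"Action: {action}\nObservation: {observation}\n"
        elif next_role == "Summarizer":
            # Summarizer: compose the final user-facing answer.
            return summarizer.summarize(trajectory)
        else:  # "Give up": the planner deems the instruction unsolvable.
            break
    return "Sorry, this request could not be completed."
```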

Global-to-Local Progressive Fine-Tuning Strategy

We propose a novel two-stage fine-tuning technique, Global-to-Local Progressive Fine-Tuning (GLPFT), applied to the α-UMi framework modules. GLPFT ensures effective fine-tuning adapted to each specific role. First, a shared base LLM undergoes global fine-tuning on a large generic dataset. Specialization then occurs during local fine-tuning on data subsets aligned with the roles and duties of the dedicated modules. Additional details concerning data organization and prompt adaptation appear in Appendix A.
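As a rough illustration, the sketch below first fine-tunes a shared backbone on whole trajectories (global stage) and then fine-tunes three role-specific copies starting from that backbone (local stage). The checkpoint name, toy data, and hyperparameters are assumptions made for illustration only; they are not the settings used in the paper or this repository.

```python
# A hedged sketch of the GLPFT two-stage flow using the Hugging Face Trainer.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumed shared backbone checkpoint

def finetune(model_path: str, texts: list[str], out_dir: str) -> str:
    """Plain causal-LM fine-tuning on a list of flat text examples."""
    tok = AutoTokenizer.from_pretrained(model_path)
    tok.pad_token = tok.pad_token or tok.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_path)
    ds = Dataset.from_dict({"text": texts}).map(
        lambda batch: tok(batch["text"], truncation=True, max_length=1024),
        batched=True)
    Trainer(model=model,
            args=TrainingArguments(output_dir=out_dir, num_train_epochs=1,
                                   per_device_train_batch_size=1),
            train_dataset=ds,
            data_collator=DataCollatorForLanguageModeling(tok, mlm=False)).train()
    model.save_pretrained(out_dir)
    tok.save_pretrained(out_dir)
    return out_dir

# Stage 1: global fine-tuning of the shared backbone on whole trajectories.
backbone = finetune(BASE, ["<toy trajectory 1>", "<toy trajectory 2>"],
                    "backbone-global")

# Stage 2: local fine-tuning -- each specialist starts from the global backbone.
role_data = {"planner": ["<planner rows>"], "caller": ["<caller rows>"],
             "summarizer": ["<summarizer rows>"]}
for role, texts in role_data.items():
    finetune(backbone, texts, f"{role}-local")
```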

Here are the steps to create the global fine-tuning dataset for training the backbone LLM:

  1. Collect execution trajectories: Gather historical conversations with sequences of rationale, actions, observations, and answers.
  2. Keep trajectories intact: Do not break down or segment the trajectories. Keep each trajectory as one long sequence.
  3. Format each trajectory:
    • User instruction
    • Rationale 1: [Text] Action 1: [Text] Observation 1: [Text]
    • Rationale 2: [Text] Action 2: [Text] Observation 2: [Text]
    • ...
    • Answer: [Text]
  4. Duplicate user instructions: Have multiple trajectories for the same user instruction, but with different action sequences and answers.
  5. Prompts: Use a simple prompt that provides the user instruction and asks the model to generate the rationale, action, observation, answer sequence.
  6. Target output: The entire trajectory sequence from rationale 1 to final answer.

In summary: keep full trajectories together as one long target text, provide multiple variants per user instruction, and do not differentiate between sub-tasks at this stage. A short formatting sketch follows below.
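As a concrete illustration of steps 3 and 6, the snippet below flattens one trajectory into a single target string. The dictionary keys are assumed field names, not the actual schema of the published datasets.

```python
# Flatten one execution trajectory into a single global fine-tuning target.
def format_global_example(instruction: str, steps: list[dict], answer: str) -> str:
    lines = [instruction]
    for i, step in enumerate(steps, start=1):
        lines.append(f"Rationale {i}: {step['rationale']} "
                     f"Action {i}: {step['action']} "
                     f"Observation {i}: {step['observation']}")
    lines.append(f"Answer: {answer}")
    return "\n".join(lines)


print(format_global_example(
    "Book me a flight from New York to Los Angeles next Tuesday",
    [{"rationale": "Check flight availability between the given cities first.",
      "action": "invoke_api_flight_search",
      "observation": '{"flights": [...]}'}],
    "I have booked your flight.",
))
```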

Here is a methodology you can follow to create the training data needed for the Planner agent, based on the approach outlined in the research paper:

  1. Collect execution trajectories: Gather a set of historical conversations where tools were used to solve problems. These trajectories should show the full sequence of rationale, actions, observations, and answers.
  2. Segment trajectories: Break down each execution trajectory into individual steps. Extract the rationale, action, observation, and answer for each step.
  3. Global fine-tuning data: For the first stage of training, do not differentiate between sub-tasks. Keep the rationale, action, and answer together in sequence for each step. This data will be used for global fine-tuning of the backbone LLM to give it a comprehensive understanding.
  4. Local fine-tuning data (Planner): For the second stage of training, extract only the rationale from each step (a small extraction sketch follows this list). The format should be:
    • Input: previous trajectory — Target: Rationale: [Planner rationale] Next: Caller/Summarizer/Give up
  5. Duplicate training instructions: Re-use the same training instructions from the global fine-tuning stage. But change the format of the target output to be rationale-only.
  6. Tailor prompts: Use a prompt for the Planner that focuses its task on generating the next rationale and high-level plan, as outlined in Appendix A.1 in the paper.
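The extraction sketch referenced in step 4 could look like the following. The field names and the rule for choosing the "Next:" label are assumptions that mirror the format described above; a "Give up" branch for unsolvable instructions is omitted for brevity.

```python
# Derive Planner-only training rows from a full trajectory.
def planner_rows(instruction: str, steps: list[dict]) -> list[dict]:
    """Each step carries a 'rationale' and, except for the final summarizing
    step, an 'action' and an 'observation'."""
    rows, history = [], f"{instruction}\n"
    for step in steps:
        next_role = "Caller" if "action" in step else "Summarizer"
        rows.append({"input": history,
                     "target": f"Rationale: {step['rationale']} Next: {next_role}"})
        if "action" in step:
            history += (f"Rationale: {step['rationale']} "
                        f"Action: {step['action']} "
                        f"Observation: {step['observation']}\n")
    return rows
```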

Here is an example row from the global fine-tuning dataset:

Book me a flight from New York to Los Angeles next Tuesday

Rationale 1: We need to first check flight availability between the given cities on the requested date before booking.
Action 1: invoke_api_flight_search Observation 1: { "flights": [{"airline": "Delta", "departure_time": "9:00am", "arrival_time": "12:00pm"}, {"airline": "United", "departure_time": "10:30am", "arrival_time": "1:30pm"}]}
Rationale 2: There are two flight options on Tuesday from New York to Los Angeles. Let's select the cheaper Delta flight that departs at 9am and arrives at noon.
Action 2: invoke_api_flight_booking Observation 2: {"booking_confirmed": true, "pnr": "ABC123"}
Answer: I have successfully booked you on the 9am Delta flight from New York to Los Angeles next Tuesday. Your booking reference number is ABC123. Please let me know if you need any other details about your flight.

So in this example, the full trajectory from the initial user query to the final flight booking is kept intact as one sequence, which the model will be trained to generate end-to-end.

Here is an example training-set row for fine-tuning the Planner specifically:

Book me the cheapest flight from San Francisco to Seattle tomorrow

Rationale 1: We first need to check flight prices for the given route to find the cheapest option. Next: Caller
Rationale 2: The Delta flight for $200 seems to be the cheapest choice. Let's book that. Next: Caller
Rationale 3: Booking confirmed. We have completed the user request. Next: Summarizer

In this example, only the Planner's rationales are kept as the target text. The actions, observations, and answers are removed. The format is updated so that "Next: Caller/Summarizer" is appended to each rationale, and the prompt provides some context about the user's flight-booking request.

This structures the data specifically for fine-tuning the Planner's ability to generate helpful rationales and decide the next high-level step.
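At inference time, the "Next:" suffix has to be parsed so that control can be routed to the Caller or the Summarizer. A possible parser (an assumption, not code from the repository):

```python
# Parse the Planner's output into (rationale, next_role) for routing.
import re

def parse_planner_output(text: str) -> tuple[str, str]:
    s = text.strip()
    match = re.search(r"Next:\s*(Caller|Summarizer|Give up)\s*$", s)
    if match is None:
        return s, "Give up"  # fail closed on malformed output
    return s[:match.start()].strip(), match.group(1)

print(parse_planner_output(
    "Rationale 3: Booking confirmed. We have completed the user request. "
    "Next: Summarizer"))
# -> ('Rationale 3: Booking confirmed. We have completed the user request.', 'Summarizer')
```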

References

This project is distributed under the terms of the MIT Open Source License. Contributions are welcome! Please feel free to submit Pull Requests and report issues. Refer to CONTRIBUTING for guidelines."

Suggested labels

irthomasthomas commented 2 months ago

Related issues

628: LLaVA/README.md at main · haotian-liu/LLaVA

### DetailsSimilarity score: 0.89 - [ ] [LLaVA/README.md at main · haotian-liu/LLaVA](https://github.com/haotian-liu/LLaVA/blob/main/README.md?plain=1) # LLaVA/README.md at main · haotian-liu/LLaVA ## 🌋 LLaVA: Large Language and Vision Assistant *Visual instruction tuning towards large language and vision models with GPT-4 level capabilities.* [📢 LLaVA-NeXT Blog](https://llava-vl.github.io/blog/2024-01-30-llava-next/) [Project Page](https://llava-vl.github.io/) [Demo](https://llava.hliu.cc/) [Data](https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md) [Model Zoo](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md) 🤝Community Contributions: [llama.cpp](https://github.com/ggerganov/llama.cpp/pull/3436) [Colab](https://github.com/camenduru/LLaVA-colab) [🤗Space](https://huggingface.co/spaces/badayvedat/LLaVA) [Replicate](https://replicate.com/yorickvp/llava-13b) [AutoGen](https://github.com/microsoft/autogen/blob/main/notebook/agentchat_lmm_llava.ipynb) [BakLLaVA](https://github.com/SkunkworksAI/BakLLaVA) **Improved Baselines with Visual Instruction Tuning** [Paper](https://arxiv.org/abs/2310.03744) [HF](https://huggingface.co/papers/2310.03744)
Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee **Visual Instruction Tuning** (NeurIPS 2023, Oral) [Paper](https://arxiv.org/abs/2304.08485) [HF](https://huggingface.co/papers/2304.08485)
Haotian Liu*, Chunyuan Li*, Qingyang Wu, Yong Jae Lee (*Equal Contribution) ## Release - [1/30] 🔥 LLaVA-NeXT (LLaVA-1.6) is out! With additional scaling to LLaVA-1.5, LLaVA-NeXT-34B outperforms Gemini Pro on some benchmarks. It can now process 4x more pixels and perform more tasks/applications than before. Check out the [blog post](https://llava-vl.github.io/blog/2024-01-30-llava-next/), and explore the [demo](https://llava.hliu.cc/)! Models are available in [Model Zoo](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md). Training/eval data and scripts coming soon. - [11/10] [LLaVA-Plus](https://llava-vl.github.io/llava-plus/) is released: Learning to Use Tools for Creating Multimodal Agents, with LLaVA-Plus (LLaVA that Plug and Learn to Use Skills). [Project Page](https://llava-vl.github.io/llava-plus/) [Demo](https://llavaplus.ngrok.io/) [Code](https://github.com/LLaVA-VL/LLaVA-Plus-Codebase) [Paper](https://arxiv.org/abs/2311.05437) - [11/2] [LLaVA-Interactive](https://llava-vl.github.io/llava-interactive/) is released: Experience the future of human-AI multimodal interaction with an all-in-one demo for Image Chat, Segmentation, Generation and Editing. [Project Page](https://llava-vl.github.io/llava-interactive/) [Demo](https://llavainteractive.ngrok.io/) [Code](https://github.com/LLaVA-VL/LLaVA-Interactive-Demo) [Paper](https://arxiv.org/abs/2311.00571) - [10/26] 🔥 LLaVA-1.5 with LoRA achieves comparable performance as full-model finetuning, with a reduced GPU RAM requirement (ckpts) (script). We also provide a doc on how to finetune LLaVA-1.5 on your own dataset with LoRA. - [10/12] Check out the Korean LLaVA (Ko-LLaVA), created by ETRI, who has generously supported our research! [🤗 Demo](https://huggingface.co/spaces/etri-vilab/Ko-LLaVA) - [10/5] 🔥 LLaVA-1.5 is out! Achieving SoTA on 11 benchmarks, with just simple modifications to the original LLaVA, utilizes all public data, completes training in ~1 day on a single 8-A100 node, and surpasses methods like Qwen-VL-Chat that use billion-scale data. Check out the technical report, and explore the demo! Models are available in Model Zoo. The training data and scripts of LLaVA-1.5 are released here, and evaluation scripts are released here. - [9/26] LLaVA is improved with reinforcement learning from human feedback (RLHF) to improve fact grounding and reduce hallucination. Check out the new SFT and RLHF checkpoints at project LLavA-RLHF. - [9/22] LLaVA is accepted by NeurIPS 2023 as oral presentation, and LLaVA-Med is accepted by NeurIPS 2023 Datasets and Benchmarks Track as spotlight presentation.
More - [11/6] Support Intel dGPU and CPU platforms. More details here. - [10/12] LLaVA is now supported in llama.cpp with 4-bit / 5-bit quantization support! - [10/11] The training data and scripts of LLaVA-1.5 are released here, and evaluation scripts are released here! - [10/10] Roboflow Deep Dive: First Impressions with LLaVA-1.5. - [9/20] We summarize our empirical study of training 33B and 65B LLaVA models in a note. Further, if you are interested in the comprehensive review, evolution and trend of multimodal foundation models, please check out our recent survey paper "Multimodal Foundation Models: From Specialists to General-Purpose Assistants".

- [7/19] We release a major upgrade, including support for LLaMA-2, LoRA training, 4-/8-bit inference, higher resolution (336x336), and a lot more. We release LLaVA Bench for benchmarking open-ended visual chat with results from Bard and Bing-Chat. We also support and verify training with RTX 3090 and RTX A6000. Check out LLaVA-from-LLaMA-2, and our model zoo! - [6/26] CVPR 2023 Tutorial on Large Multimodal Models: Towards Building and Surpassing Multimodal GPT-4! Please check out Slides Notes YouTube Bilibli. - [6/11] We released the preview for the most requested feature: DeepSpeed and LoRA support! Please see documentations here. - [6/1] We released LLaVA-Med: Large Language and Vision Assistant for Biomedicine, a step towards building biomedical domain large language and vision models with GPT-4 level capabilities. Checkout the paper and page. - [5/6] We are releasing LLaVA-Lighting-MPT-7B-preview, based on MPT-7B-Chat! See here for more details. - [5/2] We are releasing LLaVA-Lighting! Train a lite, multimodal GPT-4 with just $40 in 3 hours! See here for more details. - [4/27] Thanks to the community effort, LLaVA-13B with 4-bit quantization allows you to run on a GPU with as few as 12GB VRAM! Try it out here. - [4/17] We released LLaVA: Large Language and Vision Assistant. We propose visual instruction tuning, towards building large language and vision models with GPT-4 level capabilities. Checkout the paper and demo.
[Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-green.svg) **Usage and License Notices**: This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to the OpenAI Terms of Use for the dataset and the specific licenses for base language models for checkpoints trained using the dataset (e.g. Llama community license for LLaMA-2 and Vicuna-v1.5). This project does not impose any additional constraints beyond those stipulated in the original licenses. Furthermore, users are reminded to ensure that their use of the dataset and checkpoints is in compliance with all applicable laws and regulations. ## Contents - [Install](#install) - [LLaVA Weights](#llava-weights) - [Demo](#Demo) - [Model Zoo](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md) - [Dataset](https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md) - [Train](#train) - [Evaluation](#evaluation) #### Suggested labels ####

333: Paper Digest: NeurIPS-2023 Highlights (Full List)

### DetailsSimilarity score: 0.89 - [ ] [Paper Digest: NeurIPS-2023 Highlights (Full List)](https://www.paperdigest.org/data/neurips-2023-full.html) Paper Digest: NeurIPS 2023 Highlights https://www.paperdigest.org 1, Toolformer: Language Models Can Teach Themselves to Use Tools Timo Schick; Jane Dwivedi-Yu; Roberto Dessi; Roberta Raileanu; Maria Lomeli; Eric Hambro; Luke Zettlemoyer; Nicola Cancedda; Thomas Scialom; Related Papers   Related Patents   Related Grants   Related Venues   Related Experts   Related Code   View Highlight: In this paper, we show that LMs can teach themselves to *use external tools* via simple APIs and achieve the best of both worlds. 2, Self-Refine: Iterative Refinement with Self-Feedback Aman Madaan; Niket Tandon; Prakhar Gupta; Skyler Hallinan; Luyu Gao; Sarah Wiegreffe; Uri Alon; Nouha Dziri; Shrimai Prabhumoye; Yiming Yang; Shashank Gupta; Bodhisattwa Prasad Majumder; Katherine Hermann; Sean Welleck; Amir Yazdanbakhsh; Peter Clark; Related Papers   Related Patents   Related Grants   Related Venues   Related Experts   Related Code   View Highlight: Motivated by how humans refine their written text, we introduce Self-Refine, an approach for improving initial outputs from LLMs through iterative feedback and refinement. 3, Vicuna Evaluation: Exploring LLM-as-a-Judge and Chatbot Arena Lianmin Zheng; Wei-Lin Chiang; Ying Sheng; Siyuan Zhuang; Zhanghao Wu; Yonghao Zhuang; Zi Lin; Zhuohan Li; Dacheng Li; Eric Xing; Hao Zhang; Joseph Gonzalez; Ion Stoica; Related Papers   Related Patents   Related Grants   Related Venues   Related Experts   View Highlight: To address this, we explore using strong LLMs as judges to evaluate these models on more open-ended questions. We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as limited reasoning ability, and propose solutions to mitigate some of them. #### Suggested labels #### { "key": "LLM-Applications", "value": "Topics related to practical applications of Large Language Models in various fields" }

317: Streaming-llm: Efficient Streaming Language Models with Attention Sinks

### DetailsSimilarity score: 0.89 - [ ] [mit-han-lab/streaming-llm: Efficient Streaming Language Models with Attention Sinks](https://github.com/mit-han-lab/streaming-llm) # Efficient Streaming Language Models with Attention Sinks [[paper](http://arxiv.org/abs/2309.17453)] [[slides](assets/StreamingLLM.pdf)][[video](https://youtu.be/hvJsEzP34o8)] ![schemes](figures/schemes.png) https://github.com/mit-han-lab/streaming-llm/assets/40906949/2bd1cda4-a0bd-47d1-a023-fbf7779b8358 ## TL;DR We deploy LLMs for infinite-length inputs without sacrificing efficiency and performance. ## News - [2024/01] [SwiftInfer](https://github.com/hpcaitech/SwiftInfer), a TensorRT-based implementation makes StreamingLLM more production-grade. - [2024/01] StreamingLLM is integrated into NVIDIA [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama#run-llama-with-streamingllm)! - [2023/12] StreamingLLM enables endless and efficient LLM generation on [iPhone](https://x.com/davidpissarra/status/1735761373261427189?s=20)! - [2023/12] StreamingLLM is integrated by HuggingFace Transformers' [main branch](https://github.com/huggingface/transformers/pull/26681). - [2023/10] StreamingLLM is integrated into [Intel Extension for Transformers](https://github.com/intel/intel-extension-for-transformers). - [2023/10] Check out [Attention Sinks](https://github.com/tomaarsen/attention_sinks), a third-party implementation to enable StreamingLLM on more Huggingface LLMs. ## Abstract Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly, during the decoding stage, caching previous tokens' Key and Value states (KV) consumes extensive memory. Secondly, popular LLMs cannot generalize to longer texts than the training sequence length. Window attention, where only the most recent KVs are cached, is a natural approach --- but we show that it fails when the text length surpasses the cache size. We observe an interesting phenomenon, namely attention sink, that keeping the KV of initial tokens will largely recover the performance of window attention. In this paper, we first demonstrate that the emergence of attention sink is due to the strong attention scores towards initial tokens as a ``sink'' even if they are not semantically important. Based on the above analysis, we introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite length attention window to generalize to infinite sequence length without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more. In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment. In streaming settings, StreamingLLM outperforms the sliding window recomputation baseline by up to 22.2x speedup. ## Usage ### Environment Setup ```bash conda create -yn streaming python=3.8 conda activate streaming pip install torch torchvision torchaudio pip install transformers==4.33.0 accelerate datasets evaluate wandb scikit-learn scipy sentencepiece python setup.py develop ``` ### Run Streaming Llama Chatbot ```bash CUDA_VISIBLE_DEVICES=0 python examples/run_streaming_llama.py --enable_streaming ``` ## FAQ 1. **What does "working on infinite-length inputs" imply for LLMs?** Handling infinite-length text with LLMs presents challenges. 
Notably, storing all previous Key and Value (KV) states demands significant memory, and models might struggle to generate text beyond their training sequence length. StreamingLLM addresses this by retaining only the most recent tokens and attention sinks, discarding intermediate tokens. This enables the model to generate coherent text from recent tokens without a cache reset — a capability not seen in earlier methods. 2. **Is the context window of LLMs expanded?** No. The context window remains unchanged. Only the most recent tokens and attention sinks are retained, discarding middle tokens. This means the model can only process the latest tokens. The context window remains constrained by its initial pre-training. For instance, if Llama-2 is pre-trained with a context window of 4096 tokens, then the maximum cache size for StreamingLLM on Llama-2 remains 4096. 3. **Can I input an extensive text, like a book, into StreamingLLM for summarization?** While you can input a lengthy text, the model will only recognize the latest tokens. Thus, if a book is an input, StreamingLLM might only summarize the concluding paragraphs, which might not be very insightful. As emphasized earlier, we neither expand the LLMs' context window nor enhance their long-term memory. StreamingLLM's strength lies in generating fluent text from recent tokens without needing a cache refresh. 4. **What is the ideal use case for StreamingLLM?** StreamingLLM is optimized for streaming applications, such as multi-round dialogues. It's ideal for scenarios where a model needs to operate continually without requiring extensive memory or dependency on past data. An example is a daily assistant based on LLMs. StreamingLLM would let the model function continuously, basing its responses on recent conversations without needing to refresh its cache. Earlier methods would either need a cache reset when the conversation length exceeded the training length (losing recent context) or recompute KV states from recent text history, which can be time-consuming. 5. **How does StreamingLLM relate to recent works on context extension?** StreamingLLM is orthogonal to recent context extension methods and can be integrated with them. In StreamingLLM's context, "context extension" refers to the possibility of using a larger cache size to store more recent tokens. For a practical demonstration, refer to Figure 9 in our paper, where we implement StreamingLLM with models like LongChat-7B-v1.5-32K and Llama-2-7B-32K-Instruct. ## TODOs We will release the code and data in the following order, please stay tuned! - [x] Release core code of StreamingLLM, including Llama-2, MPT, Falcon, and Pythia. - [x] Release perplexity evaluation code - [x] Release Streaming Llama Chatbot demo. - [ ] Release StreamEval dataset and evaluation code. ## Citation If you find StreamingLLM useful or relevant to your project and research, please kindly cite our paper: ```bibtex @article{xiao2023streamingllm, title={Efficient Streaming Language Models with Attention Sinks}, author={Xiao, Guangxuan and Tian, Yuandong and Chen, Beidi and Han, Song and Lewis, Mike}, journal={arXiv}, year={2023} } ```

332: streaming-llm: Efficient Streaming Language Models with Attention Sinks

### DetailsSimilarity score: 0.88 > **Note: Efficient Streaming Language Models with Attention Sinks** > > [mit-han-lab/streaming-llm: Efficient Streaming Language Models with Attention Sinks](https://github.com/mit-han-lab/streaming-llm) > > **TL;DR** > > We deploy LLMs for infinite-length inputs without sacrificing efficiency and performance. > > **News** > > - [2023/10] StreamingLLM is integrated into Intel Extension for Transformers. > - [2023/10] Check out Attention Sinks, a third-party implementation to enable StreamingLLM on more Huggingface LLMs. > > **Abstract** > > Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly, during the decoding stage, caching previous tokens' Key and Value states (KV) consumes extensive memory. Secondly, popular LLMs cannot generalize to longer texts than the training sequence length. Window attention, where only the most recent KVs are cached, is a natural approach --- but we show that it fails when the text length surpasses the cache size. We observe an interesting phenomenon, namely attention sink, that keeping the KV of initial tokens will largely recover the performance of window attention. In this paper, we first demonstrate that the emergence of attention sink is due to the strong attention scores towards initial tokens as a "sink" even if they are not semantically important. Based on the above analysis, we introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite length attention window to generalize to infinite sequence length without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more. In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment. In streaming settings, StreamingLLM outperforms the sliding window recomputation baseline by up to 22.2x speedup. > > **Usage** > > **Environment Setup** > > ``` > conda create -yn streaming python=3.8 > conda activate streaming > > pip install torch torchvision torchaudio > pip install transformers==4.33.0 accelerate datasets evaluate wandb scikit-learn scipy sentencepiece > > python setup.py develop > ``` > > **Run Streaming Llama Chatbot** > > ``` > CUDA_VISIBLE_DEVICES=0 python examples/run_streaming_llama.py --enable_streaming > ``` > > **FAQ** > > **What does "working on infinite-length inputs" imply for LLMs?** > > Handling infinite-length text with LLMs presents challenges. Notably, storing all previous Key and Value (KV) states demands significant memory, and models might struggle to generate text beyond their training sequence length. StreamingLLM addresses this by retaining only the most recent tokens and attention sinks, discarding intermediate tokens. This enables the model to generate coherent text from recent tokens without a cache reset — a capability not seen in earlier methods. > > **Is the context window of LLMs expanded?** > > No. The context window remains unchanged. Only the most recent tokens and attention sinks are retained, discarding middle tokens. This means the model can only process the latest tokens. The context window remains constrained by its initial pre-training. 
For instance, if Llama-2 is pre-trained with a context window of 4096 tokens, then the maximum cache size for StreamingLLM on Llama-2 remains 4096. > > **Can I input an extensive text, like a book, into StreamingLLM for summarization?** > > While you can input a lengthy text, the model will only recognize the latest tokens.

657: Finetuning LLMs for ReAct. Unleashing the power of finetuning to… | by Pranav Jadhav | Feb, 2024 | Towards AI

### DetailsSimilarity score: 0.88 - [ ] [Finetuning LLMs for ReAct. Unleashing the power of finetuning to… | by Pranav Jadhav | Feb, 2024 | Towards AI](https://pub.towardsai.net/finetuning-llms-for-react-9ab291d84ddc) # Finetuning LLMs for ReAct **Description:** Finetuning LLMs for ReAct Unleashing the power of finetuning to improve multi-hop question-answering ability in LLMs. **Author:** Pranav Jadhav **Published in:** Towards AI **Reading Time:** 14 min read **Published:** 6 days ago **Views:** 71 ![Image](https://unsplash.com/photos/XXXXX) In this article, I will share my findings in benchmarking and finetuning open-source language models for ReAct (Reasoning + Acting). I demonstrate that finetuning can dramatically improve the accuracy of LLMs in answering multi-hop questions using ReAct. I also present a new dataset that can be used to finetune models for the ReAct format presented by the original paper (Yao et al., 2022). My findings indicate that, through finetuning, open-source LLMs show promise for making agents that can effectively reason and use tools. **Language Models Reasoning?** Since ChatGPT started the language model gold rush, we’ve been consistently surprised by the abilities of these neural networks to imitate our speech and writing. However, a key component of intelligence that distanced these models from ourselves was reasoning. The reasoning barrier first faltered when chain-of-thought (CoT) prompting was introduced by Wei et al. in 2022. They found that simply prompting the language model to “think step by step” and output intermediate reasoning steps improved accuracy on question-answering tasks. However, the reasoning ability of LLMs didn’t end there. Another development in reasoning was chain-of-thought with self-consistency (CoT-SC), where multiple reasoning traces were generated and the majority answer is returned as the final answer (Wang et al., 2022). Then in late 2022, a team of researchers from Princeton University and Google Research published a paper called ReAct: Synergizing Reasoning and Acting in Language Models. In this paper, the team introduces a method of prompting LLMs to output a sequence of thought, action, and observation steps to reach a final answer. **What is ReAct?** Simply put, ReAct is a prompting strategy to force an LLM to “reason” about what it is doing and interact with tools using actions. I will give a basic explanation here, but for a deep dive, I recommend looking at the blog post or the paper. [Read More](https://pub.towardsai.net/finetuning-llms-for-react-9ab291d84ddc) #### Suggested labels #### {'label-name': 'ReAct-Prompting', 'label-description': 'Describes the method of prompting LLMs to output a sequence of thought, action, and observation steps to reach a final answer', 'gh-repo': 'https://pub.towardsai.net/finetuning-llms-for-react-9ab291d84ddc', 'confidence': 63.39}

494: Awesome-Efficient-LLM: A curated list for Efficient Large Language Models

### DetailsSimilarity score: 0.88 - [ ] [horseee/Awesome-Efficient-LLM: A curated list for Efficient Large Language Models](https://github.com/horseee/Awesome-Efficient-LLM#inference-acceleration) # Awesome-Efficient-LLM A curated list for [Efficient Large Language Models](https://github.com/horseee/Awesome-Efficient-LLM): - [Knowledge Distillation](#knowledge-distillation) - [Network Pruning](#network-pruning) - [Quantization](#quantization) - [Inference Acceleration](#inference-acceleration) - [Efficient MOE](#efficient-moe) - [Text Compression](#text-compression) - [Low-Rank Decomposition](#low-rank-decomposition) - [Hardware/System Tuning](#hardwareSystem-tuning) - [Survey](#survey) - [Leaderboard](#leaderboard) - [🚀 Updates](#updates) - [Contributing](#contributing) --- ## Inference Acceleration - … - [Add your paper here](https://github.com/horseee/Awesome-Efficient-LLM/blob/main/generate_item.py), [generate the required format](https://github.com/horseee/Awesome-Efficient-LLM#decontributing), and submit a pull request. --- ## Updates - **Sep 27, 2023:** Add tag for papers accepted at NeurIPS'23. - **Sep 6, 2023:** Add a new subdirectory `project/` to organize those projects designed for developing a lightweight LLM. - **July 11, 2023:** Create a new subdirectory `efficient_plm/` for papers applicable to PLMs (such as BERT, BART) but have yet to be verified for their effectiveness on LLMs. --- ## Contributing If you'd like to include your paper or need to update any details, please feel free to submit a pull request. You can generate the required markdown format for each paper by filling in the information in `generate_item.py` and execute `python generate_item.py`. We warmly appreciate your contributions to this list. Alternatively, you can email me with the links to your paper and code, and I would add your paper to the list at my earliest convenience. - URL: [https://github.com/horseee/Awesome-Efficient-LLM#inference-acceleration](https://github.com/horseee/Awesome-Efficient-LLM#inference-acceleration) #### Suggested labels #### { "label-name": "efficient-llm-acceleration", "description": "Inference acceleration techniques for efficient large language models.", "repo": "horseee/Awesome-Efficient-LLM", "confidence": 70.8 }
irthomasthomas commented 2 months ago

Related content

681 - Similarity score: 1.0

333 - Similarity score: 0.89

706 - Similarity score: 0.89

628 - Similarity score: 0.88

317 - Similarity score: 0.88

332 - Similarity score: 0.88