irthomasthomas / undecidability

6 stars 2 forks source link

Finetuning LLMs for ReAct. Unleashing the power of finetuning to… | by Pranav Jadhav | Feb, 2024 | Towards AI #657

Open irthomasthomas opened 7 months ago

irthomasthomas commented 7 months ago

Finetuning LLMs for ReAct

Description:
Finetuning LLMs for ReAct Unleashing the power of finetuning to improve multi-hop question-answering ability in LLMs.

Author: Pranav Jadhav
Published in: Towards AI
Reading Time: 14 min read
Published: 6 days ago
Views: 71

Image

In this article, I will share my findings in benchmarking and finetuning open-source language models for ReAct (Reasoning + Acting). I demonstrate that finetuning can dramatically improve the accuracy of LLMs in answering multi-hop questions using ReAct. I also present a new dataset that can be used to finetune models for the ReAct format presented by the original paper (Yao et al., 2022). My findings indicate that, through finetuning, open-source LLMs show promise for making agents that can effectively reason and use tools.

Language Models Reasoning?
Since ChatGPT started the language model gold rush, we’ve been consistently surprised by the abilities of these neural networks to imitate our speech and writing. However, a key component of intelligence that distanced these models from ourselves was reasoning. The reasoning barrier first faltered when chain-of-thought (CoT) prompting was introduced by Wei et al. in 2022. They found that simply prompting the language model to “think step by step” and output intermediate reasoning steps improved accuracy on question-answering tasks. However, the reasoning ability of LLMs didn’t end there. Another development in reasoning was chain-of-thought with self-consistency (CoT-SC), where multiple reasoning traces were generated and the majority answer is returned as the final answer (Wang et al., 2022). Then in late 2022, a team of researchers from Princeton University and Google Research published a paper called ReAct: Synergizing Reasoning and Acting in Language Models. In this paper, the team introduces a method of prompting LLMs to output a sequence of thought, action, and observation steps to reach a final answer.

What is ReAct?
Simply put, ReAct is a prompting strategy to force an LLM to “reason” about what it is doing and interact with tools using actions. I will give a basic explanation here, but for a deep dive, I recommend looking at the blog post or the paper.

Read More

Suggested labels

{'label-name': 'ReAct-Prompting', 'label-description': 'Describes the method of prompting LLMs to output a sequence of thought, action, and observation steps to reach a final answer', 'gh-repo': 'https://pub.towardsai.net/finetuning-llms-for-react-9ab291d84ddc', 'confidence': 63.39}

irthomasthomas commented 7 months ago

Related issues

333: Paper Digest: NeurIPS-2023 Highlights (Full List)

### DetailsSimilarity score: 0.88 - [ ] [Paper Digest: NeurIPS-2023 Highlights (Full List)](https://www.paperdigest.org/data/neurips-2023-full.html) Paper Digest: NeurIPS 2023 Highlights https://www.paperdigest.org 1, Toolformer: Language Models Can Teach Themselves to Use Tools Timo Schick; Jane Dwivedi-Yu; Roberto Dessi; Roberta Raileanu; Maria Lomeli; Eric Hambro; Luke Zettlemoyer; Nicola Cancedda; Thomas Scialom; Related Papers   Related Patents   Related Grants   Related Venues   Related Experts   Related Code   View Highlight: In this paper, we show that LMs can teach themselves to *use external tools* via simple APIs and achieve the best of both worlds. 2, Self-Refine: Iterative Refinement with Self-Feedback Aman Madaan; Niket Tandon; Prakhar Gupta; Skyler Hallinan; Luyu Gao; Sarah Wiegreffe; Uri Alon; Nouha Dziri; Shrimai Prabhumoye; Yiming Yang; Shashank Gupta; Bodhisattwa Prasad Majumder; Katherine Hermann; Sean Welleck; Amir Yazdanbakhsh; Peter Clark; Related Papers   Related Patents   Related Grants   Related Venues   Related Experts   Related Code   View Highlight: Motivated by how humans refine their written text, we introduce Self-Refine, an approach for improving initial outputs from LLMs through iterative feedback and refinement. 3, Vicuna Evaluation: Exploring LLM-as-a-Judge and Chatbot Arena Lianmin Zheng; Wei-Lin Chiang; Ying Sheng; Siyuan Zhuang; Zhanghao Wu; Yonghao Zhuang; Zi Lin; Zhuohan Li; Dacheng Li; Eric Xing; Hao Zhang; Joseph Gonzalez; Ion Stoica; Related Papers   Related Patents   Related Grants   Related Venues   Related Experts   View Highlight: To address this, we explore using strong LLMs as judges to evaluate these models on more open-ended questions. We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as limited reasoning ability, and propose solutions to mitigate some of them. #### Suggested labels #### { "key": "LLM-Applications", "value": "Topics related to practical applications of Large Language Models in various fields" }

546: [2304.11015] DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction

### DetailsSimilarity score: 0.87 - [ ] [[2304.11015] DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction](https://arxiv.org/abs/2304.11015) # [2304.11015] DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction **DESCRIPTION:** DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction Mohammadreza Pourreza, Davood Rafiei There is currently a significant gap between the performance of fine-tuned models and prompting approaches using Large Language Models (LLMs) on the challenging task of text-to-SQL, as evaluated on datasets such as Spider. To improve the performance of LLMs in the reasoning process, we study how decomposing the task into smaller sub-tasks can be effective. In particular, we show that breaking down the generation problem into sub-problems and feeding the solutions of those sub-problems into LLMs can be an effective approach for significantly improving their performance. Our experiments with three LLMs show that this approach consistently improves their simple few-shot performance by roughly 10%, pushing the accuracy of LLMs towards SOTA or surpassing it. On the holdout test set of Spider, the SOTA, in terms of execution accuracy, was 79.9 and the new SOTA at the time of this writing using our approach is 85.3. Our approach with in-context learning beats many heavily fine-tuned models by at least 5%. Additionally, when evaluated on the BIRD benchmark, our approach achieved an execution accuracy of 55.9%, setting a new SOTA on its holdout test set. URL: [https://arxiv.org/abs/2304.11015](https://arxiv.org/abs/2304.11015) #### Suggested labels #### {'label-name': 'Text-to-SQL', 'label-description': 'Focuses on generating SQL queries from natural language text.', 'confidence': 76.74}

332: streaming-llm: Efficient Streaming Language Models with Attention Sinks

### DetailsSimilarity score: 0.85 > **Note: Efficient Streaming Language Models with Attention Sinks** > > [mit-han-lab/streaming-llm: Efficient Streaming Language Models with Attention Sinks](https://github.com/mit-han-lab/streaming-llm) > > **TL;DR** > > We deploy LLMs for infinite-length inputs without sacrificing efficiency and performance. > > **News** > > - [2023/10] StreamingLLM is integrated into Intel Extension for Transformers. > - [2023/10] Check out Attention Sinks, a third-party implementation to enable StreamingLLM on more Huggingface LLMs. > > **Abstract** > > Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly, during the decoding stage, caching previous tokens' Key and Value states (KV) consumes extensive memory. Secondly, popular LLMs cannot generalize to longer texts than the training sequence length. Window attention, where only the most recent KVs are cached, is a natural approach --- but we show that it fails when the text length surpasses the cache size. We observe an interesting phenomenon, namely attention sink, that keeping the KV of initial tokens will largely recover the performance of window attention. In this paper, we first demonstrate that the emergence of attention sink is due to the strong attention scores towards initial tokens as a "sink" even if they are not semantically important. Based on the above analysis, we introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite length attention window to generalize to infinite sequence length without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more. In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment. In streaming settings, StreamingLLM outperforms the sliding window recomputation baseline by up to 22.2x speedup. > > **Usage** > > **Environment Setup** > > ``` > conda create -yn streaming python=3.8 > conda activate streaming > > pip install torch torchvision torchaudio > pip install transformers==4.33.0 accelerate datasets evaluate wandb scikit-learn scipy sentencepiece > > python setup.py develop > ``` > > **Run Streaming Llama Chatbot** > > ``` > CUDA_VISIBLE_DEVICES=0 python examples/run_streaming_llama.py --enable_streaming > ``` > > **FAQ** > > **What does "working on infinite-length inputs" imply for LLMs?** > > Handling infinite-length text with LLMs presents challenges. Notably, storing all previous Key and Value (KV) states demands significant memory, and models might struggle to generate text beyond their training sequence length. StreamingLLM addresses this by retaining only the most recent tokens and attention sinks, discarding intermediate tokens. This enables the model to generate coherent text from recent tokens without a cache reset — a capability not seen in earlier methods. > > **Is the context window of LLMs expanded?** > > No. The context window remains unchanged. Only the most recent tokens and attention sinks are retained, discarding middle tokens. This means the model can only process the latest tokens. The context window remains constrained by its initial pre-training. For instance, if Llama-2 is pre-trained with a context window of 4096 tokens, then the maximum cache size for StreamingLLM on Llama-2 remains 4096. > > **Can I input an extensive text, like a book, into StreamingLLM for summarization?** > > While you can input a lengthy text, the model will only recognize the latest tokens.

314: Prompt Engineering Guide | Prompt Engineering Guide

### DetailsSimilarity score: 0.85 - [ ] [Prompt Engineering Guide | Prompt Engineering Guide](https://www.promptingguide.ai/) Prompt Engineering Guide Prompt engineering is a relatively new discipline for developing and optimizing prompts to efficiently use language models (LMs) for a wide variety of applications and research topics. Prompt engineering skills help to better understand the capabilities and limitations of large language models (LLMs). Researchers use prompt engineering to improve the capacity of LLMs on a wide range of common and complex tasks such as question answering and arithmetic reasoning. Developers use prompt engineering to design robust and effective prompting techniques that interface with LLMs and other tools. Prompt engineering is not just about designing and developing prompts. It encompasses a wide range of skills and techniques that are useful for interacting and developing with LLMs. It's an important skill to interface, build with, and understand capabilities of LLMs. You can use prompt engineering to improve safety of LLMs and build new capabilities like augmenting LLMs with domain knowledge and external tools. Motivated by the high interest in developing with LLMs, we have created this new prompt engineering guide that contains all the latest papers, advanced prompting techniques, learning guides, model-specific prompting guides, lectures, references, new LLM capabilities, and tools related to prompt engineering.

153: Mastering LLM Techniques: Inference Optimization | NVIDIA Technical Blog

### DetailsSimilarity score: 0.85 - [ ] [Mastering LLM Techniques: Inference Optimization | NVIDIA Technical Blog](https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/) Stacking transformer layers to create large models results in better accuracies, few-shot learning capabilities, and even near-human emergent abilities on a wide range of language tasks. These foundation models are expensive to train, and they can be memory- and compute-intensive during inference (a recurring cost). The most popular large language models (LLMs) today can reach tens to hundreds of billions of parameters in size and, depending on the use case, may require ingesting long inputs (or contexts), which can also add expense.  This post discusses the most pressing challenges in LLM inference, along with some practical solutions. Readers should have a basic understanding of transformer architecture and the attention mechanism in general. It is essential to have a grasp of the intricacies of LLM inference, which we will address in the next section.

130: Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models | Papers With Code

### DetailsSimilarity score: 0.85 - [ ] [Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models | Papers With Code](https://paperswithcode.com/paper/language-agent-tree-search-unifies-reasoning)
Quote While large language models (LLMs) have demonstrated impressive performance on a range of decision-making tasks, they rely on simple acting processes and fall short of broad deployment as autonomous agents. We introduce LATS (Language Agent Tree Search), a general framework that synergizes the capabilities of LLMs in planning, acting, and reasoning. Drawing inspiration from Monte Carlo tree search in model-based reinforcement learning, LATS employs LLMs as agents, value functions, and optimizers, repurposing their latent strengths for enhanced decision-making. What is crucial in this method is the use of an environment for external feedback, which offers a more deliberate and adaptive problem-solving mechanism that moves beyond the limitations of existing techniques. Our experimental evaluation across diverse domains, such as programming, HotPotQA, and WebShop, illustrates the applicability of LATS for both reasoning and acting. In particular, LATS achieves 94.4\% for programming on HumanEval with GPT-4 and an average score of 75.9 for web browsing on WebShop with GPT-3.5, demonstrating the effectiveness and generality of our method.