irthomasthomas / undecidability


Fast Classifiers for Prompt Routing #626

Open irthomasthomas opened 4 months ago

irthomasthomas commented 4 months ago

classifiers/README.md

Fast Classifiers for Prompt Routing

Routing and controlling information flow is a core component of optimizing machine learning tasks. While some architectures focus on routing data internally within a model, we focus on routing data externally between models. This lets open-source, proprietary, API-based, and software-based approaches work together behind a smart router. We investigate three ways of routing a prompt externally: cosine similarity over embeddings, zero-shot classification, and small classifiers.

Implementation of Fast Classifiers

The code-class.ipynb Jupyter notebook walks through creating a fast prompt classifier for smart routing. For the fast classifiers, we use DistilBERT, a smaller language representation model designed for efficient on-the-edge operation and training under computational constraints. DistilBERT is not only cheaper to pre-train but also well suited to on-device computation, as demonstrated through experiments and comparative studies.

We quantize the model using Optimum, enabling the model to run extremely fast on a CPU router. Each classifier takes 5-8ms to run. An ensemble of 8 prompt classifiers takes about 50ms in total. Thus, each endpoint can route about 20 requests per second.
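The throughput figure follows from the latency numbers. As a quick back-of-the-envelope check, assuming the classifiers in the ensemble run one after another on a single endpoint (the function name is illustrative and not taken from the notebook):

```python
def requests_per_second(n_classifiers: int, ms_per_classifier: float) -> float:
    """Throughput of one endpoint that runs the classifiers sequentially."""
    total_ms = n_classifiers * ms_per_classifier  # latency of the whole ensemble
    return 1000.0 / total_ms


# 8 classifiers at 5-8 ms each give 40-64 ms per request,
# which brackets the quoted ~50 ms (about 20 requests per second).
fastest = requests_per_second(8, 5.0)  # 25.0 requests/s
slowest = requests_per_second(8, 8.0)  # 15.625 requests/s
```

If the classifiers were run in parallel instead, the per-request latency would approach that of the slowest single classifier, so the sequential assumption is the conservative one.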

In the code-class example, we decide whether a prompt is a code prompt or not. Two datasets are used: the 52K instruction-following data generated by GPT-4 from the Alpaca prompts, and the 20K instruction-following data used for fine-tuning the Code Alpaca model.

An 80/20 train/test split yields an accuracy of 95.49% and an F1 score of 0.9227.
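Accuracy and F1 can differ noticeably when the classes are imbalanced, as they are here (52K general prompts versus 20K code prompts). The sketch below shows how the two metrics are computed for a binary classifier, assuming the code class is treated as the positive label; the notebook may compute them differently, and the labels here are toy data rather than results from the repo.

```python
from typing import Sequence, Tuple


def accuracy_and_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (accuracy, F1 of the positive class) for binary labels 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)

    accuracy = correct / len(y_true)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0  # F1 = 2TP / (2TP + FP + FN)
    return accuracy, f1


# Toy labels: 1 = code prompt, 0 = not code.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
acc, f1 = accuracy_and_f1(y_true, y_pred)
```

Because true negatives count toward accuracy but not toward F1, a classifier on a dataset dominated by the negative class will usually report a higher accuracy than F1.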

Comparison vs other Routing methods

The most popular alternative routing method is embedding similarity. For example, to route a programming question, one might define the target classes as ["coding", "not coding"]. Each of these strings is transformed into an embedding and compared against a prompt query such as "write a bubble sort in python". Given the pairwise cosine similarity between the query and each class, we can label the prompt as a coding question and route it to a coding-specific model. Embedding-based routers do not scale well with larger numbers of embeddings, nor can they capture non-semantic classes (such as whether the response is likely to be more or less than 200 tokens). However, they are adaptable and comparably fast, and thus provide a good alternative to the trained fast classifiers.
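A minimal sketch of embedding-similarity routing follows. The three-dimensional vectors are hypothetical stand-ins for real embedding-model output, and the `route` helper is an illustrative name, not part of this repo.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def route(query_vec: List[float], class_vecs: Dict[str, List[float]]) -> str:
    """Pick the class whose embedding is most similar to the query embedding."""
    return max(class_vecs, key=lambda name: cosine_similarity(query_vec, class_vecs[name]))


# Hypothetical embeddings; a real router would obtain these from an embedding model.
class_vecs = {
    "coding": [0.9, 0.1, 0.0],
    "not coding": [0.1, 0.9, 0.2],
}
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of "write a bubble sort in python"
label = route(query_vec, class_vecs)
```

Note that `route` performs one similarity computation per class, which is why the cost of this approach grows with the number of classes.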


The figure quantifies the different routing methods in terms of execution time. As the prompt size increases, the query time also increases, as shown in (a). The time also increases close to linearly with the number of classes, as shown in (b). However, the time taken by the small classifiers does not increase as the class examples grow in number of tokens (c). This is because the cost of training the binary classifier is paid upfront, which reduces the cost at inference.

Reproducibility

The timing_tests.js and complexity.js files can be used for reproducibility. Note that only the code classifier is currently available in this repo. One will need to install the appropriate models from the Transformers.js repo.


Suggested labels

{'label-name': 'Prompt-Routing', 'label-description': 'Focuses on external routing of data between models to optimize machine learning tasks.', 'confidence': 50.24}

irthomasthomas commented 4 months ago

Related issues

498: CodeGPTPlus/deepseek-coder-1.3b-typescript · Hugging Face

### Details

Similarity score: 0.88

- [ ] [CodeGPTPlus/deepseek-coder-1.3b-typescript · Hugging Face](https://huggingface.co/CodeGPTPlus/deepseek-coder-1.3b-typescript)

# CodeGPTPlus/deepseek-coder-1.3b-typescript

This is a fine-tuned model by the CodeGPT team, specifically crafted for generating expert code in TypeScript. It is fine-tuned from `deepseek-ai/deepseek-coder-1.3b-base` with a dataset of 0.5B tokens, making it an excellent choice for precise and efficient TypeScript code generation. The model uses a 16K window size and an additional fill-in-the-middle task for project-level code completion.

## How to Use

This model is for completion purposes only. Here are some examples of how to use the model:

### Running the model on a GPU

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("CodeGPTPlus/deepseek-coder-1.3b-typescript", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("CodeGPTPlus/deepseek-coder-1.3b-typescript", trust_remote_code=True).cuda()

input_text = """<|fim begin|>function quickSort(arr: number[]): number[] {
  if (arr.length <= 1) {
    return arr;
  }
  const pivot = arr[0];
  const left = [];
  const right = [];
<|fim hole|>
  return [...quickSort(left), pivot, ...quickSort(right)];
}<|fim end|>"""

inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Running with Ollama

- Model: [https://ollama.ai/codegpt/deepseek-coder-1.3b-typescript](https://ollama.ai/codegpt/deepseek-coder-1.3b-typescript)
- Command: `ollama run codegpt/deepseek-coder-1.3b-typescript`

### Running with Ollama and CodeGPT Autocomplete in VSCode

- Documentation: [https://docs.codegpt.co/docs/tutorial-features/code\_autocompletion](https://docs.codegpt.co/docs/tutorial-features/code_autocompletion)
- Select "Ollama - codegpt/deepseek-coder-1.3b-typescript" in the autocomplete model selector.

### Fill In the Middle (FIM)

```python
<|fim begin|>function quickSort(arr: number[]): number[] {
  if (arr.length <= 1) {
    return arr;
  }
  const pivot = arr[0];
  const left = [];
  const right = [];
<|fim hole|>
  return [...quickSort(left), pivot, ...quickSort(right)];
}<|fim end|>
```

### Training Procedure

The model was trained using the following hyperparameters:

- learning\_rate: 2e-05
- train\_batch\_size: 20
- eval\_batch\_size: 20
- seed: 42
- gradient\_accumulation\_steps: 2
- total\_train\_batch\_size: 40
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-06
- lr\_scheduler\_type: cosine
- lr\_scheduler\_warmup\_steps: 261
- num\_epochs: 1

For more information, visit the [model page](https://huggingface.co/CodeGPTPlus/deepseek-coder-1.3b-typescript).

#### Suggested labels

{ "label-name": "TypeScript-Code-Generation", "description": "Model for generating TypeScript code", "repo": "CodeGPTPlus/deepseek-coder-1.3b-typescript", "confidence": 70.59 }

324: bigcode/tiny_starcoder_py · Hugging Face

### Details

Similarity score: 0.87

> **Note:**
>
> [bigcode/tiny_starcoder_py · Hugging Face](https://huggingface.co/bigcode/tiny_starcoder_py)
>
> TinyStarCoderPy
>
> This is a 164M parameters model with the same architecture as StarCoder (8k context length, MQA & FIM). It was trained on the Python data from StarCoderData for ~6 epochs which amounts to 100B tokens.
>
> Use
>
> Intended use
>
> The model was trained on GitHub code, to assist with some tasks like Assisted Generation. For pure code completion, we advise using our 15B models StarCoder or StarCoderBase.
>
> Generation
>
> ```python
> # pip install -q transformers
> from transformers import AutoModelForCausalLM, AutoTokenizer
>
> checkpoint = "bigcode/tiny_starcoder_py"
> device = "cuda"  # for GPU usage or "cpu" for CPU usage
>
> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
> model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
>
> inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
> outputs = model.generate(inputs)
> print(tokenizer.decode(outputs[0]))
> ```
>
> Fill-in-the-middle
>
> Fill-in-the-middle uses special tokens to identify the prefix/middle/suffix part of the input and output:
>
> ```python
> # The <fim_*> markers are StarCoder's special tokens; they were stripped from the scraped text.
> input_text = "<fim_prefix>def print_one_two_three():\n    print('one')\n    <fim_suffix>\n    print('three')<fim_middle>"
> inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
> outputs = model.generate(inputs)
> print(tokenizer.decode(outputs[0]))
> ```
>
> Training
>
> Model
>
> - Architecture: GPT-2 model with multi-query attention and Fill-in-the-Middle objective
> - Pretraining steps: 50k
> - Pretraining tokens: 100 billion
> - Precision: bfloat16
>
> Hardware
>
> - GPUs: 32 Tesla A100
> - Training time: 18 hours
>
> Software
>
> - Orchestration: Megatron-LM
> - Neural networks: PyTorch
> - BP16 if applicable: apex
>
> License
>
> The model is licensed under the BigCode OpenRAIL-M v1 license agreement. You can find the full agreement [here](https://huggingface.co/bigcode/tiny_starcoder_py/blob/main/LICENSE).
>
> #### Suggested labels
>
> - { "key": "llm-pretraining", "value": "Information related to the pretraining process of Large Language Models" }

326: Assisted Generation: a new direction toward low-latency text generation

### Details

Similarity score: 0.87

> **Assisted Generation: a new direction toward low-latency text generation**

**Greedy decoding with assisted generation**

Assisted generation is a balancing act. You want the assistant to quickly generate a candidate sequence while being as accurate as possible. If the assistant has poor quality, you get the cost of using the assistant model with little to no benefits. On the other hand, optimizing the quality of the candidate sequences may imply the use of slow assistants, resulting in a net slowdown. While we can't automate the selection of the assistant model for you, we’ve included an additional requirement and a heuristic to ensure the time spent with the assistant stays in check.

First, the requirement – the assistant must have the exact same tokenizer as your model. If this requirement was not in place, expensive token decoding and re-encoding steps would have to be added. Furthermore, these additional steps would have to happen on the CPU, which in turn may need slow inter-device data transfers. Fast usage of the assistant is critical for the benefits of assisted generation to show up.

Finally, the heuristic. By this point, you have probably noticed the similarities between the movie Inception and assisted generation – you are, after all, running text generation inside text generation. There will be one assistant model forward pass per candidate token, and we know that forward passes are expensive. While you can’t know in advance the number of tokens that the assistant model will get right, you can keep track of this information and use it to limit the number of candidate tokens requested to the assistant – some sections of the output are easier to anticipate than others.

Wrapping all up, here’s our original implementation of the assisted generation loop (code):

1. Use greedy decoding to generate a certain number of candidate tokens with the assistant model, producing candidates. The number of produced candidate tokens is initialized to 5 the first time assisted generation is called.
2. Using our model, do a forward pass with candidates, obtaining logits.
3. Use the token selection method (.argmax() for greedy search or .multinomial() for sampling) to get the next_tokens from logits.
4. Compare next_tokens to candidates and get the number of matching tokens. Remember that this comparison has to be done with left-to-right causality: after the first mismatch, all candidates are invalidated.
5. Use the number of matches to slice things up and discard variables related to unconfirmed candidate tokens. In essence, in next_tokens, keep the matching tokens plus the first divergent token (which our model generates from a valid candidate subsequence).
6. Adjust the number of candidate tokens to be produced in the next iteration: our original heuristic increases it by 2 if ALL tokens match and decreases it by 1 otherwise.

We’ve designed the API in 🤗 Transformers such that this process is hassle-free for you. All you need to do is to pass the assistant model under the new `assistant_model` keyword argument and reap the latency gains! At the time of the release of this blog post, assisted generation is limited to a batch size of 1.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

prompt = "Alice and Bob"
checkpoint = "EleutherAI/pythia-1.4b-deduped"
assistant_checkpoint = "EleutherAI/pythia-160m-deduped"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer(prompt, return_tensors="pt").to(device)

model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint).to(device)
outputs = model.generate(**inputs, assistant_model=assistant_model)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
# ['Alice and Bob are sitting in a bar. Alice is drinking a beer and Bob is drinking a']
```

Is the additional internal complexity worth it? Let’s have a look at the latency numbers for the greedy decoding case (results for sampling are in the next section), considering a batch size of 1. These results were pulled directly out of 🤗 Transformers without any additional optimizations, so you should be able to reproduce them in your setup.

**Assisted Generation Benchmark** (interactive Gradio Space joaogante/assisted_generation_benchmarks, not reproduced here. Tabs: OPT: Open, OPT: Summ, Whisper: ARS, CodeGen: Code, Flan-T5: Summ. Assistant model: facebook/opt-125m. Models: facebook/opt-1.3b, facebook/opt-6.7b, facebook/opt-30b, facebook/opt-66b. Input prompts: C4, en, validation set.)

Glancing at the collected numbers, we see that assisted generation can deliver significant latency reductions in diverse settings, but it is not a silver bullet – you should benchmark it before applying it to your use case. We can conclude that assisted generation:

- 🤏 Requires access to an assistant model that is at least an order of magnitude smaller than your model (the bigger the difference, the better);
- 🚀 Gets up to 3x speedups in the presence of INT8 and up to 2x otherwise, when the model fits in the GPU memory;
- 🤯 If you’re playing with models that do not fit in your GPU and are relying on memory offloading, you can see up to 10x speedups;
- 📄 Shines in input-grounded tasks, like automatic speech recognition or summarization.

**Sample with assisted generation**

Greedy decoding is suited for input-grounded tasks (automatic speech recognition, translation, summarization, ...) or factual knowledge-seeking. Open-ended tasks requiring large levels of creativity, such as most uses of a language model as a chatbot, should use sampling instead. Assisted generation is naturally designed for greedy decoding, but that doesn’t mean that you can’t use assisted generation with multinomial sampling!

Drawing samples from a probability distribution for the next token will cause our greedy assistant to fail more often, reducing its latency benefits. However, we can control how sharp the probability distribution for the next tokens is, using the temperature coefficient that’s present in most sampling-based applications. At one extreme, with temperatures close to 0, sampling will approximate greedy decoding, favoring the most likely token. At the other extreme, with the temperature set to values much larger than 1, sampling will be chaotic, drawing from a uniform distribution. Low temperatures are, therefore, more favorable to your assistant model, retaining most of the latency benefits from assisted generation, as we can see below.

#### Suggested labels

{ "key": "assisted-generation", "value": "Text generation with the use of an assistant model for latency reduction" }
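The six-step greedy assisted generation loop described in the quoted post can be sketched with toy "models". This is an illustration under stated assumptions, not the 🤗 Transformers implementation: each model is a plain function that returns the next token for a sequence, the main model's single forward pass is simulated by calling it once per candidate position, and the floor of one candidate per round is an added safeguard.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy next-token function (toy stand-in for a model)


def assisted_generate(model: NextToken, assistant: NextToken, prompt: List[int],
                      max_new_tokens: int, num_candidates: int = 5) -> List[int]:
    seq = list(prompt)
    target_len = len(prompt) + max_new_tokens
    while len(seq) < target_len:
        # Step 1: the assistant greedily proposes candidate tokens.
        candidates: List[int] = []
        for _ in range(num_candidates):
            candidates.append(assistant(seq + candidates))
        # Steps 2-3: the main model's greedy choice at every candidate position,
        # plus one extra position after the last candidate.
        next_tokens = [model(seq + candidates[:i]) for i in range(len(candidates) + 1)]
        # Step 4: count matches left to right; the first mismatch invalidates the rest.
        n_matches = 0
        while n_matches < len(candidates) and candidates[n_matches] == next_tokens[n_matches]:
            n_matches += 1
        # Step 5: keep the matching tokens plus the main model's own next token.
        seq.extend(next_tokens[: n_matches + 1])
        # Step 6: +2 candidates if all matched, -1 otherwise (floor of 1 is an assumption).
        if n_matches == len(candidates):
            num_candidates += 2
        else:
            num_candidates = max(1, num_candidates - 1)
    return seq[:target_len]


def model(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10  # toy main model: counts upward modulo 10


def assistant(seq: List[int]) -> int:
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10  # same, but wrong after a 4


out = assisted_generate(model, assistant, [0], max_new_tokens=12)
```

Because every kept token is one the main model would have chosen itself, the output is identical to plain greedy decoding with the main model; the assistant only changes how many main-model passes are needed.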

383: deepseek-ai/deepseek-coder-5.7bmqa-base · Hugging Face

### Details

Similarity score: 0.87

- [ ] [deepseek-ai/deepseek-coder-5.7bmqa-base · Hugging Face](https://huggingface.co/deepseek-ai/deepseek-coder-5.7bmqa-base)

Deepseek Coder Introduction
----------------------------

Deepseek Coder is a series of code language models, each trained from scratch on 2T tokens with a composition of 87% code and 13% natural language in both English and Chinese. We provide various sizes of the code model, ranging from 1B to 33B versions. Each model is pre-trained on a project-level code corpus with a window size of 16K and an extra fill-in-the-blank task, supporting project-level code completion and infilling. Deepseek Coder achieves state-of-the-art performance among open-source code models on multiple programming languages and various benchmarks.

### Key Features

- **Massive Training Data:** Trained from scratch on 2T tokens, including 87% code and 13% linguistic data in both English and Chinese languages.
- **Highly Flexible & Scalable:** Offered in model sizes of 1.3B, 5.7B, 6.7B, and 33B, enabling users to choose the setup most suitable for their requirements.
- **Superior Model Performance:** State-of-the-art performance among publicly available code models on HumanEval, MultiPL-E, MBPP, DS-1000, and APPS benchmarks.
- **Advanced Code Completion Capabilities:** A window size of 16K and a fill-in-the-blank task, supporting project-level code completion and infilling tasks.

### Model Summary

- **deepseek-coder-5.7bmqa-base:** A 5.7B parameter model with Multi Query Attention, trained on 2 trillion tokens.
- **Home Page:** [DeepSeek](http://deepseek.com)
- **Repository:** [deepseek-ai/deepseek-coder](https://github.com/deepseek-ai/deepseek-coder)
- **Chat With DeepSeek Coder:** [DeepSeek-Coder](https://github.com/deepseek-ai/deepseek-coder/discussions)

### How to Use

This section provides examples of how to use the Deepseek Coder model for code completion, code insertion, and repository-level code completion tasks.

#### Code Completion

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda()

input_text = "#write a quick sort algorithm"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

#### Code Insertion

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda()

input_text = """<|begin|>def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[0]
    left = []
    right = []
<|hole|>
        if arr[i] < pivot:
            left.append(arr[i])
        else:
            right.append(arr[i])
    return quick_sort(left) + [pivot] + quick_sort(right)<|end|>"""
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True)[len(input_text):])
```

#### Repository Level Code Completion

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda()

input_text = """#utils.py
import torch
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

def load_data():
    iris = datasets.load_iris()
    X = iris.data
    y = iris.target

    # Standardize the data
    scaler = StandardScaler()
    X = scaler.fit_transform(X)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Convert numpy data to PyTorch tensors
    X_train = torch.tensor(X_train, dtype=torch.float32)
    X_test = torch.tensor(X_test, dtype=torch.float32)
    y_train = torch.tensor(y_train, dtype=torch.int64)
    y_test = torch.tensor(y_test, dtype=torch.int64)

    return X_train, X_test, y_train, y_test

def evaluate_predictions(y_test, y_pred):
    return accuracy_score(y_test, y_pred)
#model.py
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

class IrisClassifier(nn.Module):
    def __init__(self):
        super(IrisClassifier, self).__init__()

        self.fc = nn.Sequential(
            nn.Linear(4, 16),
            nn.ReLU(),
            nn.Linear(16, 3)
        )

    def forward(self, x):
        return self.fc(x)

    def train_model(self, X_train, y_train, epochs, lr, batch_size):
        criterion = nn.CrossEntropyLoss()
        optimizer = optim.Adam(self.parameters(), lr=lr)

        # Create DataLoader for batches
        dataset = TensorDataset(X_train, y_train)
        dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

        for epoch in range(epochs):
            for batch_X, batch_y in dataloader:
                optimizer.zero_grad()
                outputs = self(batch_X)
                loss = criterion(outputs, batch_y)
                loss.backward()
                optimizer.step()

    def predict(self, X_test):
        with torch.no_grad():
            outputs = self(X_test)
            _, predicted = outputs.max(1)
        return predicted.numpy()
#main.py
from utils import load_data, evaluate_predictions
from model import IrisClassifier as Classifier

def main():
    # Model training and evaluation
"""
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=140)
print(tokenizer.decode(outputs[0]))
```

License
-------

This code repository is licensed under the MIT License. The use of Deepseek Coder models is subject to the Model License. DeepSeek Coder supports commercial use. See the [LICENSE-MODEL](https://github.com/deepseek-ai/deepseek-coder/blob/main/LICENSE-MODEL) for more details.

Contact
-------

If you have any questions, please raise an issue or contact us at [agi\_code@deepseek.com](mailto:agi_code@deepseek.com).

#### Suggested labels

{ "key": "llm-experiments", "value": "Experiments and results related to Large Language Models" }
{ "key": "AI-Chatbots", "value": "Topics related to advanced chatbot platforms integrating multiple AI models" }

625: unsloth/README.md at main · unslothai/unsloth

### Details

Similarity score: 0.87

- [ ] [unsloth/README.md at main · unslothai/unsloth](https://github.com/unslothai/unsloth/blob/main/README.md?plain=1)

# unsloth/README.md at main · unslothai/unsloth

### Finetune Mistral, Gemma, Llama 2-5x faster with 70% less memory!

![](https://i.ibb.co/sJ7RhGG/image-41.png)

## ✨ Finetune for Free

All notebooks are **beginner friendly**! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|------------------|----------------|-------------|------------|
| **Gemma 7b** | [▶️ Start on Colab](https://colab.research.google.com/drive/10NbwlsRChbma1v55m8LAPYG15uQv6HLo?usp=sharing) | 2.4x faster | 58% less |
| **Mistral 7b** | [▶️ Start on Colab](https://colab.research.google.com/drive/1Dyauq4kTZoLewQ1cApceUQVNcnnNTzg_?usp=sharing) | 2.2x faster | 62% less |
| **Llama-2 7b** | [▶️ Start on Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing) | 2.2x faster | 43% less |
| **TinyLlama** | [▶️ Start on Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing) | 3.9x faster | 74% less |
| **CodeLlama 34b** A100 | [▶️ Start on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing) | 1.9x faster | 27% less |
| **Mistral 7b** 1xT4 | [▶️ Start on Kaggle](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook) | 5x faster\* | 62% less |
| **DPO - Zephyr** | [▶️ Start on Colab](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing) | 1.9x faster | 19% less |

- This [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing) is useful for ShareGPT ChatML / Vicuna templates.
- This [text completion notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing) is for raw text. This [DPO notebook](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing) replicates Zephyr.
- \* Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.

## 🦥 Unsloth.ai News

- 📣 [Gemma 7b](https://colab.research.google.com/drive/10NbwlsRChbma1v55m8LAPYG15uQv6HLo?usp=sharing) on 6T tokens now works. And [Gemma 2b notebook](https://colab.research.google.com/drive/15gGm7x_jTm017_Ic8e317tdIpDG53Mtu?usp=sharing)
- 📣 Added [conversational notebooks](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing) and [raw text notebooks](https://colab.research.google.com/drive/1bMOKOBzxQWUIGZBs_B0zm8pimuEnZdfM?usp=sharing)
- 📣 [2x faster inference](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing) added for all our models
- 📣 [DPO support](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing) is now included. [More info](#DPO) on DPO
- 📣 We did a [blog](https://huggingface.co/blog/unsloth-trl) with 🤗Hugging Face and are in their official docs! Check out the [SFT docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth) and [DPO docs](https://huggingface.co/docs/trl/main/en/dpo_trainer#accelerate-dpo-fine-tuning-using-unsloth)
- 📣 [Download models 4x faster](https://huggingface.co/collections/unsloth/) from 🤗Hugging Face. Eg: `unsloth/mistral-7b-bnb-4bit`

## 🔗 Links and Resources

| Type | Links |
| ------------------------------- | --------------------------------------- |
| 📚 **Wiki & FAQ** | [Read Our Wiki](https://github.com/unslothai/unsloth/wiki) |
| 📜 **Documentation** | [Read The Doc](https://github.com/unslothai/unsloth/tree/main#-documentation) |
| 💾 **Installation** | [unsloth/README.md](https://github.com/unslothai/unsloth/tree/main#installation-instructions) |
| **Twitter (aka X)** | [Follow us on X](https://twitter.com/unslothai) |
| 🥇 **Benchmarking** | [Performance Tables](https://github.com/unslothai/unsloth/tree/main#-performance-benchmarking) |
| 🌐 **Released Models** | [Unsloth Releases](https://huggingface.co/unsloth) |
| ✍️ **Blog** | [Read our Blogs](https://unsloth.ai/blog) |

## ⭐ Key Features

- All kernels written in [OpenAI's Triton](https://openai.com/research/triton) language. **Manual backprop engine**.
- **0% loss in accuracy** - no approximation methods - all exact.
- No change of hardware. Supports NVIDIA GPUs since 2018+. Minimum CUDA Capability 7.0 (V100, T4, Titan V, RTX 20, 30, 40x, A100, H100, L40 etc) [Check your GPU!](https://developer.nvidia.com/cuda-gpus) GTX 1070, 1080 works, but is slow.
- Works on **Linux** and **Windows** via WSL.
- Supports 4bit and 16bit QLoRA / LoRA finetuning via [bitsandbytes](https://github.com/TimDettmers/bitsandbytes).
- Open source trains 5x faster - see [Unsloth Pro](https://unsloth.ai/) for **30x faster training**!
- If you trained a model with 🦥Unsloth, you can use this cool sticker!

## 🥇 Performance Benchmarking

- For the full list of **reproducible** benchmarking tables, [go to our website](https://unsloth.ai/blog/mistral-benchmark#Benchmark%20tables)

| 1 A100 40GB | 🤗Hugging Face | Flash Attention | 🦥Unsloth Open Source | 🦥[Unsloth Pro](https://unsloth.ai/pricing) |
|--------------|--------------|-----------------|---------------------|-----------------|
| Alpaca | 1x | 1.04x | 1.98x | **15.64x** |
| LAION Chip2 | 1x | 0.92x | 1.61x | **20.73x** |
| OASST | 1x | 1.19x | 2.17x | **14.83x** |
| Slim Orca | 1x | 1.18x | 2.22x | **14.82x** |

- Benchmarking table below was conducted by [🤗Hugging Face](https://huggingface.co/blog/unsloth-trl).

| Free Colab T4 | Dataset | 🤗Hugging Face | Pytorch 2.1.1 | 🦥Unsloth | 🦥 VRAM reduction |
| --- | --- | --- | --- | --- | --- |
| Llama-2 7b | OASST | 1x | 1.19x | 1.95x | -43.3% |
| Mistral 7b | Alpaca | 1x | 1.07x | 1.56x | -13.7% |
| Tiny Llama 1.1b | Alpaca | 1x | 2.06x | 3.87x | -73.8% |
| DPO with Zephyr | Ultra Chat | 1x | 1.09x | 1.55x | -18.6% |

![](https://i.ibb.co/sJ7RhGG/image-41.png)

[View on GitHub](https://github.com/unslothai/unsloth/blob/main/README.md?plain=1)

#### Suggested labels

515: neulab/external-knowledge-codegen: Code and data for ACL20 paper "Incorporating External Knowledge through Pre-training for Natural Language to Code Generation"

### DetailsSimilarity score: 0.87 - [ ] [neulab/external-knowledge-codegen: Code and data for ACL20 paper "Incorporating External Knowledge through Pre-training for Natural Language to Code Generation"](https://github.com/neulab/external-knowledge-codegen) # **TITLE**: neulab/external-knowledge-codegen: Code and data for ACL20 paper "Incorporating External Knowledge through Pre-training for Natural Language to Code Generation" ## **DESCRIPTION**: Incorporating External Knowledge through Pre-training for Natural Language to Code Generation This repository contains code and resources for the ACL20 paper "Incorporating External Knowledge through Pre-training for Natural Language to Code Generation". Some of the code is borrowed from the awesome TranX semantic parsing software. If you are interested in the underlying neural code generation model used in this paper, please have a look! ## TL;DR Open-domain code generation aims to generate code in a general-purpose programming language (such as Python) from natural language (NL) intents. Motivated by the intuition that developers usually retrieve resources on the web when writing code, we explore the effectiveness of incorporating two varieties of external knowledge into NL-to-code generation: automatically mined NL-code pairs from the online programming QA forum StackOverflow and programming language API documentation. Our evaluations show that combining the two sources with data augmentation and retrieval-based data re-sampling improves the current state-of-the-art by up to 2.2% absolute BLEU score on the code generation testbed CoNaLa. If you want to try out our strong pre-trained English-to-Python generation models, check out this section. Our approach: incorporating external knowledge by data re-sampling, pre-training and fine-tuning. ## Examples from Python API documentation and pre-processed code snippets, including class constructors, methods, and top-level functions. 
We use red, blue, and green to denote required, optional positional, and optional keyword arguments respectively. ## Performance comparison of different strategies to incorporate external knowledge. ## Prepare Environment We recommend using conda to manage the environment: ``` conda env create -n "tranx" -f config/conda_environment.yml conda activate tranx ``` Some key dependencies and their versions are: ``` python=3.7 pytorch=1.1.0 astor=0.7.1 (This is very important) ``` ## Getting and Preprocessing External Resources One of the most important steps presented in the paper is the external knowledge/resources used for pre-training the code generation model. We will show how we obtain the StackOverflow mined data as well as the Python API documentation and the preprocessing steps. ### Mined StackOverflow Pairs Download conala-corpus-v1.1.zip and unzip the content into data/conala/. Make sure you have conala-(mined|train|test).jsonl in that directory. ### Python Standard Library API Documentation We provide our processed API documents into our data format which is the same as the aforementioned Conala dataset. You can find the preprocessed NL-code pairs at apidocs/python-docs.jsonl. However, if you prefer to process the API documents from scratch, you need to first download the official Python source code from here, in this paper, we use the documentation from Python 3.7.5. extract everything into apidocs/Python-3.7.5. Then cd into that directory, and follow the instructions to build the HTML version of the Python documentation. Basically it's make venv followed by make html. After this, please check apidocs/Python-3.7.5/Doc/build/html/library directory to see if the generated HTML library documentations are there. Yay! To actually parse all the documentation and output the same NL-code pair format as the model supports, please run apidocs/doc_parser.py, which would generate apidocs/python-docs.jsonl. 
### Resampling API Knowledge

As we found in the paper, external knowledge from different sources has different characteristics. NL-code pairs automatically mined from StackOverflow are good representatives of the questions that developers may ask, but are inevitably noisy. NL-code pairs from API documentation are clean, but there may be a topical distribution shift from real questions asked by developers. We show that resampling the API documentation is crucial to minimize the distribution gap and improve pretraining performance.

You can find the resampled API corpus used in the paper's experiments in apidocs/processed:

- direct contains the corpus resampled via "direct retrieval".
- distsmpl contains the corpus resampled via "distribution estimation".

Both are compared in the experiments, and distsmpl has better performance. The filenames of the resampled corpus represent different strategies:

- snippet or intent means retrieved by code snippet or NL intent.
- tempX means the temperature parameter is X.
- topK means the top K retrieval results are used for resampling.

If you are interested in performing the resampling step on your own, you will need to load python-docs.jsonl into an ElasticSearch instance that provides retrieval functionality. Check out apidocs/index_es.py for indexing the API documents, and apidocs/retrieve.py for the actual retrieval and resampling.

## Pretraining and Finetuning Underlying Code Generation Model

For this part, our underlying model is TranX for code generation, and the code is modified and integrated in this repo. Our paper's training strategy is basically 3-step: pretrain on mined + API data, finetune on CoNaLa dataset, and rerank.

Preprocess all the data into a binarized dataset and vocab. All related operations are in datasets/conala/dataset.py. For our best performing experiment, which is mined (top 100K) + API (dist.
resampled w/ code, k = 1 and t = 2), run the following to create the dataset:

```
mkdir data/conala
python datasets/conala/dataset.py --pretrain path/to/conala-mined.jsonl --topk 100000 --include_api apidocs/processed/distsmpl/snippet_15k/goldmine_snippet_count100k_topk1_temp2.jsonl
```

By default things should be preprocessed and saved to data/conala. Check out those .bin files.

### Pretraining

Check out the script scripts/conala/train_retrieved_distsmpl.sh for our best performing strategy. Under the same directory you can find scripts for the other strategies compared in the experiments as well. Basically, you have to specify the number of mined pairs (50k or 100k) and the retrieval method (snippet_count100k_topk1_temp2, etc.):

```
scripts/conala/train_retrieved_distsmpl.sh 100000 snippet_count100k_topk1_temp2
```

If anything goes wrong, make sure you have already preprocessed the corresponding dataset/strategy in the previous step. The best model will be saved to saved_models/conala.

### Finetuning

Check out the script scripts/conala/finetune_retrieved_distsmpl.sh for the best performing finetuning on the CoNaLa training dataset (clean). The parameters are similar to those above: the number of mined pairs (50k or 100k), the retrieval method (snippet_count100k_topk1_temp2, etc.), and additionally the path of the previously pretrained model:

```
scripts/conala/finetune_retrieved_distsmpl.sh 100000 snippet_count100k_topk1_temp2 saved_models/conala/retdistsmpl.dr0.3.lr0.001.lr_de0.5.lr_da15.beam15.vocab.src_freq3.code_freq3.mined_100000.goldmine_snippet_count100k_topk1_temp2.bin.pre_100000_goldmine_snippet_count100k_topk1_temp2.bin.seed0.bin
```

For other strategies, modify accordingly and refer to the other finetune_xxx.sh scripts. The best model will also be saved to saved_models/conala.

### Reranking

Reranking is not the core part of this paper; please refer to this branch and the paper. This is an orthogonal post-processing step.
In general, you will first need to obtain the decoded hypothesis list after beam-search of the train/dev/test set in CoNaLa, and train the reranking weight on it. To obtain decodes, run scripts/conala/decode.sh . The outputs will be saved at decodes/conala. Then, train the reranker by running scripts/conala/rerank.sh .dev.bin.decode/.test.decode

For easy use,

#### Suggested labels
#### null