irthomasthomas / undecidability


OpenCodeInterpreter/README.md at main · OpenCodeInterpreter/OpenCodeInterpreter #658

Open irthomasthomas opened 8 months ago

irthomasthomas commented 8 months ago

OpenCodeInterpreter/README.md at main · OpenCodeInterpreter/OpenCodeInterpreter

Description

OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement

OpenCodeInterpreter

[🏠Homepage] | [🛠️Code]


🌟 Upcoming Features

🔔News

🛠️[2024-02-28]: We have open-sourced the Demo Local Deployment Code with a Setup Guide.

✨[2024-02-26]: We have open-sourced the OpenCodeInterpreter-DS-1.3b Model.

📘[2024-02-26]: We have open-sourced the CodeFeedback-Filtered-Instruction Dataset.

🚀[2024-02-23]: We have open-sourced the datasets used in our project named Code-Feedback.

🔥[2024-02-19]: We have open-sourced all models in the OpenCodeInterpreter series! We welcome everyone to try out our models and look forward to your participation! 😆

Introduction

OpenCodeInterpreter is a suite of open-source code generation systems aimed at bridging the gap between large language models and sophisticated proprietary systems like the GPT-4 Code Interpreter. It significantly enhances code generation capabilities by integrating execution and iterative refinement functionalities.
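
The execute-and-refine idea can be sketched in a few lines (a minimal illustration only, not the project's actual implementation; the `generate` callback and the prompt wording are placeholders):

```python
import os
import subprocess
import tempfile

def run_python(code: str, timeout: int = 10) -> tuple[bool, str]:
    """Execute a candidate snippet in a subprocess and capture its output."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, text=True, timeout=timeout)
        return result.returncode == 0, result.stdout + result.stderr
    except subprocess.TimeoutExpired:
        return False, "execution timed out"
    finally:
        os.remove(path)

def refine_until_it_runs(task: str, generate, max_turns: int = 3) -> str:
    """Generate code, execute it, and feed any failure back as the next turn.

    `generate(prompt)` is a stand-in for whatever model call you use."""
    prompt = task
    code = generate(prompt)
    for _ in range(max_turns):
        ok, feedback = run_python(code)
        if ok:
            break
        prompt = f"{task}\n\nPrevious attempt:\n{code}\n\nExecution feedback:\n{feedback}\n\nPlease fix the code."
        code = generate(prompt)
    return code
```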

Models

All models within the OpenCodeInterpreter series have been open-sourced on Hugging Face. You can access our models via the following link: OpenCodeInterpreter Models.
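
For a quick start, the models load like any other causal LM with `transformers`. The snippet below is a minimal sketch; the repository id and the use of a chat template are assumptions — check the model collection linked above for the exact names and prompt format.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Repository id is an assumption for illustration; see the OpenCodeInterpreter
# model collection on Hugging Face for the exact published names.
model_id = "m-a-p/OpenCodeInterpreter-DS-6.7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."}]
# Assumes the tokenizer ships a chat template; adjust to the model card's prompt format if not.
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```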

Data Collection

Supported by Code-Feedback, a dataset featuring 68K multi-turn interactions, OpenCodeInterpreter incorporates execution and human feedback for dynamic code refinement. For additional insights into data collection procedures, please consult the readme provided under Data Collection.
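
To inspect the data, the dataset can be pulled with the `datasets` library. This is a minimal sketch; the dataset id (`m-a-p/Code-Feedback`) and the `messages` field name are assumptions — see the Data Collection README for the published details.

```python
from datasets import load_dataset

# Dataset id and field names are assumptions for illustration; check the
# Data Collection README / Hugging Face Hub for the exact published ones.
ds = load_dataset("m-a-p/Code-Feedback", split="train")
print(ds)

# Peek at one multi-turn interaction.
example = ds[0]
for turn in example.get("messages", []):
    print(turn.get("role"), "->", str(turn.get("content"))[:120])
```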

Evaluation

Our evaluation framework primarily utilizes HumanEval and MBPP, alongside their extended versions HumanEval+ and MBPP+, leveraging the EvalPlus framework for a more comprehensive assessment. For specific evaluation methodologies, please refer to the Evaluation README.
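
As a rough sketch of how an EvalPlus-based run fits together (the `generate_one_completion` helper is a placeholder, and the exact sample-file schema and CLI flags may differ by EvalPlus version): generate one solution per task, write a JSONL file, then let EvalPlus score it.

```python
from evalplus.data import get_human_eval_plus, write_jsonl

def generate_one_completion(prompt: str) -> str:
    # Placeholder: call your OpenCodeInterpreter model here and return code.
    raise NotImplementedError

problems = get_human_eval_plus()
samples = [
    {"task_id": task_id, "solution": generate_one_completion(problem["prompt"])}
    for task_id, problem in problems.items()
]
write_jsonl("samples.jsonl", samples)

# Then score the samples, e.g. (exact flags depend on your EvalPlus version):
#   evalplus.evaluate --dataset humaneval --samples samples.jsonl
```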

Contact

If you have any inquiries, please feel free to raise an issue or reach out to us via email at: xiangyue.work@gmail.com, zhengtianyu0428@gmail.com. We're here to assist you!

URL

Suggested labels

{'label-name': 'frameworks', 'label-description': 'Frameworks and tools used for evaluation and assessment.', 'gh-repo': 'OpenCodeInterpreter/OpenCodeInterpreter', 'confidence': 58.17}

irthomasthomas commented 8 months ago

Related issues

498: CodeGPTPlus/deepseek-coder-1.3b-typescript · Hugging Face

### Details

Similarity score: 0.88

- [ ] [CodeGPTPlus/deepseek-coder-1.3b-typescript · Hugging Face](https://huggingface.co/CodeGPTPlus/deepseek-coder-1.3b-typescript)

# CodeGPTPlus/deepseek-coder-1.3b-typescript

This is a fine-tuned model by the CodeGPT team, specifically crafted for generating expert code in TypeScript. It is fine-tuned from `deepseek-ai/deepseek-coder-1.3b-base` with a dataset of 0.5B tokens, making it an excellent choice for precise and efficient TypeScript code generation. The model uses a 16K window size and an additional fill-in-the-middle task for project-level code completion.

## How to Use

This model is for completion purposes only. Here are some examples of how to use the model:

### Running the model on a GPU

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("CodeGPTPlus/deepseek-coder-1.3b-typescript", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("CodeGPTPlus/deepseek-coder-1.3b-typescript", trust_remote_code=True).cuda()

input_text = """<|fim begin|>function quickSort(arr: number[]): number[] {
  if (arr.length <= 1) {
    return arr;
  }
  const pivot = arr[0];
  const left = [];
  const right = [];
<|fim hole|>
  return [...quickSort(left), pivot, ...quickSort(right)];
}<|fim end|>"""

inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Running with Ollama

- Model: [https://ollama.ai/codegpt/deepseek-coder-1.3b-typescript](https://ollama.ai/codegpt/deepseek-coder-1.3b-typescript)
- Command: `ollama run codegpt/deepseek-coder-1.3b-typescript`

### Running with Ollama and CodeGPT Autocomplete in VSCode

- Documentation: [https://docs.codegpt.co/docs/tutorial-features/code_autocompletion](https://docs.codegpt.co/docs/tutorial-features/code_autocompletion)
- Select "Ollama - codegpt/deepseek-coder-1.3b-typescript" in the autocomplete model selector.

### Fill In the Middle (FIM)

```python
<|fim begin|>function quickSort(arr: number[]): number[] {
  if (arr.length <= 1) {
    return arr;
  }
  const pivot = arr[0];
  const left = [];
  const right = [];
<|fim hole|>
  return [...quickSort(left), pivot, ...quickSort(right)];
}<|fim end|>
```

### Training Procedure

The model was trained using the following hyperparameters:

- learning_rate: 2e-05
- train_batch_size: 20
- eval_batch_size: 20
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 40
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-06
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 261
- num_epochs: 1

For more information, visit the [model page](https://huggingface.co/CodeGPTPlus/deepseek-coder-1.3b-typescript).

#### Suggested labels

{ "label-name": "TypeScript-Code-Generation", "description": "Model for generating TypeScript code", "repo": "CodeGPTPlus/deepseek-coder-1.3b-typescript", "confidence": 70.59 }

392: llm-vscode - Visual Studio Marketplace

### Details

Similarity score: 0.88

- [ ] [llm-vscode - Visual Studio Marketplace](https://marketplace.visualstudio.com/items?itemName=HuggingFace.huggingface-vscode)

LLM Powered Development for VSCode
==================================

We are excited to announce the release of `llm-vscode`, a VSCode extension for all things LLM! This extension uses [`llm-ls`](https://github.com/huggingface/llm-ls) as its backend and includes extensions for `neovim`, `jupyter`, `intellij`, and the previously known `huggingface-vscode`.

**Note**: When using the Inference API, you may encounter some limitations. To avoid getting rate limited in the free tier, consider subscribing to the [PRO plan](https://huggingface.co/pricing#pro).

Features
--------

- **Code Completion**: This plugin supports "ghost-text" code completion, à la Copilot.
- **Model Selection**: Requests for code generation are made via an HTTP request. You can use the Hugging Face Inference API or your own HTTP endpoint, provided it adheres to the API specified [here](https://github.com/huggingface/llm#api-reference) or [here](https://github.com/huggingface/llm/blob/main/docs/llm-ls.md). The list of officially supported models is located in the [config template section](https://github.com/huggingface/llm-vscode#configuration).
- **Context Window**: The prompt sent to the model will always be sized to fit within the context window, with the number of tokens determined using tokenizers.
- **Code Attribution**: Hit `Cmd+shift+a` to check if the generated code is in The Stack. This is a rapid first-pass attribution check using [stack.dataportraits.org](https://stack.dataportraits.org/). We check for sequences of at least 50 characters that match a Bloom filter. This means false positives are possible, and long enough surrounding context is necessary (see the [paper](https://arxiv.org/abs/2107.03374) for details on n-gram striding and sequence length).

Installation
------------

Install like any other vscode extension. By default, this extension uses `bigcode/starcoder` & Hugging Face Inference API for the inference.

HF API Token
------------

You can supply your HF API token (`hf.co/settings/token`) with the following steps:

1. `Cmd/Ctrl+Shift+P` to open VSCode command palette
2. Type: `Llm: Login`

If you previously logged in with `huggingface-cli login` on your system, the extension will read the token from disk.

Configuration
-------------

You can check the full list of configuration settings by opening your settings page (`cmd+,`) and typing `Llm`.

Endpoint
--------

You can configure the endpoint to which requests will be sent. The request body will look like:

```json
{
  "inputs": "{start token}import numpy as np\nimport scipy as sp\n{end token}def hello_world():\n print("Hello world"){middle token}",
  "parameters": {
    "max_new_tokens": 256
  }
}
```

Suggestion Behavior
-------------------

You can tune the way the suggestions behave:

- `llm.enableAutoSuggest` lets you choose to enable or disable "suggest-as-you-type" suggestions.
- `llm.documentFilter` lets you enable suggestions only on specific files that match the pattern matching syntax you will provide. The object must be of type `DocumentFilter | DocumentFilter[]`.
- `llm-vscode` sets two keybindings:
  - You can trigger suggestions with `Cmd+shift+l` by default, which corresponds to the `editor.action.inlineSuggest.trigger` command.
  - Code attribution is set to `Cmd+shift+a` by default, which corresponds to the `llm.attribution` command.

For more information, see the [documentation](https://github.com/huggingface/llm-vscode#keybindings).

LLM-LS
------

By default, `llm-ls` is bundled with the extension. When developing locally or if you built your own binary because your platform is not supported, you can set the `llm.lsp.binaryPath` setting to the path of the binary.

Tokenizer
---------

`llm-ls` uses tokenizers to make sure the prompt fits the context window. To configure it, you have a few options:

- No tokenization; `llm-ls` will count the number of characters instead.
- From a local file on your disk.
- From a Hugging Face repository; `llm-ls` will attempt to download `tokenizer.json` at the root of the repository.
- From an HTTP endpoint; `llm-ls` will attempt to download a file via an HTTP GET request.

For more information, see the [documentation](https://github.com/huggingface/llm-vscode#tokenizer).

Code Llama
----------

To test the `Code Llama 13B` model:

1. Make sure you have the latest version of this extension.
2. Make sure you have supplied an HF API token.
3. Open VSCode Settings (`cmd+,`) & type: `Llm: Config Template`.
4. From the dropdown menu, choose `codellama/CodeLlama-13b-hf`.

Phind and WizardCoder
---------------------

To test `Phind/Phind-CodeLlama-34B-v2` and/or `WizardLM/WizardCoder-Python-34B-V1.0`:

1. Make sure you have the latest version of this extension.
2. Make sure you have supplied an HF API token.
3. Open VSCode Settings (`cmd+,`) & type: `Llm: Config Template`.
4. From the dropdown menu, choose `Phind/Phind-CodeLlama-34B-v2` or `WizardLM/WizardCoder-Python-34B-V1.0`.

For more information, see the [documentation](https://github.com/huggingface/llm-vscode#phind-and-wizardcoder).

Developing
----------

To contribute to the development of this extension, follow these steps:

1. Clone this repo: `git clone https://github.com/huggingface/llm-vscode`
2. Install dependencies: `cd llm-vscode && npm i`
3. Open VSCode and run the extension with the `Launch Extension` command.

Community
---------

Join our community to contribute to other related projects:

- [huggingface-vscode-endpoint-server](https://github.com/huggingface/huggingface-vscode-endpoint-server): Custom code generation endpoint for this repository.
- [llm-vscode-inference-server](https://github.com/huggingface/llm-vscode-inference-server): An endpoint server for efficiently serving quantized open-source LLMs for code.

For more information, see the [documentation](https://github.com/huggingface/llm-vscode#community).

#### Suggested labels

{ "key": "llm-inference-engines", "value": "Software and tools for running inference on Large Language Models" }
{ "key": "llama", "value": "Models and tools related to Large Language Models" }

393: llm-vscode - Visual Studio Marketplace

### Details

Similarity score: 0.88

- [ ] [llm-vscode - Visual Studio Marketplace](https://marketplace.visualstudio.com/items?itemName=HuggingFace.huggingface-vscode)

#### LLM-powered Development for VSCode

`llm-vscode` is a VSCode extension for all things LLM, built on top of the `llm-ls` backend. We also have extensions for `neovim`, `jupyter`, `intellij`, and previously `huggingface-vscode`.

**Note:** When using the Inference API, you may encounter limitations. Consider subscribing to the PRO plan to avoid rate limiting on the free tier. [Hugging Face Pricing](https://huggingface.co/pricing#pro)

#### 💻 Features

- **Code Completion:** Supports "ghost-text" code completion, à la Copilot.
- **Model Selection:** Requests for code generation are made via an HTTP request. You can use the Hugging Face Inference API or your own HTTP endpoint, as long as it adheres to the API specified [here]() or [here](). The list of officially supported models can be found in the [config template section]().
- **Context Window:** The prompt sent to the model will always fit within the context window, using tokenizers to determine the number of tokens.
- **Code Attribution:** Hit `Cmd+shift+a` to check if the generated code is in The Stack. This is a rapid first-pass attribution check using [stack.dataportraits.org](http://stack.dataportraits.org). We check for sequences of at least 50 characters that match a Bloom filter, which means false positives are possible. A complete second pass can be done using the dedicated Stack search tool, which is a full dataset index.

#### 🚀 Installation

Install `llm-vscode` like any other VSCode extension. By default, this extension uses `bigcode/starcoder` & Hugging Face Inference API for inference.

#### 🔑 HF API Token

Supply your HF API token (`hf.co/settings/token`) with this command:

- Open VSCode command palette `Cmd/Ctrl+Shift+P`
- Type: `Llm: Login`

If you previously logged in with `huggingface-cli login` on your system, the extension will read the token from disk.

#### ⚙ Configuration

Check the full list of configuration settings by opening your settings page `(cmd+,)` and typing `Llm`.

#### Suggested labels

{ "key": "llm-vscode", "value": "VSCode extension for LLM powered development with Hugging Face Inference API" }

383: deepseek-ai/deepseek-coder-5.7bmqa-base · Hugging Face

### Details

Similarity score: 0.87

- [ ] [deepseek-ai/deepseek-coder-5.7bmqa-base · Hugging Face](https://huggingface.co/deepseek-ai/deepseek-coder-5.7bmqa-base)

Deepseek Coder Introduction
---------------------------

Deepseek Coder is a series of code language models, each trained from scratch on 2T tokens with a composition of 87% code and 13% natural language in both English and Chinese. We provide various sizes of the code model, ranging from 1B to 33B versions. Each model is pre-trained on a project-level code corpus with a window size of 16K and an extra fill-in-the-blank task, supporting project-level code completion and infilling. Deepseek Coder achieves state-of-the-art performance among open-source code models on multiple programming languages and various benchmarks.

### Key Features

- **Massive Training Data:** Trained from scratch on 2T tokens, including 87% code and 13% linguistic data in both English and Chinese languages.
- **Highly Flexible & Scalable:** Offered in model sizes of 1.3B, 5.7B, 6.7B, and 33B, enabling users to choose the setup most suitable for their requirements.
- **Superior Model Performance:** State-of-the-art performance among publicly available code models on HumanEval, MultiPL-E, MBPP, DS-1000, and APPS benchmarks.
- **Advanced Code Completion Capabilities:** A window size of 16K and a fill-in-the-blank task, supporting project-level code completion and infilling tasks.

### Model Summary

- **deepseek-coder-5.7bmqa-base:** A 5.7B parameter model with Multi Query Attention, trained on 2 trillion tokens.
- **Home Page:** [DeepSeek](http://deepseek.com)
- **Repository:** [deepseek-ai/deepseek-coder](https://github.com/deepseek-ai/deepseek-coder)
- **Chat With DeepSeek Coder:** [DeepSeek-Coder](https://github.com/deepseek-ai/deepseek-coder/discussions)

### How to Use

This section provides examples of how to use the Deepseek Coder model for code completion, code insertion, and repository-level code completion tasks.

#### Code Completion

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda()

input_text = "#write a quick sort algorithm"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

#### Code Insertion

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda()

input_text = """<|begin|>def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[0]
    left = []
    right = []
<|hole|>
        if arr[i] < pivot:
            left.append(arr[i])
        else:
            right.append(arr[i])
    return quick_sort(left) + [pivot] + quick_sort(right)<|end|>"""

inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True)[len(input_text):])
```

#### Repository Level Code Completion

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda()

input_text = """#utils.py
import torch
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

def load_data():
    iris = datasets.load_iris()
    X = iris.data
    y = iris.target

    # Standardize the data
    scaler = StandardScaler()
    X = scaler.fit_transform(X)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Convert numpy data to PyTorch tensors
    X_train = torch.tensor(X_train, dtype=torch.float32)
    X_test = torch.tensor(X_test, dtype=torch.float32)
    y_train = torch.tensor(y_train, dtype=torch.int64)
    y_test = torch.tensor(y_test, dtype=torch.int64)

    return X_train, X_test, y_train, y_test

def evaluate_predictions(y_test, y_pred):
    return accuracy_score(y_test, y_pred)

#model.py
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

class IrisClassifier(nn.Module):
    def __init__(self):
        super(IrisClassifier, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(4, 16),
            nn.ReLU(),
            nn.Linear(16, 3)
        )

    def forward(self, x):
        return self.fc(x)

    def train_model(self, X_train, y_train, epochs, lr, batch_size):
        criterion = nn.CrossEntropyLoss()
        optimizer = optim.Adam(self.parameters(), lr=lr)

        # Create DataLoader for batches
        dataset = TensorDataset(X_train, y_train)
        dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

        for epoch in range(epochs):
            for batch_X, batch_y in dataloader:
                optimizer.zero_grad()
                outputs = self(batch_X)
                loss = criterion(outputs, batch_y)
                loss.backward()
                optimizer.step()

    def predict(self, X_test):
        with torch.no_grad():
            outputs = self(X_test)
            _, predicted = outputs.max(1)
        return predicted.numpy()

#main.py
from utils import load_data, evaluate_predictions
from model import IrisClassifier as Classifier

def main():
    # Model training and evaluation
"""
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=140)
print(tokenizer.decode(outputs[0]))
```

License
-------

This code repository is licensed under the MIT License. The use of Deepseek Coder models is subject to the Model License. DeepSeek Coder supports commercial use. See the [LICENSE-MODEL](https://github.com/deepseek-ai/deepseek-coder/blob/main/LICENSE-MODEL) for more details.

Contact
-------

If you have any questions, please raise an issue or contact us at [agi_code@deepseek.com](mailto:agi_code@deepseek.com).

#### Suggested labels

{ "key": "llm-experiments", "value": "Experiments and results related to Large Language Models" }
{ "key": "AI-Chatbots", "value": "Topics related to advanced chatbot platforms integrating multiple AI models" }

189: deepseek-coder-6.7b-instruct-8.0bpw-h8-exl2-2 · Hugging Face

### Details

Similarity score: 0.86

- [ ] I cannot get this to output anything but gibberish.
- [x] [LoneStriker/deepseek-coder-6.7b-instruct-8.0bpw-h8-exl2-2 · Hugging Face](https://huggingface.co/LoneStriker/deepseek-coder-6.7b-instruct-8.0bpw-h8-exl2-2)

1. Introduction of Deepseek Coder

Deepseek Coder is composed of a series of code language models, each trained from scratch on 2T tokens, with a composition of 87% code and 13% natural language in both English and Chinese. We provide various sizes of the code model, ranging from 1B to 33B versions. Each model is pre-trained on a project-level code corpus by employing a window size of 16K and an extra fill-in-the-blank task, to support project-level code completion and infilling. For coding capabilities, Deepseek Coder achieves state-of-the-art performance among open-source code models on multiple programming languages and various benchmarks.

515: neulab/external-knowledge-codegen: Code and data for ACL20 paper "Incorporating External Knowledge through Pre-training for Natural Language to Code Generation"

### Details

Similarity score: 0.86

- [ ] [neulab/external-knowledge-codegen: Code and data for ACL20 paper "Incorporating External Knowledge through Pre-training for Natural Language to Code Generation"](https://github.com/neulab/external-knowledge-codegen)

# **TITLE**: neulab/external-knowledge-codegen: Code and data for ACL20 paper "Incorporating External Knowledge through Pre-training for Natural Language to Code Generation"

## **DESCRIPTION**: Incorporating External Knowledge through Pre-training for Natural Language to Code Generation

This repository contains code and resources for the ACL20 paper "Incorporating External Knowledge through Pre-training for Natural Language to Code Generation". Some of the code is borrowed from the awesome TranX semantic parsing software. If you are interested in the underlying neural code generation model used in this paper, please have a look!

## TL;DR

Open-domain code generation aims to generate code in a general-purpose programming language (such as Python) from natural language (NL) intents. Motivated by the intuition that developers usually retrieve resources on the web when writing code, we explore the effectiveness of incorporating two varieties of external knowledge into NL-to-code generation: automatically mined NL-code pairs from the online programming QA forum StackOverflow and programming language API documentation. Our evaluations show that combining the two sources with data augmentation and retrieval-based data re-sampling improves the current state-of-the-art by up to 2.2% absolute BLEU score on the code generation testbed CoNaLa. If you want to try out our strong pre-trained English-to-Python generation models, check out this section.

Our approach: incorporating external knowledge by data re-sampling, pre-training and fine-tuning.

Examples from Python API documentation and pre-processed code snippets, including class constructors, methods, and top-level functions. We use red, blue, and green to denote required, optional positional, and optional keyword arguments respectively.

Performance comparison of different strategies to incorporate external knowledge.

## Prepare Environment

We recommend using conda to manage the environment:

```
conda env create -n "tranx" -f config/conda_environment.yml
conda activate tranx
```

Some key dependencies and their versions are:

```
python=3.7
pytorch=1.1.0
astor=0.7.1 (This is very important)
```

## Getting and Preprocessing External Resources

One of the most important steps presented in the paper is the external knowledge/resources used for pre-training the code generation model. We will show how we obtain the StackOverflow mined data as well as the Python API documentation and the preprocessing steps.

### Mined StackOverflow Pairs

Download conala-corpus-v1.1.zip and unzip the content into data/conala/. Make sure you have conala-(mined|train|test).jsonl in that directory.

### Python Standard Library API Documentation

We provide our processed API documents in our data format, which is the same as the aforementioned CoNaLa dataset. You can find the preprocessed NL-code pairs at apidocs/python-docs.jsonl. However, if you prefer to process the API documents from scratch, you need to first download the official Python source code from here; in this paper, we use the documentation from Python 3.7.5. Extract everything into apidocs/Python-3.7.5, then cd into that directory and follow the instructions to build the HTML version of the Python documentation. Basically it's `make venv` followed by `make html`. After this, please check the apidocs/Python-3.7.5/Doc/build/html/library directory to see if the generated HTML library documentation is there. Yay! To actually parse all the documentation and output the same NL-code pair format as the model supports, please run apidocs/doc_parser.py, which generates apidocs/python-docs.jsonl.

### Resampling API Knowledge

As we found in the paper, external knowledge from different sources has different characteristics. NL-code pairs automatically mined from StackOverflow are good representatives of the questions that developers may ask, but are inevitably noisy. NL-code pairs from API documentation are clean, but there may be a topical distribution shift from real questions asked by developers. We show that resampling the API documentation is crucial to minimize the distribution gap and improve pretraining performance.

You can find the resampled API corpus as used in the experiments in the paper in apidocs/processed. direct contains the corpus resampled via "direct retrieval"; distsmpl contains the corpus resampled via "distribution estimation". Both are compared in the experiments, and distsmpl has better performance. The filenames of the resampled corpus represent different strategies: snippet or intent means retrieved by code snippet or NL intent, tempX means the temperature parameter is X, and topK means the top K retrieval results are used for resampling.

If you are interested in performing the resampling step on your own, you will need to load python-docs.jsonl into an ElasticSearch instance that provides retrieval functionality. Check out apidocs/index_es.py for indexing the API documents, and apidocs/retrieve.py for actual retrieval and resampling.

## Pretraining and Finetuning the Underlying Code Generation Model

For this part, our underlying model is TranX for code generation, and the code is modified and integrated in this repo. Our paper's training strategy is basically 3-step: pretrain on mined + API data, finetune on the CoNaLa dataset, and rerank.

Preprocess all the data into a binarized dataset and vocab. All related operations are in datasets/conala/dataset.py. For our best performing experiment, which is mined (top 100K) + API (dist. resampled w/ code, k = 1 and t = 2), run the following to create the dataset:

```
mkdir data/conala
python datasets/conala/dataset.py --pretrain path/to/conala-mined.jsonl --topk 100000 --include_api apidocs/processed/distsmpl/snippet_15k/goldmine_snippet_count100k_topk1_temp2.jsonl
```

By default things should be preprocessed and saved to data/conala. Check out those .bin files.

### Pretraining

Check out the script scripts/conala/train_retrieved_distsmpl.sh for our best performing strategy. Under the directory you can find scripts for other strategies compared in the experiments as well. Basically, you have to specify the number of mined pairs (50k or 100k) and the retrieval method (snippet_count100k_topk1_temp2, etc.):

```
scripts/conala/train_retrieved_distsmpl.sh 100000 snippet_count100k_topk1_temp2
```

If anything goes wrong, make sure you have already preprocessed the corresponding dataset/strategy in the previous step. The best model will be saved to saved_models/conala.

### Finetuning

Check out the script scripts/conala/finetune_retrieved_distsmpl.sh for the best performing finetuning on the CoNaLa training dataset (clean). The parameters are similar to the above: the number of mined pairs (50k or 100k), the retrieval method (snippet_count100k_topk1_temp2, etc.), and additionally, the previous pretrained model path:

```
scripts/conala/finetune_retrieved_distsmpl.sh 100000 snippet_count100k_topk1_temp2 saved_models/conala/retdistsmpl.dr0.3.lr0.001.lr_de0.5.lr_da15.beam15.vocab.src_freq3.code_freq3.mined_100000.goldmine_snippet_count100k_topk1_temp2.bin.pre_100000_goldmine_snippet_count100k_topk1_temp2.bin.seed0.bin
```

For other strategies, modify accordingly and refer to the other finetune_xxx.sh scripts. The best model will also be saved to saved_models/conala.

### Reranking

Reranking is not the core part of this paper; please refer to this branch and the paper. This is an orthogonal post-processing step. In general, you will first need to obtain the decoded hypothesis list after beam search of the train/dev/test set in CoNaLa, and train the reranking weight on it. To obtain decodes, run scripts/conala/decode.sh . The outputs will be saved at decodes/conala. Then, train the reranker via scripts/conala/rerank.sh .dev.bin.decode/.test.decode For easy use,

#### Suggested labels

null