Setup • Dataset • Models • Training & Evaluation • Benchmarks • License • Citation
There was a major bug in the AVATAR dataset, as raised in this issue. While crawling data from different sources, newlines were dropped in many examples, and in the Python data indentation was missing as well. As a result, those programs could not be parsed. We re-crawled the data and ensured that every program we store is parseable. The bug has been fixed, so you can continue using the dataset seamlessly.
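To make the failure mode concrete, here is a minimal sketch of a parseability check for Python programs. It is not the check used to build AVATAR (that script is not shown here). It assumes Python's standard `ast` module and the grammar of whichever interpreter runs it, and `is_parseable_python` is a hypothetical helper name.

```python
import ast


def is_parseable_python(source: str) -> bool:
    """Return True if `source` parses under the running interpreter's grammar.

    Hypothetical helper for illustration; not part of the AVATAR codebase.
    """
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        # IndentationError is a subclass of SyntaxError; older interpreters
        # raise ValueError for source containing null bytes.
        return False


# A well-formed program and two damaged variants of the kind described above.
good = "def add(a, b):\n    return a + b\n"
lost_newline = "def add(a, b): return a + b print(add(1, 2))"
lost_indent = "def add(a, b):\nreturn a + b\n"

for name, src in [("good", good), ("lost_newline", lost_newline), ("lost_indent", lost_indent)]:
    print(name, is_parseable_python(src))
```

A check like this only covers the Python side; Java programs would need a Java parser, for example through the tree-sitter grammars cloned during setup.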
conda create --name avatar_env python==3.8
conda activate avatar_env
pip install -r requirements.txt
mkdir -p third_party
cd third_party
git clone https://github.com/tree-sitter/tree-sitter-java.git
git clone https://github.com/tree-sitter/tree-sitter-python.git
# optional (for fp16 training)
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
cd ..
# building tree-sitter library
python build.py
The dataset details are provided here. You can download the data as follows:
cd data
bash download.sh
To prepare the data, we perform the following steps.
If you want to run the preparation yourself:
cd data
bash prepare.sh
We studied 11 models for program translation.
[Models trained from scratch]
[Pre-trained models]
To train and evaluate a model, go to the corresponding model directory and execute its script (run.sh for most models).
# Seq2Seq+Attn, Transformer
cd seq2seq
bash rnn.sh GPU_ID SOURCE_LANG TARGET_LANG
bash transformer.sh GPU_ID SOURCE_LANG TARGET_LANG
# CodeBERT, GraphCodeBERT, CodeT5, PLBART
cd [codebert|graphcodebert|codet5|plbart]
bash run.sh GPU_ID SOURCE_LANG TARGET_LANG
# CodeGPT, CodeGPT-adapted
cd codegpt
bash run.sh GPU_ID SOURCE_LANG TARGET_LANG [CodeGPT|adaptedCodeGPT]
# Transcoder, Transcoder-DOBF, Transcoder-ST
cd transcoder
bash zero_shot.sh GPU_ID SOURCE_LANG TARGET_LANG [transcoder|transcoder-dobf|transcoder-st]
Here, SOURCE_LANG and TARGET_LANG each take one of [java|python].
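Since every script takes the same leading arguments (GPU_ID, SOURCE_LANG, TARGET_LANG), both translation directions can be generated programmatically. The sketch below only builds the command lines and does not run them. `translation_commands` is a hypothetical helper, and it assumes the source and target languages always differ, which the text above does not state explicitly.

```python
from itertools import permutations
from typing import List

# The two values allowed for SOURCE_LANG and TARGET_LANG.
LANGS = ("java", "python")


def translation_commands(script: str, gpu_id: int, *extra: str) -> List[List[str]]:
    """Build one argv list per translation direction (hypothetical helper).

    `extra` carries trailing arguments such as the CodeGPT or TransCoder
    model name expected by some scripts.
    """
    return [
        ["bash", script, str(gpu_id), src, tgt, *extra]
        for src, tgt in permutations(LANGS, 2)  # assumes source != target
    ]


for cmd in translation_commands("run.sh", 0):
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` from inside the relevant model directory; for the CodeGPT script, the model name goes in as the extra argument, e.g. `translation_commands("run.sh", 0, "CodeGPT")`.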
This dataset is licensed under a Creative Commons Attribution-ShareAlike 4.0 International license; see the LICENSE file for details.
@article{ahmad-etal-2021-avatar,
title={AVATAR: A Parallel Corpus for Java-Python Program Translation},
author={Ahmad, Wasi Uddin and Tushar, Md Golam Rahman and Chakraborty, Saikat and Chang, Kai-Wei},
journal={arXiv preprint arXiv:2108.11590},
year={2021}
}