wasiahmad / AVATAR


AVATAR

Official code release of our work, [AVATAR: A Parallel Corpus for Java-Python Program Translation](https://arxiv.org/abs/2108.11590).

Setup | Dataset | Models | Training & Evaluation | Benchmarks | License | Citation

Notice related to a dataset bug fix

There was a major bug in the AVATAR dataset, as raised in this issue. While crawling data from different sources, newlines were lost in many examples, and indentation was also missing in the Python data. As a result, the programs were not parseable. We re-crawled the data and ensured that every program we store is parseable. The bug has been fixed, so you can continue using the dataset seamlessly.
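
The parseability requirement for the Python side can be illustrated with the standard library. This is a minimal sketch, not the check used to build AVATAR: the function name `is_parseable_python` is hypothetical, and `ast.parse` only validates against the grammar of the interpreter running it. It shows how a program with lost indentation or lost newlines fails to parse.

```python
import ast


def is_parseable_python(source: str) -> bool:
    """Return True if `source` is syntactically valid Python."""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        # SyntaxError covers missing newlines and indentation
        # (IndentationError is a subclass). Some Python versions raise
        # ValueError for source that contains null bytes.
        return False


well_formed = "def add(a, b):\n    return a + b\n"
# The same program with its indentation lost.
missing_indent = "def add(a, b):\nreturn a + b\n"
# Two statements fused onto one line because the newline was lost.
missing_newline = "def add(a, b): return a + b print(add(1, 2))"

print(is_parseable_python(well_formed))      # True
print(is_parseable_python(missing_indent))   # False
print(is_parseable_python(missing_newline))  # False
```

A filter of this kind only confirms that a program parses; it says nothing about whether the program runs or produces the expected output.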

What is AVATAR?

Setup

conda create --name avatar_env python==3.8
conda activate avatar_env
pip install -r requirements.txt

mkdir -p third_party
cd third_party
git clone https://github.com/tree-sitter/tree-sitter-java.git
git clone https://github.com/tree-sitter/tree-sitter-python.git

# optional (for fp16 training)
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
cd ..

# building tree-sitter library
python build.py
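
Before running the build step, it can help to confirm that both grammar repositories were cloned. The following is a small sanity check, not part of the repository: the helper `missing_grammars` is hypothetical, and the script assumes it is run from inside `third_party`, as in the commands above.

```python
from pathlib import Path
from typing import List

# Directory names produced by the two `git clone` commands above.
EXPECTED_GRAMMARS = ("tree-sitter-java", "tree-sitter-python")


def missing_grammars(third_party_dir: str) -> List[str]:
    """Return the expected grammar directories that are absent."""
    root = Path(third_party_dir)
    return [name for name in EXPECTED_GRAMMARS if not (root / name).is_dir()]


if __name__ == "__main__":
    # Assumption: the current working directory is third_party.
    missing = missing_grammars(".")
    if missing:
        print("Missing grammar checkouts:", ", ".join(missing))
    else:
        print("Both grammar checkouts are present.")
```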

Dataset

The dataset details are provided here. You can download the data by running:

cd data
bash download.sh

To prepare the data, we perform the following steps.

If you want to perform the preparation on your own, run:

cd data
bash prepare.sh

Models

We studied 11 models for program translation.

[Models trained from scratch]

[Pre-trained models]

Training & Evaluation

To train and evaluate a model, go to the corresponding model directory and execute its script: run.sh for most models, rnn.sh or transformer.sh in seq2seq, and zero_shot.sh in transcoder.

# Seq2Seq+Attn, Transformer
cd seq2seq
bash rnn.sh GPU_ID SOURCE_LANG TARGET_LANG
bash transformer.sh GPU_ID SOURCE_LANG TARGET_LANG

# CodeBERT, GraphCodeBERT, CodeT5, PLBART
cd [codebert|graphcodebert|codet5|plbart]
bash run.sh GPU_ID SOURCE_LANG TARGET_LANG

# CodeGPT, CodeGPT-adapted
cd codegpt
bash run.sh GPU_ID SOURCE_LANG TARGET_LANG [CodeGPT|adaptedCodeGPT]

# TransCoder, TransCoder-DOBF, TransCoder-ST
cd transcoder
bash zero_shot.sh GPU_ID SOURCE_LANG TARGET_LANG [transcoder|transcoder-dobf|transcoder-st]
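
The commands above follow one pattern with three variations (directory, script name, and an optional trailing model argument). The sketch below captures that pattern as a lookup table so the right invocation can be assembled programmatically. It is an illustration only: the dictionary keys and the function `build_command` are hypothetical names, and the language values `java` and `python` are assumed, since the accepted values for SOURCE_LANG and TARGET_LANG are not stated here.

```python
from typing import Dict, List, Optional, Tuple

# Model label -> (directory, script, optional trailing argument),
# transcribed from the commands above. The labels are hypothetical.
MODEL_SCRIPTS: Dict[str, Tuple[str, str, Optional[str]]] = {
    "seq2seq+attn": ("seq2seq", "rnn.sh", None),
    "transformer": ("seq2seq", "transformer.sh", None),
    "codebert": ("codebert", "run.sh", None),
    "graphcodebert": ("graphcodebert", "run.sh", None),
    "codet5": ("codet5", "run.sh", None),
    "plbart": ("plbart", "run.sh", None),
    "codegpt": ("codegpt", "run.sh", "CodeGPT"),
    "codegpt-adapted": ("codegpt", "run.sh", "adaptedCodeGPT"),
    "transcoder": ("transcoder", "zero_shot.sh", "transcoder"),
    "transcoder-dobf": ("transcoder", "zero_shot.sh", "transcoder-dobf"),
    "transcoder-st": ("transcoder", "zero_shot.sh", "transcoder-st"),
}


def build_command(
    model: str, gpu_id: int, source_lang: str, target_lang: str
) -> Tuple[str, List[str]]:
    """Return (working directory, argument list) for the given model."""
    try:
        directory, script, extra = MODEL_SCRIPTS[model.lower()]
    except KeyError:
        raise ValueError(f"unknown model: {model!r}") from None
    command = ["bash", script, str(gpu_id), source_lang, target_lang]
    if extra is not None:
        command.append(extra)
    return directory, command


# Assumed language values, for illustration only.
directory, command = build_command("codegpt-adapted", 0, "java", "python")
print(directory, " ".join(command))
# codegpt bash run.sh 0 java python adaptedCodeGPT
```

The returned pair can be passed to `subprocess.run(command, cwd=directory)` from the repository root to launch the script.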

Benchmarks

License

This dataset is licensed under a Creative Commons Attribution-ShareAlike 4.0 International license; see the LICENSE file for details.

Citation

@article{ahmad-etal-2021-avatar,
  title={AVATAR: A Parallel Corpus for Java-Python Program Translation},
  author={Ahmad, Wasi Uddin and Tushar, Md Golam Rahman and Chakraborty, Saikat and Chang, Kai-Wei},
  journal={arXiv preprint arXiv:2108.11590},
  year={2021}
}