Unsupervised Data Augmentation (UDA) is a semi-supervised learning method that achieves state-of-the-art results on a wide variety of language and vision tasks.
With only 20 labeled examples, UDA outperforms the previous state of the art on IMDb, which was trained on 25,000 labeled examples:
Model | Number of labeled examples | Error rate (%) |
---|---|---|
Mixed VAT (Prev. SOTA) | 25,000 | 4.32 |
BERT | 25,000 | 4.51 |
UDA | 20 | 4.20 |
It reduces the error rate of the previous state-of-the-art methods by more than 30% on CIFAR-10 with 4,000 labeled examples and on SVHN with 1,000 labeled examples:
Model | CIFAR-10 error rate (%) | SVHN error rate (%) |
---|---|---|
ICT (Prev. SOTA) | 7.66±.17 | 3.53±.07 |
UDA | 4.31±.08 | 2.28±.10 |
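As a quick check of the "more than 30%" figure, the relative error reduction can be computed from the mean error rates in the table above. This is only an illustrative calculation: it ignores the ± spreads, and the helper name `relative_error_reduction` is ours, not something the repository provides.

```python
def relative_error_reduction(baseline_error, new_error):
    """Return the fractional drop in error rate relative to the baseline."""
    return (baseline_error - new_error) / baseline_error


# Mean error rates (%) copied from the table above: (ICT, UDA).
results = {
    "CIFAR-10": (7.66, 4.31),
    "SVHN": (3.53, 2.28),
}

for dataset, (ict_error, uda_error) in results.items():
    reduction = relative_error_reduction(ict_error, uda_error)
    print("{}: {:.1f}% relative error reduction".format(dataset, 100 * reduction))
```

Under these assumptions the reduction works out to roughly 43.7% on CIFAR-10 and 35.4% on SVHN.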
It also leads to significant improvements on ImageNet when only 10% of the data is labeled:
Model | Top-1 accuracy (%) | Top-5 accuracy (%) |
---|---|---|
ResNet-50 | 55.09 | 77.26 |
UDA | 68.78 | 88.80 |
UDA is a semi-supervised learning method that reduces the need for labeled examples and makes better use of unlabeled ones.
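The paper cited at the end of this README frames UDA as consistency training: the model is encouraged to give similar predictions for an unlabeled example and for an augmented copy of it, alongside the usual supervised loss on the few labeled examples. The following is a simplified sketch of that idea in plain Python, not the repository's implementation. The function names, the example probabilities, and the `unsup_weight` parameter are all illustrative assumptions.

```python
import math


def cross_entropy(probs, label, eps=1e-12):
    """Supervised loss for one labeled example, given predicted class probabilities."""
    return -math.log(probs[label] + eps)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def uda_style_loss(labeled_probs, label, unlabeled_probs, augmented_probs, unsup_weight=1.0):
    """Supervised loss plus a weighted consistency term on an unlabeled example."""
    supervised = cross_entropy(labeled_probs, label)
    # Consistency term: predictions on the original and the augmented
    # unlabeled example should agree.
    consistency = kl_divergence(unlabeled_probs, augmented_probs)
    return supervised + unsup_weight * consistency


# Made-up predictions for a 3-class problem.
loss = uda_style_loss(
    labeled_probs=[0.7, 0.2, 0.1], label=0,
    unlabeled_probs=[0.6, 0.3, 0.1],
    augmented_probs=[0.4, 0.4, 0.2],
)
print("combined loss: {:.4f}".format(loss))
```

When the two predictions on the unlabeled example already agree, the consistency term is zero and only the supervised loss remains.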
We are releasing the following:
All of the code in this repository works out-of-the-box with GPU and Google Cloud TPU.
The code is tested on Python 2.7 and TensorFlow 1.13. After installing TensorFlow, run the following command to install dependencies:
pip install --user absl-py
We generate 100 augmented examples for every original example. To download all the augmented data, go to the image directory and run:
AUG_COPY=100
bash scripts/download_cifar10.sh ${AUG_COPY}
Note that you need 120 GB of disk space for all the augmented data. To save space, you can set AUG_COPY to a smaller number such as 30.
Alternatively, you can generate the augmented examples yourself by running:
AUG_COPY=100
bash scripts/preprocess.sh --aug_copy=${AUG_COPY}
GPU command:
# UDA accuracy on CIFAR-10, by number of labeled examples:
# 4000: 95.68 +- 0.08
# 2000: 95.27 +- 0.14
# 1000: 95.25 +- 0.10
# 500: 95.20 +- 0.09
# 250: 94.57 +- 0.96
bash scripts/run_cifar10_gpu.sh --aug_copy=${AUG_COPY}
# UDA accuracy on SVHN, by number of labeled examples:
# 4000: 97.72 +- 0.10
# 2000: 97.80 +- 0.06
# 1000: 97.77 +- 0.07
# 500: 97.73 +- 0.09
# 250: 97.28 +- 0.40
bash scripts/run_svhn_gpu.sh --aug_copy=${AUG_COPY}
The movie reviews in IMDb are longer than the texts in many other classification tasks, so using a longer sequence length leads to better performance. With BERT, the sequence length is limited by TPU/GPU memory (see the out-of-memory issues of BERT). We therefore provide scripts that run with shorter sequence lengths and smaller batch sizes.
If you want to run UDA with BERT base on a GPU with 11 GB of memory, go to the text directory and run the following commands:
# Set a larger max_seq_length if your GPU has more than 11 GB of memory
MAX_SEQ_LENGTH=128
# Download data and pretrained BERT checkpoints
bash scripts/download.sh
# Preprocessing
bash scripts/prepro.sh --max_seq_length=${MAX_SEQ_LENGTH}
# Baseline accuracy: around 68%
bash scripts/run_base.sh --max_seq_length=${MAX_SEQ_LENGTH}
# UDA accuracy: around 90%
# Set a larger train_batch_size to achieve better performance if your GPU has more memory.
bash scripts/run_base_uda.sh --train_batch_size=8 --max_seq_length=${MAX_SEQ_LENGTH}
The best performance in the paper is achieved by using a max_seq_length of 512 and initializing with BERT large fine-tuned on in-domain unsupervised data. If you have access to a Google Cloud TPU v3-32 Pod, try:
MAX_SEQ_LENGTH=512
# Download data and pretrained BERT checkpoints
bash scripts/download.sh
# Preprocessing
bash scripts/prepro.sh --max_seq_length=${MAX_SEQ_LENGTH}
# UDA accuracy: 95.3% - 95.9%
bash train_large_ft_uda_tpu.sh
First of all, install the following dependencies:
pip install --user nltk
python -c "import nltk; nltk.download('punkt')"
pip install --user tensor2tensor==1.13.4
The following command translates the provided example file. It automatically splits paragraphs into sentences, translates English sentences to French and then translates them back into English. Finally, it composes the paraphrased sentences into paragraphs. Go to the back_translate directory and run:
bash download.sh
bash run.sh
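The flow described above (split into sentences, translate to French, translate back, recompose) can be sketched as follows. This is a minimal illustration and not the code behind run.sh: the regex splitter is a naive stand-in for a real sentence tokenizer such as NLTK's punkt, and the two translator functions are hypothetical placeholders where the actual translation models would be called.

```python
import re


def split_sentences(paragraph):
    """Naive sentence splitter; a stand-in for a real tokenizer such as NLTK punkt."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def back_translate_paragraph(paragraph, en_to_fr, fr_to_en):
    """Paraphrase a paragraph sentence by sentence via a round trip through French."""
    sentences = split_sentences(paragraph)
    french = [en_to_fr(s) for s in sentences]
    paraphrases = [fr_to_en(s) for s in french]
    # Compose the paraphrased sentences back into one paragraph.
    return " ".join(paraphrases)


# Hypothetical placeholder translators so that the sketch runs end to end.
def fake_en_to_fr(sentence):
    return "<fr>" + sentence


def fake_fr_to_en(sentence):
    return sentence[len("<fr>"):]


print(back_translate_paragraph(
    "The plot is thin. The acting is great!", fake_en_to_fr, fake_fr_to_en))
```

With real translation models in place of the placeholders, each sentence would come back as a paraphrase rather than an exact copy.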
There is a variable sampling_temp in the bash file. It is used to control the diversity and quality of the paraphrases. Increasing sampling_temp will lead to increased diversity but worse quality. Surprisingly, diversity is more important than quality for many tasks we tried.
We suggest trying sampling_temp values of 0.7, 0.8 and 0.9. If your task is very robust to noise, sampling_temp=0.9 or 0.8 should lead to improved performance. If your task is not robust to noise, setting sampling_temp to 0.7 or 0.6 should be better.
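To build intuition for the diversity and quality trade-off, the sketch below assumes that sampling_temp behaves like a standard softmax temperature applied to the decoder's token scores; this README does not spell out the mechanism, so treat that as an assumption. The logits are made up, and entropy is used here as a rough proxy for diversity.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter, more diverse distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for temp in (0.6, 0.7, 0.8, 0.9):
    probs = softmax_with_temperature(logits, temp)
    print("temp={:.1f} top_prob={:.3f} entropy={:.3f}".format(
        temp, max(probs), entropy(probs)))
```

As the temperature rises, the top token's probability falls and the entropy grows, so sampling picks less likely tokens more often: more varied paraphrases, at the cost of more errors.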
If you want to back-translate a large file, you can change the replicas and worker_id arguments in run.sh. For example, when replicas=3, the data is divided into three parts, and each run of run.sh processes only the part selected by its worker_id.
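One way to picture the split is as contiguous shards indexed by worker_id. The sketch below is an assumption about the general scheme, not a copy of what the scripts do; the actual partitioning may differ (for example, it could be strided rather than contiguous), and `shard_bounds` is a hypothetical helper.

```python
def shard_bounds(num_examples, replicas, worker_id):
    """Return the [start, end) range of examples handled by one worker."""
    per_worker = (num_examples + replicas - 1) // replicas  # ceiling division
    start = min(worker_id * per_worker, num_examples)
    end = min(start + per_worker, num_examples)
    return start, end


paragraphs = ["paragraph {}".format(i) for i in range(10)]
replicas = 3
for worker_id in range(replicas):
    start, end = shard_bounds(len(paragraphs), replicas, worker_id)
    print("worker {} handles examples {} to {}".format(worker_id, start, end - 1))
```

Each worker can then be launched independently with its own worker_id, and the outputs concatenated in worker order afterwards.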
UDA works out of the box and does not require extensive hyperparameter tuning, but to really push the performance, here are some suggestions about hyperparameters:
A large portion of the code is taken from BERT and RandAugment. Thanks!
Please cite this paper if you use UDA.
@article{xie2019unsupervised,
title={Unsupervised Data Augmentation for Consistency Training},
author={Xie, Qizhe and Dai, Zihang and Hovy, Eduard and Luong, Minh-Thang and Le, Quoc V},
journal={arXiv preprint arXiv:1904.12848},
year={2019}
}
Please also cite this paper if you use UDA for images.
@article{cubuk2019randaugment,
title={RandAugment: Practical data augmentation with no separate search},
author={Cubuk, Ekin D and Zoph, Barret and Shlens, Jonathon and Le, Quoc V},
journal={arXiv preprint arXiv:1909.13719},
year={2019}
}
This is not an officially supported Google product.