To help you quickly reproduce existing work on text style transfer, we release the outputs of all models and the corresponding references.
The outputs of all models are in the outputs/ directory.
The references are in the references/ directory. We also release three more references that we collected on the yelp test dataset, namely reference[0,1,2,3].0 (the transferred references of negative sentences) and reference[0,1,2,3].1 (the transferred references of positive sentences). reference0.0 and reference0.1 were collected by Li et al., 2018. We strongly recommend that you use the released multi-reference dataset because it correlates more strongly with human evaluation results.
PS: We welcome pull requests from other researchers that add the outputs of their models.
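As a rough illustration of how the multi-reference files could be consumed, the sketch below groups the reference files for one style so that each test sentence gets a list of four references. The helper name load_multi_references is hypothetical, and the layout it expects (files named reference<k>.<style> sitting directly in one directory, one sentence per line, aligned by line number) is an assumption rather than something the released code guarantees.

```python
import io
import os


def load_multi_references(ref_dir, style, n_refs=4):
    """Return one list of references per test sentence.

    Assumes files named reference<k>.<style> (k = 0..n_refs-1), one sentence
    per line, where line i of every file refers to the same test sentence.
    """
    columns = []
    for k in range(n_refs):
        path = os.path.join(ref_dir, "reference%d.%s" % (k, style))
        with io.open(path, encoding="utf-8") as f:
            columns.append([line.rstrip("\n") for line in f])
    # Every reference file should cover the same number of test sentences.
    if len(set(len(col) for col in columns)) != 1:
        raise ValueError("reference files have different numbers of lines")
    # Transpose so that refs[i][k] is the k-th reference of sentence i.
    return [list(group) for group in zip(*columns)]
```

For example, load_multi_references("references/yelp", "0") would return, for each negative test sentence, its four transferred references, which can then be passed to whichever multi-reference BLEU implementation you prefer.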
The yelp dataset is in the data/yelp directory, where x.0 denotes the negative data of split x and x.1 denotes the positive data of split x, with x in [train, dev, test].
The pseudo-parallel data is in the data/yelp/tsf_template directory. x.0.tsf denotes the negative transferred file in which each line contains only the sentiment-transferred sentence, while x.0-1.tsf denotes the negative transferred file in which each line contains both the original sentence (input) and the sentiment-transferred sentence (output).
Since the GYAFC dataset is free of charge only for research purposes, we publish only a subset of the test dataset in the family and relationships domain (data/GYAFC/), the outputs (outputs/GYAFC/) of each system (including our model and all baselines), and the corresponding human references (references/GYAFC/). If you want to download the train and validation datasets, please follow the guidance at https://github.com/raosudha89/GYAFC-corpus. Then name the corpora of the two styles in the same way as the yelp dataset.
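To make the x.0 / x.1 naming convention concrete, here is a minimal sketch that reads one split of a dataset laid out like data/yelp and attaches a style label to each sentence. The function name load_split and the one-sentence-per-line assumption are illustrative and not part of the repository.

```python
import io
import os


def load_split(data_dir, split):
    """Read <split>.0 (negative) and <split>.1 (positive) from data_dir.

    Returns a list of (sentence, label) pairs with label 0 = negative and
    label 1 = positive. Assumes one sentence per line.
    """
    examples = []
    for label in (0, 1):
        path = os.path.join(data_dir, "%s.%d" % (split, label))
        with io.open(path, encoding="utf-8") as f:
            for line in f:
                sentence = line.strip()
                if sentence:  # skip blank lines
                    examples.append((sentence, label))
    return examples
```

Calling load_split("data/yelp", "dev") would then yield the negative dev sentences followed by the positive ones.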
First, specify the dataset. For example, for the yelp dataset:
export DATASET=yelp
If you want to use your own datasets, please follow the guidance in the section Extend to other tasks and datasets.
cd classifier
python textcnn.py --mode train
Note: If you get the error no module named opennmt, please install OpenNMT-tf: pip install OpenNMT-tf==1.15.0.
To generate pseudo-parallel data, we follow the template-based method proposed by Li et al., 2018. We have provided the pseudo-parallel data of the yelp dataset in the data/yelp/tsf_template directory. However, if you want to generate the pseudo-parallel data using templates yourself, you can follow this link or design your own templates that suit your task and dataset.
The default encoder and decoder are bilstm.
cd nmt
python nmt.py --mode train --nmt_direction 0-1 --n_epoch 5 # Pre-train forward (f) model
python nmt.py --mode train --nmt_direction 1-0 --n_epoch 5 # Pre-train backward (g) model
If you want to use a transformer as the encoder and decoder, run the following commands:
cd nmt
python nmt.py --mode train --nmt_direction 0-1 --n_epoch 5 --n_layer 6 --encoder_decoder_type transformer
python nmt.py --mode train --nmt_direction 1-0 --n_epoch 5 --n_layer 6 --encoder_decoder_type transformer
python dual_training.py --n_epoch 10
The final transferred results are in the ../tmp/output/${DATASET}_final/ directory.
If you don't have parallel or paired data, here are the steps you might go through:
data/yelp/ and references/yelp/.
If you have parallel or paired data, here are the steps you might go through:
data/yelp/tsf_template
data/yelp/ and references/yelp/.
You can run the following command to see which parameters need to be set:
python [dual_training.py | nmt.py | textcnn.py] --help
For some tasks, Li's method can't be used to generate pseudo-parallel data. Here are some related frequently asked questions:
You can refer to this issue to generate pseudo-parallel data by simply adding some noise to the original sentence.
Of course you can! In fact, we have tried using CrossAlignment (Shen et al.) to generate pseudo-parallel data. However, the experimental results are worse than with the template-based method.
We have tried merging the pseudo-parallel data generated by CrossAlignment (Shen et al.) and the template-based method (Li et al.) to pre-train our model. There is a slight improvement in the experimental results.
This is an interesting question. I will try removing the pre-training step. I think a feasible solution is to just initialize the word embeddings of the seq2seq (nmt) model, inspired by the three principles of unsupervised machine translation.
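The noise-based option mentioned in the first answer can be sketched as follows. The exact recipe from the referenced issue is not reproduced here; the word dropout and local shuffling below are a common choice of noise that we assume purely for illustration, and the function name add_noise and its parameters are hypothetical.

```python
import random


def add_noise(sentence, drop_prob=0.1, max_shuffle_distance=3, rng=None):
    """Return a noisy copy of a whitespace-tokenized sentence.

    Each word is dropped with probability drop_prob, then the remaining words
    are shuffled locally so that words only move a few positions.
    """
    if rng is None:
        rng = random.Random()
    words = sentence.split()
    kept = [w for w in words if rng.random() >= drop_prob]
    if not kept:  # keep at least one word when the input is non-empty
        kept = words[:1]
    # Sort by position plus a bounded random offset: a local shuffle.
    keys = [i + rng.uniform(0, max_shuffle_distance) for i in range(len(kept))]
    shuffled = [w for _, w in sorted(zip(keys, kept), key=lambda p: p[0])]
    return " ".join(shuffled)
```

In this setup the noisy sentence plays the role of y' (the lower-quality input) and the original sentence plays the role of x (the target).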
Note: No matter which method you use to construct pseudo-parallel data, the style-transferred or generated sentence y' (lower quality) should be the input, not the output (ground truth). Our experiments confirm that this is important. In practice, you need to put lines of the form y'\tx\n into the files of the tsf_template directory.
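A minimal sketch of writing such a file is shown here, assuming you already have aligned lists of original sentences x and lower-quality generated sentences y'. The function name write_pseudo_parallel is hypothetical; only the y'\tx\n line format comes from the note above.

```python
import io


def write_pseudo_parallel(path, generated, originals):
    """Write one "y'<TAB>x" line per pair: the generated sentence first
    (input), the original sentence second (ground-truth output)."""
    if len(generated) != len(originals):
        raise ValueError("generated and originals must be aligned")
    with io.open(path, "w", encoding="utf-8") as f:
        for y_prime, x in zip(generated, originals):
            # Tabs inside a sentence would break the two-column format.
            y_prime = y_prime.replace("\t", " ").strip()
            x = x.replace("\t", " ").strip()
            f.write(u"%s\t%s\n" % (y_prime, x))
```

The length check is there because a silent misalignment between the two lists would pair every input with the wrong target.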
python==2.7
numpy==1.14.2
tensorflow==1.13.1
OpenNMT-tf==1.15.0
If you use this code, please cite the following paper:
@inproceedings{Luo19DualRL,
  author    = {Fuli Luo and
               Peng Li and
               Jie Zhou and
               Pengcheng Yang and
               Baobao Chang and
               Zhifang Sui and
               Xu Sun},
  title     = {A Dual Reinforcement Learning Framework for Unsupervised Text Style Transfer},
  booktitle = {Proceedings of the 28th International Joint Conference on Artificial Intelligence, {IJCAI} 2019},
  year      = {2019},
}