An implementation of "Semi-Supervised Low-Resource Style Transfer of Indonesian Informal to Formal Language with Iterative Forward-Translation".
This repository also contains an Indonesian informal-formal parallel corpus.
Our research focuses on transforming an informal sentence into its formal form. We treat style transfer from informal to formal Indonesian as a low-resource machine translation problem and benchmark several strategies for performing it.
In this repository, we provide the Phrase-Based Statistical Machine Translation system, which achieved the best result in our experiments. Note that our data is extremely low-resource and domain-specific (customer service domain), so the system might not be robust to out-of-domain input. Our future work includes exploring more robust style transfer. Stay tuned!
You can access our paper below:
Medium Article: Mengubah Bahasa Indonesia Informal Menjadi Baku Menggunakan Kecerdasan Buatan (In Indonesian)
We use the Moses RELEASE 4.0 build for Ubuntu 17.04+, which only works on that OS.
We have not tested it on other operating systems (e.g. OS X and Windows). If you want to run the source code, use Ubuntu 17.04+. If you use Windows, we advise you to run the code under WSL 2.
In this experiment, we wrap the Moses code using Python's subprocess module, so a Python installation is required. The system is tested on Python 3.9. We recommend installing Python with Miniconda, which you can install by following this link: https://docs.conda.io/en/latest/miniconda.html
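As a rough illustration of the wrapping approach, the sketch below runs an external command through subprocess and returns its standard output. The helper name run_command is hypothetical and is not taken from this repository, whose own wrapper may differ; a Moses binary would be invoked the same way by passing its path and arguments as the list.

```python
import subprocess
import sys
from typing import List


def run_command(args: List[str]) -> str:
    """Run an external command and return its stdout as text.

    Hypothetical helper for illustration; the repository's own wrapper may differ.
    """
    completed = subprocess.run(
        args,
        check=True,           # raise CalledProcessError on a non-zero exit code
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode bytes to str
    )
    return completed.stdout


if __name__ == "__main__":
    # A harmless stand-in command; replace it with the path to a Moses binary.
    print(run_command([sys.executable, "-c", "print('hello from a subprocess')"]))
```

Using check=True means a failing external tool stops the pipeline immediately instead of silently producing an empty result.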
First, clone the repository:
git clone https://github.com/haryoa/stif-indonesia.git
Then run the Moses downloader. It is a .sh script, so use a shell that can execute it (for example, bash). From the project root directory, run:
bash scripts/download_moses.sh
The script will download the Moses toolkit and extract it by itself.
Before running the program, you have to install the prerequisite packages:
pip install -r requirements.txt
Alternatively, if you prefer to use pipenv, you can run:
pipenv install
NOTE: If you use pipenv, prefix each command with pipenv run. For example:
pipenv run python -m stif_indonesia --exp-scenario supervised
To run the supervised one, do:
python -m stif_indonesia --exp-scenario supervised
It will read the experiment config in experiment-config/00001_default_supervised_config.json
To run the semi-supervised one, do:
python -m stif_indonesia --exp-scenario semi-supervised
It will read the experiment config in experiment-config/00002_default_semi_supervised_config.json
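The mapping from scenario to config file described above can be sketched as follows. The function and dictionary names are hypothetical, and the sketch assumes only that the config files are valid JSON; it does not assume any particular keys inside them.

```python
import json
from pathlib import Path
from typing import Any, Dict

# Scenario names and config paths as listed in this README.
DEFAULT_CONFIGS: Dict[str, str] = {
    "supervised": "experiment-config/00001_default_supervised_config.json",
    "semi-supervised": "experiment-config/00002_default_semi_supervised_config.json",
}


def config_path_for(scenario: str) -> Path:
    """Return the default config path for a scenario (hypothetical helper)."""
    if scenario not in DEFAULT_CONFIGS:
        raise ValueError(f"Unknown scenario: {scenario!r}")
    return Path(DEFAULT_CONFIGS[scenario])


def load_config(path: Path) -> Any:
    """Parse a JSON config file and return its content."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    path = config_path_for("supervised")
    if path.exists():  # only true when run from the repository root
        print(json.dumps(load_config(path), indent=2, ensure_ascii=False))
    else:
        print(f"{path} not found; run this from the repository root.")
```

Printing the parsed config before a run is a quick way to confirm which settings an experiment will use.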
A run produces a log.log file and an output folder.
For the supervised scenario, the output folder contains evaluation, lm, and train. evaluation is the prediction result on the test set, lm is the trained language model (LM), and train is the model produced by the Moses toolkit.
For the semi-supervised scenario, it will output agg_data, best_model_dir, and produced_tgt_data. agg_data is the result of the forward-iteration data synthesis, best_model_dir is the best model produced by the training process, and produced_tgt_data is the prediction output on the test set.
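To check quickly which of the entries named in this section a run has produced, a small helper like the one below can be used. It is only a sketch: the function name is hypothetical, and it assumes the entries sit directly inside the output folder, which this README does not state explicitly.

```python
from pathlib import Path
from typing import Dict, Iterable

# Entry names taken from this README.
EXPECTED_ENTRIES = (
    "evaluation", "lm", "train",
    "agg_data", "best_model_dir", "produced_tgt_data",
)


def find_entries(output_dir: str, names: Iterable[str] = EXPECTED_ENTRIES) -> Dict[str, bool]:
    """Map each expected entry name to whether it exists under output_dir."""
    root = Path(output_dir)
    return {name: (root / name).exists() for name in names}


if __name__ == "__main__":
    # Assumes the script is run from the folder that contains "output".
    for name, present in find_entries("output").items():
        print(f"{name}: {'found' if present else 'missing'}")
```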
Please check the log.log file, which records the output of the process.
If you want to replicate the dictionary-based method, you can use any informal-formal or slang dictionary available on the internet.
For example, you can use this dictionary.
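A dictionary-based baseline can be as simple as word-by-word lookup. The sketch below is an assumption about how such a method might look, not the exact procedure from the paper; the function name is hypothetical, and the three-entry dictionary is a tiny illustrative sample that should be replaced with a real informal-formal dictionary.

```python
from typing import Dict

# Tiny illustrative sample, NOT the dictionary used in the paper.
SAMPLE_DICTIONARY: Dict[str, str] = {
    "gak": "tidak",
    "udah": "sudah",
    "gimana": "bagaimana",
}


def formalize(sentence: str, dictionary: Dict[str, str]) -> str:
    """Replace each whitespace-separated token that has a dictionary entry.

    Lookup is case-insensitive; tokens without an entry are kept unchanged.
    Punctuation attached to a token is not handled in this sketch.
    """
    tokens = sentence.split()
    return " ".join(dictionary.get(token.lower(), token) for token in tokens)


if __name__ == "__main__":
    print(formalize("paketnya gak sampai", SAMPLE_DICTIONARY))
```

Because the lookup works on isolated tokens, it cannot reorder words or resolve ambiguous slang, which is one reason to compare it against the translation-based approaches.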
If you want to replicate our GPT-2 experiment, you can use a pre-trained Indonesian GPT-2 such as this one, or train one yourself on the OSCAR corpus. After that, you can fine-tune it with the dataset that we have provided here. Follow the paper for how to transform the data before fine-tuning.
Note: You can also download the OSCAR corpus through Huggingface's datasets library.
We use Huggingface's off-the-shelf implementation to train the model.