twairball / t2t_wmt_zhen

NMT for Chinese-English using tensor2tensor
MIT License

WMT17 English-Chinese

This repo is a collection of experiments on the WMT17 English-Chinese translation task.

Setup

pip install -r requirements.txt

Run

# orig wmt zh-en translation task in t2t repo
./main/gen_orig.sh
./main/train_orig.sh

# baseline wmt
./main/gen_base.sh
./main/train_base.sh

# wmt with preprocessing
./main/gen_wmt.sh
./main/train_wmt.sh

Experiments

  1. Original

This is the original WMT17 Zh-En translation problem from the tensor2tensor repo. It trains only on News Commentary (227k lines) and builds an 8k vocabulary.

  2. Base

Trains on the full training dataset (~24M lines), using the jieba segmenter to tokenize the Chinese corpus. Builds a 32k vocabulary.
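Chinese text has no whitespace between words, so a segmentation pass is needed before vocabulary building. The repo uses jieba for this; the toy greedy longest-match segmenter below is only an illustrative sketch of the idea (jieba itself uses a prefix dictionary plus an HMM, and the tiny dictionary here is a made-up example):

```python
# Toy longest-match segmenter -- a simplified stand-in for jieba.
# The dictionary below is illustrative only, not jieba's.
VOCAB = {"南京市", "长江大桥", "南京", "市长", "大桥"}

def segment(text, vocab=VOCAB, max_len=4):
    """Greedily match the longest dictionary word at each position;
    fall back to a single character when nothing matches."""
    out, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in vocab:
                out.append(piece)
                i += n
                break
    return out

print(segment("南京市长江大桥"))  # ['南京市', '长江大桥']
```

Without this step, a subword vocabulary would be built over raw character sequences, which tends to produce worse segment boundaries for Chinese.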

  3. Preprocessed

Trains on a cleaned dataset (~18M lines) after preprocessing. Uses the jieba segmenter for the Chinese corpus. Builds a 32k vocabulary.
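The preprocessing steps that shrink the corpus from ~24M to ~18M lines are not listed here. Parallel-corpus cleaning commonly drops empty, over-long, badly length-mismatched, and duplicate sentence pairs; the sketch below shows such heuristics, with thresholds that are illustrative assumptions rather than this repo's actual values:

```python
def clean_pairs(pairs, max_len=200, max_ratio=3.0):
    """Filter a parallel corpus of (source, target) sentence pairs.
    Drops empty, over-long, length-mismatched, and duplicate pairs.
    Thresholds are illustrative assumptions, not the repo's values."""
    seen, kept = set(), []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # empty side
        if len(src) > max_len or len(tgt) > max_len:
            continue  # over-long sentence
        if max(len(src), len(tgt)) / min(len(src), len(tgt)) > max_ratio:
            continue  # suspicious length ratio (likely misalignment)
        if (src, tgt) in seen:
            continue  # exact duplicate
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept

pairs = [
    ("你好", "hello"),
    ("你好", "hello"),          # duplicate
    ("", "x"),                  # empty source
    ("短", "a very very very long english sentence"),  # bad ratio
]
print(clean_pairs(pairs))  # [('你好', 'hello')]
```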

Results

Experiment   Steps   BLEU-4
wmt-pp       240k    21.72
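The BLEU-4 score above comes from proper evaluation; the approx BLEU logged during tensor2tensor training is computed on subword tokens and is not comparable. As a reminder of what BLEU-4 measures, here is a minimal sentence-level sketch (geometric mean of modified 1- to 4-gram precisions times a brevity penalty; real evaluation works at corpus level, e.g. with sacreBLEU):

```python
import math
from collections import Counter

def bleu4(candidate, reference):
    """Sentence-level BLEU-4 on token lists: geometric mean of modified
    1..4-gram precisions times a brevity penalty. A teaching sketch,
    not a replacement for corpus-level evaluation tools."""
    precisions = []
    for n in range(1, 5):
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped matches
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0  # no smoothing: any zero n-gram precision zeroes the score
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))  # brevity penalty
    return bp * math.exp(sum(math.log(p) for p in precisions) / 4)

cand = "the cat sat on the mat".split()
print(bleu4(cand, cand))  # 1.0 for a perfect match
```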

[Training curves omitted: loss, perplexity, accuracy, and approx BLEU (not accurate).]