google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

BLEU score reduced with subword regularization for low-resource translation #586

Closed Rashmini closed 3 years ago

Rashmini commented 3 years ago

I'm developing a transformer-based NMT system for low-resource English-Sinhala translation using a parallel corpus of 54k sentences (vocab size = 5k). I experimented with BPE and unigram as subword segmentation techniques, and then further experimented with subword regularization and BPE-dropout.
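For reference, this is a minimal sketch of how the two segmentation models could be trained with the SentencePiece Python API; the 5k vocabulary size comes from the setup above, while the corpus path and model prefixes are placeholders.

```python
import sentencepiece as spm

# Train a unigram model and a BPE model with the 5k vocabulary mentioned above.
# "corpus.en-si.txt" is a hypothetical path to one side of the parallel corpus.
for model_type in ("unigram", "bpe"):
    spm.SentencePieceTrainer.train(
        input="corpus.en-si.txt",
        model_prefix=f"spm_{model_type}",
        vocab_size=5000,
        model_type=model_type,
    )
```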

For BPE-dropout, I used l=64, p=0.1 for training and l=1, p=0 for validation and testing, as stated in the paper. However, the BLEU score was 3.8 points lower than that of the original BPE model. For subword regularization, I experimented with l=64 and l=-1 and various values of alpha from 0.1 to 1. However, the best BLEU score obtained was 0.3 points lower than that of the original unigram-segmented model.
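As a concrete reference for the parameters above, sampling is enabled in the SentencePiece Python API roughly as follows; l maps to `nbest_size` and p/alpha to `alpha`. This is only a sketch of the intended setup, not the exact pipeline used here.

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spm_unigram.model")  # or spm_bpe.model

# Training time: sample a different segmentation on every call.
# Unigram: nbest_size=64 (or -1 for the full lattice), alpha = smoothing parameter.
# BPE: alpha is interpreted as the dropout probability p.
pieces_train = sp.encode(
    "This is a test sentence.",
    out_type=str,
    enable_sampling=True,
    nbest_size=64,
    alpha=0.1,
)

# Validation / test time: deterministic one-best segmentation (l=1, p=0).
pieces_eval = sp.encode("This is a test sentence.", out_type=str)
```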

Since the papers state that subword regularization and BPE-dropout increase the BLEU score for low-resource languages, what could be the reasons for these reduced BLEU scores? Is there any way to improve them?

taku910 commented 3 years ago

Generally speaking, subword regularization needs more training iterations (epochs), since it acts as a form of data augmentation. How did you set up the learning-rate schedule? Did you optimize it with dev data?

Rashmini commented 3 years ago

I used the same learning-rate schedule as Vaswani et al. (increase linearly for the first 4000 steps, then decrease proportionally to the inverse square root of the step number), and trained the models for 200 epochs.
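For completeness, the schedule from Vaswani et al. is lrate = d_model^(-0.5) * min(step^(-0.5), step * warmup^(-1.5)). A small sketch, with the d_model value assumed rather than taken from this issue:

```python
def noam_lr(step: int, d_model: int = 512, warmup: int = 4000) -> float:
    """Learning rate from Vaswani et al.: linear warm-up for `warmup` steps,
    then decay proportional to the inverse square root of the step number."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```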

taku910 commented 3 years ago

You might want to try more epochs (say 1000), or early stopping with dev data.
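A minimal sketch of early stopping on dev BLEU, assuming hypothetical `train_one_epoch()`, `evaluate_bleu()`, and `save_checkpoint()` helpers and a patience of 10 epochs:

```python
best_bleu, patience, bad_epochs = 0.0, 10, 0

for epoch in range(1000):
    train_one_epoch(model)                    # hypothetical training step
    dev_bleu = evaluate_bleu(model, dev_set)  # hypothetical dev-set evaluation

    if dev_bleu > best_bleu:
        best_bleu, bad_epochs = dev_bleu, 0
        save_checkpoint(model)                # keep the best checkpoint
    else:
        bad_epochs += 1
        if bad_epochs >= patience:            # stop once dev BLEU stops improving
            break
```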