rsennrich / subword-nmt

Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
MIT License

too many @ in the result #71

Closed kFoodie closed 5 years ago

kFoodie commented 5 years ago

I learned two vocabularies from about 900M of Chinese-English parallel data, and then applied BPE to two data sets (a 900M training set and a 500K test set) using these two Chinese-English vocabularies.

The training set produces normal output, but the test set output contains many @ symbols.

Before running subword-nmt BPE, I had already word-segmented the Chinese text.

The corresponding instructions are as follows:

python learn_joint_bpe_and_vocab.py --input data/train.en data/train.zh -s 32000 -o data/bpe32k --write-vocabulary data/vocab.en data/vocab.zh
python apply_bpe.py --vocabulary data/vocab.en --vocabulary-threshold 50 -c data/bpe32k < data/train.en > data/corpus.32k.en
python apply_bpe.py --vocabulary data/vocab.zh --vocabulary-threshold 50 -c data/bpe32k < data/train.zh > data/corpus.32k.zh
python apply_bpe.py --vocabulary data/vocab.zh --vocabulary-threshold 50 -c data/bpe32k < data/valid.zh > data/aval_bpe_enzh.zh
python apply_bpe.py --vocabulary data/vocab.en --vocabulary-threshold 50 -c data/bpe32k < data/valid.en > data/aval_bpe_enzh.en

The result is shown in the attached screenshots.


I don't know where the problem is. Could you please help me figure it out? Thank you very much.

rsennrich commented 5 years ago

Hey kFoodie,

it looks like you didn't tokenize or word-segment the text before applying subword segmentation. We usually tokenize English text with the Moses tokenizer first, and segment Chinese text with Jieba.
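
For reference, a minimal preprocessing sketch along these lines might look as follows; the path to tokenizer.perl and the file names here are assumptions for illustration, not taken from this thread:

# Tokenize the English side with the Moses tokenizer (path to mosesdecoder is an assumption)
perl mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < data/train.en > data/train.tok.en

# Word-segment the Chinese side with Jieba's command-line mode,
# using a plain space as the delimiter between segmented words
python -m jieba -d ' ' data/train.zh > data/train.seg.zh

learn_joint_bpe_and_vocab.py and apply_bpe.py would then be run on the tokenized/segmented files rather than on the raw text.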

kFoodie commented 5 years ago

I have segmented the Chinese text…… (see screenshot) and this is the result: (see screenshot)

kFoodie commented 5 years ago

I also segmented the Chinese text with Jieba……

kFoodie commented 5 years ago

emmm…… it turns out there was no problem after all... Thank you!!

rsennrich commented 5 years ago

Glad to hear it's working fine.

BattsetsegB commented 5 years ago

Hey, I have the same problem, but with Mongolian, on about 25k Chinese-Mongolian parallel sentences. I tokenized both the Mongolian and the Chinese text with the Moses tokenizer first, segmented the Chinese text with Jieba before applying subword segmentation, and used the preprocessing scripts from http://data.statmt.org/wmt17_systems/training/. When tokenizing, it prints these warnings:

Tokenizer Version 1.1
Language: zh
Number of threads: 1
Tokenizer Version 1.1
Language: mn
Number of threads: 1
WARNING: No known abbreviations for language 'mn', attempting fall-back to English version...
Tokenizer Version 1.1
Language: zh
Number of threads: 1
Tokenizer Version 1.1
Language: mn
Number of threads: 1
WARNING: No known abbreviations for language 'mn', attempting fall-back to English version...
clean-corpus.perl: processing ./../data/corpus.tok.zh & .mn to ./../data/corpus.tok.clean, cutoff 1-80, ratio 9
Input sentences: 22319
Output sentences: 22186
Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
Loading model cost 1.535 seconds.
Prefix dict has been built succesfully.
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.190 seconds.
Prefix dict has been built succesfully.
no pair has frequency >= 2. Stopping
Processing ./../data/corpus.bpe.zh
Done
Processing ./../data/corpus.bpe.mn
Done

The Mongolian text has too many @@ symbols, like in the attached screenshot: the left column is the original text, the middle column is the tokenized text, and the right column (orig_tok_out) is the result. Do you think it looks like the text wasn't tokenized? Please help me. Thank you very much.

rsennrich commented 5 years ago

The preprocessing scripts in http://data.statmt.org/wmt17_systems/training/ define a minimum frequency threshold for subword units: only units with a frequency > 50 are allowed, and the rest are segmented into smaller units. With a training corpus of only 25k sentences, it is possible that very few units meet this criterion.
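
As an illustration only (the threshold value and the file names below are assumptions, not a recommendation from this thread), the same apply_bpe.py flags used earlier in this issue could be re-run with a lower vocabulary threshold for such a small corpus:

# Re-apply BPE with a lower frequency threshold, e.g. 5 instead of 50,
# so that more subword units from the small 25k-sentence corpus are kept intact
# (vocab.mn, bpe_codes and corpus.tok.mn are placeholder file names)
python apply_bpe.py --vocabulary data/vocab.mn --vocabulary-threshold 5 -c data/bpe_codes < data/corpus.tok.mn > data/corpus.bpe.mn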

There might also be other problems though, like an inconsistent use of Cyrillic and Latin characters.
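
One quick way to check for that kind of inconsistency (a sketch only, assuming GNU grep with PCRE support and a placeholder file name corpus.tok.mn) is to count how many lines of the Mongolian corpus mix both scripts:

# Count lines of the tokenized Mongolian corpus that contain both Cyrillic and Latin letters;
# a large count may indicate inconsistent script usage in the training data
grep -P '\p{Cyrillic}' data/corpus.tok.mn | grep -c '[A-Za-z]'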