tensorflow / nmt

TensorFlow Neural Machine Translation Tutorial
Apache License 2.0

While creating the pre process data getting warnings that resulted in special characters in the vocab file #316

Open frajos100 opened 6 years ago

frajos100 commented 6 years ago

While creating the preprocessed data, I got warnings, and special characters ended up in the vocab file. They had to be removed before training of the French-to-English model could start.

Special characters: 气 过 解 遇 賃@@ 貸@@ 庸 尙@@ 書 於 抵@@

The warnings were:

Unicode non-character U+FDD3 is not recommended for open interchange in print at tmp/wmt16fr_new/mosesdecoder/scripts/tokenizer/tokenizer.perl line 138, line 4016000.
Unicode non-character U+FDD3 is not recommended for open interchange in print at tmp/wmt16fr_new/mosesdecoder/scripts/training/clean-corpus-n.perl line 140, line 4502089.
Unicode non-character U+FDD3 is not recommended for open interchange in print at tmp/wmt16fr_new/mosesdecoder/scripts/training/clean-corpus-n.perl line 140, line 4502102.
Unicode non-character U+FDD3 is not recommended for open interchange in print at tmp/wmt16fr_new/mosesdecoder/scripts/training/clean-corpus-n.perl line 140, line 4502103.
Unicode non-character U+FDD3 is not recommended for open interchange in print at tmp/wmt16fr_new/mosesdecoder/scripts/training/clean-corpus-n.perl line 140, line 4502106.
..........(4600000)..........(4700000)..........(4800000)..
Unicode non-character U+FDD3 is not recommended for open interchange in print at tmp/wmt16fr_new/mosesdecoder/scripts/training/clean-corpus-n.perl line 140, line 4823209.
Unicode non-character U+FDD3 is not recommended for open interchange in print at tmp/wmt16fr_new/mosesdecoder/scripts/training/clean-corpus-n.perl line 140, line 4823237.
Unicode non-character U+FDD3 is not recommended for open interchange in print at tmp/wmt16fr_new/mosesdecoder/scripts/training/clean-corpus-n.perl line 140, line 4823242.
........(4900000)..
Unicode non-character U+FDD3 is not recommended for open interchange in print at tmp/wmt16fr_new/mosesdecoder/scripts/training/clean-corpus-n.perl line 140, line 4922282.
Unicode non-character U+FDD3 is not recommended for open interchange in print at tmp/wmt16fr_new/mosesdecoder/scripts/training/clean-corpus-n.perl line 140, line 4922282.
Unicode non-character U+FDD3 is not recommended for open interchange in print at tmp/wmt16fr_new/mosesdecoder/scripts/training/clean-corpus-n.perl line 140, line 4922282.
Unicode non-character U+FDD3 is not recommended for open interchange in print at tmp/wmt16fr_new/mosesdecoder/scripts/training/clean-corpus-n.perl line 140, line 4922282.
........(5000000).
Unicode non-character U+FDD3 is not recommended for open interchange in print at tmp/wmt16fr_new/mosesdecoder/scripts/training/clean-corpus-n.perl line 140, line 5017824.
Unicode non-character U+FDD3 is not recommended for open interchange in print at tmp/wmt16fr_new/mosesdecoder/scripts/training/clean-corpus-n.perl line 140, line 5017839.
...
Unicode non-character U+FDD3 is not recommended for open interchange in print at tmp/wmt16fr_new/mosesdecoder/scripts/training/clean-corpus-n.perl line 140, line 5049962.
Unicode non-character U+FDD3 is not recommended for open interchange in print at tmp/wmt16fr_new/mosesdecoder/scripts/training/clean-corpus-n.perl line 140, line 5049968.
....
tmp/wmt16fr_new/train.tok.en is too short! at tmp/wmt16fr_new/mosesdecoder/scripts/training/clean-corpus-n.perl line 96, line 5088000.
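The manual vocab cleanup described above could probably be automated. What follows is only a rough sketch, not something taken from the nmt scripts: it assumes the vocab is a list of tokens (one per line in the file) and that, for a French/English model, any token containing a letter from a non-Latin script is unwanted. The helper names is_latin_token and filter_vocab are made up for illustration. Note that this would also drop Greek or Cyrillic tokens, and it does nothing about the same characters remaining in the training data.

```python
import unicodedata


def is_latin_token(token):
    """Return True if every alphabetic character in the token is a Latin letter.

    Digits, punctuation, and markers such as the BPE suffix "@@" are ignored,
    so tokens like "l'@@" or "<unk>" pass. Assumption: non-Latin letters are
    never wanted in a French/English vocab.
    """
    for ch in token:
        # unicodedata.name returns "" here for characters without a name.
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN"):
            return False
    return True


def filter_vocab(tokens):
    """Keep only the tokens that pass is_latin_token, preserving order."""
    return [t for t in tokens if is_latin_token(t)]
```

For example, filter_vocab(["été", "賃@@", "the", "气"]) keeps only "été" and "the".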

vikrant97 commented 6 years ago

I have the same issue. Can anyone please help? How can such non-UTF-8 characters be removed from the input file automatically?
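One possible approach, offered as a sketch rather than a confirmed fix: U+FDD3 in the warnings is a Unicode noncharacter (the noncharacters are U+FDD0 to U+FDEF plus the last two code points of each plane), so the corpus could be filtered before running the Moses scripts. The function names strip_noncharacters and clean_file below are hypothetical, and the file paths are whatever your setup uses. Characters are removed inside each line and no line is dropped, so the source and target files keep the same line count.

```python
import re

# Character class covering all 66 Unicode noncharacters:
# U+FDD0..U+FDEF, plus U+xFFFE and U+xFFFF for each of the 17 planes.
_NONCHAR_RE = re.compile(
    "[\ufdd0-\ufdef"
    + "".join(
        chr(plane * 0x10000 + 0xFFFE) + chr(plane * 0x10000 + 0xFFFF)
        for plane in range(17)
    )
    + "]"
)


def strip_noncharacters(text):
    """Remove Unicode noncharacters from text; everything else is untouched."""
    return _NONCHAR_RE.sub("", text)


def clean_file(src_path, dst_path):
    """Copy src_path to dst_path line by line with noncharacters removed.

    errors="replace" turns bytes that are not valid UTF-8 into U+FFFD
    instead of raising, which is an assumption about what you want.
    """
    with open(src_path, encoding="utf-8", errors="replace") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(strip_noncharacters(line))
```

This only addresses the U+FDD3 warnings. Valid characters such as 气 or 書 are left alone, so getting them out of the vocab is a separate step.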