shiyuzh2007 / ASR

Apache License 2.0

how to do preprocessing for hkust dataset #1

Closed luweishuang closed 5 years ago

luweishuang commented 5 years ago

In your YAML files you reference paths such as "tensor2tensor/hkust_ci_phone/src_data/words_s2s.txt.bpe_5000" and "hkust_ci_phone/src_data/train_dim80/text.sp.bpe_5000". How do I generate this processed data?

shiyuzh2007 commented 5 years ago

The format of 'words_s2s.txt.bpe_5000' is as follows, one token and its count per line:

\<PAD> 10000000000000
\<UNK> 1000000000000
\<S> 100000000000
\</S> 10000000000
的 62025
是 44363
了 23376
嗯 23213
我 22861
啊 22770
呃 22540
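For anyone trying to reproduce a file in this layout, here is a minimal sketch in Python. It is not the repository's actual preprocessing script. It assumes the transcripts are already segmented into space-separated tokens (the BPE step itself is not shown), and it reuses the special-token counts from the sample above. The function names `build_vocab_lines` and `write_vocab` are made up for illustration.

```python
from collections import Counter

# Special tokens with the placeholder counts shown in the sample above.
# The large counts keep them at the top of a frequency-sorted file.
SPECIAL_TOKENS = [
    ("<PAD>", 10**13),
    ("<UNK>", 10**12),
    ("<S>", 10**11),
    ("</S>", 10**10),
]


def build_vocab_lines(sentences):
    """Return 'token count' lines: special tokens first, then corpus
    tokens in descending order of frequency."""
    counts = Counter()
    for sentence in sentences:
        # Assumption: tokens are separated by whitespace.
        counts.update(sentence.split())
    lines = [f"{tok} {cnt}" for tok, cnt in SPECIAL_TOKENS]
    # Ties are broken by token so the output is deterministic.
    for tok, cnt in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        lines.append(f"{tok} {cnt}")
    return lines


def write_vocab(corpus_path, vocab_path):
    """Read a tokenized corpus (one sentence per line) and write the vocab file."""
    with open(corpus_path, encoding="utf-8") as f:
        lines = build_vocab_lines(f)
    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
```

Calling `write_vocab("train.tok.txt", "words.vocab")` (both file names hypothetical) would write one `token count` pair per line, in the same layout as the sample.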