stefan-it / nmt-en-vi

Neural Machine Translation system for English to Vietnamese (IWSLT'15 English-Vietnamese data)

Help for the corresponding vocab #1

Closed fotwo closed 6 years ago

fotwo commented 6 years ago

Hi, thanks for sharing this dataset. Could you give me the vocab, or point me to another way to get it? The website 'https://nlp.stanford.edu/projects/nmt/' does not seem to be accessible. Thanks a lot!

stefan-it commented 6 years ago

Hi,

the vocab for both English and Vietnamese can be found using archive.org:

https://web.archive.org/web/20170201080953/http://nlp.stanford.edu:80/projects/nmt/data/iwslt15.en-vi/vocab.vi

and

https://web.archive.org/web/20170103233416/http://nlp.stanford.edu/projects/nmt/data/iwslt15.en-vi/vocab.en

Just skip the JavaScript code at the beginning; the whole vocabulary is shown after it :)

To create such a vocabulary manually, you could use the tokenized data to build a frequency list and then add the tokens <unk>, <s> and </s>.
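
A minimal sketch of that manual approach, assuming the training data is already tokenized with one sentence per line and tokens separated by whitespace. The function names `build_vocab` and `write_vocab`, the `min_count` cutoff, and the choice to put the special tokens first are assumptions for illustration, not necessarily how the original Stanford vocab files were produced:

```python
from collections import Counter

# Special tokens mentioned above; listing them first is an assumption.
SPECIAL_TOKENS = ["<unk>", "<s>", "</s>"]


def build_vocab(lines, min_count=1):
    """Build a frequency-sorted vocabulary from tokenized sentences.

    `lines` is any iterable of strings (for example an open text file).
    Tokens seen fewer than `min_count` times are dropped.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.split())

    # Most frequent first; ties are broken alphabetically so the
    # result is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

    vocab = list(SPECIAL_TOKENS)
    vocab.extend(
        tok for tok, count in ordered
        if count >= min_count and tok not in SPECIAL_TOKENS
    )
    return vocab


def write_vocab(vocab, path):
    """Write one token per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as f:
        for tok in vocab:
            f.write(tok + "\n")


if __name__ == "__main__":
    # Tiny in-memory example; in practice, pass an open file handle
    # for the tokenized training data instead.
    sample = ["a b a", "b a c"]
    print(build_vocab(sample))
```

For real data you would pass the opened tokenized training file to `build_vocab` and save the result with `write_vocab`, once for English and once for Vietnamese.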

I hope this helps you :)

fotwo commented 6 years ago

Hi, thanks a lot for the help, it works.

stefan-it commented 6 years ago

The Stanford site is working again, so I'm closing here :)