nyu-dl / dl4mt-tutorial

BSD 3-Clause "New" or "Revised" License
How can I run the language model? #46

Closed by jifan-chen 8 years ago

jifan-chen commented 8 years ago

Hi, I think it is just a simple question.

I'm new to dl4mt and wonder how I can run the neural language model in session0, since I can't find the code that downloads the wiki data it needs.

Thanks.

jli05 commented 8 years ago

I had to download a wiki dump, extract the text, and tokenise it myself. It'd be great if someone could put the data files online.

  1. Download the wiki dump: go to https://dumps.wikimedia.org/enwiki/20160305/ or https://dumps.wikimedia.org/simplewiki/20160305/ and download either the first file listed or the file named xxxx-abstract.xml.
  2. The page dump file and the abstracts dump file use different XML formats. I wrote https://gist.github.com/jli05/99741bd4ba6844acc627 and https://gist.github.com/jli05/5f18e6f29174e7f1d8a5 to extract the text from them.
  3. Tokenise the extracted text. See data/preprocess.sh and data/tokenize_all.sh for how tokenizer.perl is used.
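
For readers who cannot open the gists, step 2 for the abstracts dump might look roughly like the sketch below. It is not the code from the gists. It assumes each article in the abstracts file is a `<doc>` element with an `<abstract>` child, and the function name `iter_abstracts` is made up for illustration. The page dump uses a different layout and would need its own extractor.

```python
import io
import xml.etree.ElementTree as ET


def iter_abstracts(source):
    """Yield the stripped text of every non-empty <abstract> element.

    `source` is a filename or a binary file object.
    Assumption: each article is a <doc> element with an <abstract> child.
    """
    # iterparse streams the file, so the whole dump is never held in memory.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "doc":
            text = (elem.findtext("abstract") or "").strip()
            if text:
                yield text
            elem.clear()  # drop the parsed children of this <doc>


if __name__ == "__main__":
    # Tiny in-memory stand-in for a real xxxx-abstract.xml file.
    sample = (
        b"<feed>"
        b"<doc><title>A</title><abstract>First abstract.</abstract></doc>"
        b"<doc><title>B</title><abstract></abstract></doc>"
        b"</feed>"
    )
    for line in iter_abstracts(io.BytesIO(sample)):
        print(line)
```

Writing one abstract per line to a text file gives input in the shape that step 3 expects to tokenise.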

jifan-chen commented 8 years ago

Thanks a lot for the reply. I'll give it a try.