mit-nlp / MITIE

MITIE: library and tools for information extraction

wordrep.exe has stopped working #183

Closed: cmllmrnn closed this issue 6 years ago

cmllmrnn commented 6 years ago

Expected Behavior

Running wordrep should be fine as long as I have enough RAM.

Current Behavior

In my case, the machine has 160GB of total memory (m4.10xlarge AWS instance). I tried training on 4GB, 1GB, and even 339MB of data, but when the wordrep process reaches around 25GB of RAM ("Now do CCA (left size: 50000000, right size: 50000000)." in the console), a dialog box appears saying wordrep.exe has stopped working. It works fine only for my 13MB dataset.

Steps to Reproduce

  1. I trained on 4GB, 1.8GB, 1GB, and 339MB datasets separately, but wordrep crashed on all of them. I didn't modify anything in the build project or in how I ran it; I ran wordrep normally.
  2. I opened Command Prompt as Administrator and ran wordrep normally, but to no avail.
  3. I set wordrep's priority to High in Task Manager so it would be first in line for RAM, but it still crashed.
  4. I edited the properties of wordrep (right click > Properties > Compatibility) and checked "Run this program as an administrator", but that didn't help.
  5. I ran wordrep in compatibility mode for Windows 8 (as the troubleshooter suggested), also set in the Properties dialog, but training still halted on every dataset.

Maybe I am missing something. Thank you in advance for the help.

davisking commented 6 years ago

Post the exact commands you used to compile wordrep.

cmllmrnn commented 6 years ago

I ran the following from the MITIE-master folder:

    cd tools/wordrep
    mkdir build
    cd build
    cmake "Visual Studio 15 2017 Win64" ..
    cmake --build . --config Release

davisking commented 6 years ago

That should work fine. I don't know what is going on. If you can post a small dataset that reproduces the error I'll take a look. Give exact instructions for reproducing the problem.

cmllmrnn commented 6 years ago

For small datasets, wordrep works fine. GitHub only allows attachments smaller than 10MB, and the dataset that worked for me is 13MB, so I cannot post it here and will give a URL instead. This is the next smallest decent French corpus I found online. It is 339MB and it fails: http://www.statmt.org/europarl/v7/fr-en.tgz

  1. Once downloaded, extract the file and keep just the europarl-v7.fr-en.fr file
  2. Open command prompt
    cd C:\Users\cmllmrnn\Documents\MITIE Workspace\MITIE-master\tools\wordrep\build\Release
    wordrep -e "C:\path\to\file\train"
  3. A dialog box opens saying wordrep.exe has stopped working. It appears after the console shows:
    number of raw ASCII files found: 1
    num words: 200000
    saving word counts to top_word_counts.dat
    number of raw ASCII files found: 1
    Sample 50000000 random context vectors
    Now do CCA (left size: 50000000, right size: 50000000).
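
The steps above could also be scripted so that the console output, the last line printed before the crash, and the process exit code all end up in a log that can be attached to the report. This is only a minimal sketch: the helper names `run_and_report` and `format_exit_code` and the log file name are made up for illustration, the corpus path is the same placeholder used in step 2, and it assumes wordrep is on the PATH or run from the Release folder. It does not suppress the crash dialog; the exit code is only available once the process has actually ended.

```python
import subprocess
import sys


def run_and_report(cmd, log_path):
    """Run cmd, copy its console output to log_path, and
    return (exit_code, last_non_empty_console_line)."""
    last_line = ""
    with open(log_path, "w", encoding="utf-8") as log:
        with subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into the same stream
            text=True,
            errors="replace",
        ) as proc:
            for line in proc.stdout:
                log.write(line)
                log.flush()  # keep the log current even if the run dies
                if line.strip():
                    last_line = line.strip()
            exit_code = proc.wait()
    return exit_code, last_line


def format_exit_code(code):
    """Format an exit code as unsigned 32-bit hex, the form in which
    Windows crash codes are usually quoted."""
    return "0x{:08X}".format(code & 0xFFFFFFFF)


if __name__ == "__main__":
    # Placeholder path, as in step 2 above; replace with the real folder.
    cmd = ["wordrep", "-e", r"C:\path\to\file\train"]
    code, last = run_and_report(cmd, "wordrep_console.log")
    print("last console line:", last)
    print("exit code:", format_exit_code(code))
    sys.exit(0 if code == 0 else 1)
```

Posting the last console line together with the hex exit code gives a more precise description of the failure than the dialog box text alone.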

davisking commented 6 years ago

Thanks, I'll take a look.

davisking commented 6 years ago

Try it now. I just pushed a fix and it should all work as expected in Visual Studio now.

cmllmrnn commented 6 years ago

Thanks a lot! Will compile and train and will let you know as soon as possible.

davisking commented 6 years ago

No problem. Thanks for reporting this.

cmllmrnn commented 6 years ago

I was able to train on 4GB of data successfully. It took 3 days. Thank you!