pascanur / GroundHog

Library for implementing RNNs with Theano
BSD 3-Clause "New" or "Revised" License

No such file or directory : pentree_char_and_word.npz #1

Closed marcoippolito closed 10 years ago

marcoippolito commented 10 years ago

I would like to use your GroundHog library to implement a sentence segmentation task.

In order to understand how the GroundHog library works, I tried to run DT_RNN_Tut.py.

But it says:

time python DT_RNN_Tut.py
Traceback (most recent call last):
  File "DT_RNN_Tut.py", line 431, in <module>
    jobman(state, None)
  File "DT_RNN_Tut.py", line 114, in jobman
    train_data, valid_data, test_data = get_text_data(state)
  File "DT_RNN_Tut.py", line 71, in get_text_data
    can_fit=True)
  File "/home/ubuntu/ggc/prove/DRNN/GroundHog-master/groundhog/datasets/LM_dataset.py", line 97, in __init__
    self.load_files()
  File "/home/ubuntu/ggc/prove/DRNN/GroundHog-master/groundhog/datasets/LM_dataset.py", line 105, in load_files
    penn_data = numpy.load(self.path, mmap_mode=mmap_mode)
  File "/usr/local/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 370, in load
    fid = open(file, "rb")
IOError: [Errno 2] No such file or directory: '/data/lisa/data/PennTreebankCorpus/pentree_char_and_word.npz'

What is pentree_char_and_word.npz? And how do I make DT_RNN_Tut.py work?

Looking forward to receiving your kind, helpful hints. Kind regards, Marco

tomsbergmanis commented 10 years ago

"What is pentree_char_and_word.npz?"

That's the training file you need to supply. It is generated by a script similar to this one: https://www.dropbox.com/s/kiewfm3s9mfh4u3/generate.py?dl=0

kyunghyuncho commented 10 years ago

Thanks for the script!

@tomsbergmanis If it's okay with you, can we (or you) add that script to tutorials/?

marcoippolito commented 10 years ago

Thanks for the script from me as well.

time python generate.py
Constructing the vocabulary ..
Traceback (most recent call last):
  File "generate.py", line 198, in <module>
    main(get_parser())
  File "generate.py", line 75, in main
    vocab, freqs, freq_wd = construct_vocabulary(dataset, o.oov_rate, o.level)
  File "generate.py", line 21, in construct_vocabulary
    fd = open(filename, 'rt')
IOError: [Errno 2] No such file or directory: 'path to file/train'

tomsbergmanis commented 10 years ago

Sure. I think it was originally given to me by you or someone else from your group.

tomsbergmanis commented 10 years ago

For this script you also need to supply your own training data. Look at the code: the filename variable is filled with a dummy value.

marcoippolito commented 10 years ago

Sorry, maybe because I'm a bit tired, I didn't get the whole thing. Do I have to download the training data from here? http://mattmahoney.net/dc/textdata.html

That's because I read at the end of generate.py:

def get_parser():
    usage = """
    This script parses the wikipedia dataset from http://mattmahoney.net/dc/text.html,
    and generates more numpy friendly format of the dataset. Please use this friendly
    formats as temporary forms of the dataset (i.e. delete them after you're done).
    """

kyunghyuncho commented 10 years ago

Right. The script assumes that you have three text files (train, valid and test). Given those files, this script will generate an npz file that can be used by groundhog/datasets/LM_dataset.py.
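In other words, a rough sketch of the expected inputs and output (the file names here are placeholders, and the key names are the ones passed to numpy.savez in the script, quoted further down this thread):

import numpy

# Input generate.py expects: three plain-text files, e.g.
#   /path/to/penn/train, /path/to/penn/valid, /path/to/penn/test
#
# Output: an .npz archive that groundhog/datasets/LM_dataset.py reads with numpy.load.
data = numpy.load("data_words.npz")  # the name depends on the --dest argument
print data.files  # should list keys such as train_words, valid_words, test_words,
                  # vocabulary, n_words, oov, freqs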

marcoippolito commented 10 years ago

time python generate.py
Constructing the vocabulary ..
 .. sorting words
 .. shrinking the vocabulary size
Traceback (most recent call last):
  File "generate.py", line 199, in <module>
    main(get_parser())
  File "generate.py", line 78, in main
    oov_default = vocab[""]
KeyError: ''

real    0m15.787s
user    0m13.737s
sys     0m2.048s

Looking forward to your helpful hints. Marco

tomsbergmanis commented 10 years ago

Comment out this line:

oov_default = vocab[""]

and un-comment these:

if o.oov == '-1':
    oov_default = -1
else:
    oov_default = len(vocab)
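Put together, the relevant region of generate.py then looks roughly like this (a sketch; surrounding code omitted):

# oov_default = vocab[""]          # commented out, as suggested above
if o.oov == '-1':
    oov_default = -1
else:
    oov_default = len(vocab)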

Also, try to read the code and understand it; then you won't have that many questions. The GroundHog code will be more difficult to comprehend.

marcoippolito commented 10 years ago

Sorry for asking you to help me debug once again, but I have reached a dead end, which prevents me from using your library (a true pity for all of us... isn't it?).

Here is the error message I got:

time python generate.py
Constructing the vocabulary ..
 .. sorting words
 .. shrinking the vocabulary size
EOL 0
Constructing train set
o.n_chains= 1
Constructing valid set
Constructing test set
Saving data
Killed

real    0m49.369s
user    0m36.546s
sys     0m8.769s

These are the lines of generate.py that could be linked to the problem:

print 'Saving data'

numpy.savez(o.dest,
            train_words=train,
            valid_words=valid,
            test_words=test,
            oov=oov_default,
            freqs = numpy.array(freqs),
            n_words=len(vocab),
            n_chars=0,  # I ran generate.py also after commenting this line, but the saving is still killed
            vocabulary = vocab,
            freq_wd = freq_wd
           )
inv_map = {v:k for k, v in vocab.items()}

numpy.savez(o.dest+"_dict", unique_words=inv_map)
print '... Done'

A file, tmp_data.npz, is produced. When running DT_RNN_Tut.py, the resulting error message is:

Traceback (most recent call last):
  File "DT_RNN_Tut.py", line 431, in <module>
    jobman(state, None)
  File "DT_RNN_Tut.py", line 114, in jobman
    train_data, valid_data, test_data = get_text_data(state)
  File "DT_RNN_Tut.py", line 71, in get_text_data
    can_fit=True)
  File "/home/ubuntu/ggc/prove/DRNN/GroundHog-master/groundhog/datasets/LM_dataset.py", line 102, in __init__
    self.load_files()
  File "/home/ubuntu/ggc/prove/DRNN/GroundHog-master/groundhog/datasets/LM_dataset.py", line 112, in load_files
    penn_data = numpy.load("tmp_data.npz", mmap_mode=mmap_mode)
  File "/usr/local/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 388, in load
    return NpzFile(fid, own_fid=tmp)
  File "/usr/local/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 192, in __init__
    _zip = zipfile_factory(fid)
  File "/usr/local/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 131, in zipfile_factory
    return zipfile.ZipFile(*args, **kwargs)
  File "/usr/lib/python2.7/zipfile.py", line 770, in __init__
    self._RealGetContents()
  File "/usr/lib/python2.7/zipfile.py", line 811, in _RealGetContents
    raise BadZipfile, "File is not a zip file"
zipfile.BadZipfile: File is not a zip file
Exception AttributeError: "'NpzFile' object has no attribute 'zip'" in <bound method NpzFile.__del__ of <numpy.lib.npyio.NpzFile object at 0x7fa7c0602650>> ignored
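Since .npz files are just zip archives, one quick way to check whether tmp_data.npz was written completely before the process was killed is something like this (a rough check, not part of the tutorial code):

import zipfile

# An .npz file is an ordinary zip archive; a numpy.savez that was killed
# part-way through usually leaves a file that fails this test, which would
# explain the BadZipfile error above.
print zipfile.is_zipfile("tmp_data.npz")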

Looking forward to your kind hints. Kind regards. Marco

kyunghyuncho commented 10 years ago

I have pushed generate.py to the LISA fork of GroundHog (https://github.com/lisa-groundhog/GroundHog). See the tutorials directory there.

Effectively, you can generate a data file, assuming that you have plain text files {path}/train, {path}/valid and {path}/test, by

python generate.py --dest=data_chars --level=chars --oov-rate=5 --dtype=int64 {path}
python generate.py --dest=data_words --level=words --oov-rate=5 --dtype=int64 {path}

Obviously, afterward, you need to fix state['path'] and state['dictionary'] accordingly.
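For example, if generate.py was run with --dest=data_words as above, the corresponding settings would presumably look something like this (the exact file names are a guess, based on the numpy.savez calls quoted earlier in the thread):

# hypothetical values in DT_RNN_Tut.py; point these at wherever generate.py wrote its output
state['path'] = '/path/to/data_words.npz'
state['dictionary'] = '/path/to/data_words_dict.npz'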