Closed marcoippolito closed 10 years ago
What is pentree_char_and_word.npz?
That's the training file you need to supply. It is generated by a script similar to this: https://www.dropbox.com/s/kiewfm3s9mfh4u3/generate.py?dl=0

On 10 September 2014 17:27, Marco Ippolito notifications@github.com wrote:

I would like to use your good GroundHog to implement a sentence segmentation task.
In order to understand how the GroundHog library works, I tried to run DT_RNN_Tut.py.
But it says:

time python DT_RNN_Tut.py
Traceback (most recent call last):
  File "DT_RNN_Tut.py", line 431, in <module>
    jobman(state, None)
  File "DT_RNN_Tut.py", line 114, in jobman
    train_data, valid_data, test_data = get_text_data(state)
  File "DT_RNN_Tut.py", line 71, in get_text_data
    can_fit=True)
  File "/home/ubuntu/ggc/prove/DRNN/GroundHog-master/groundhog/datasets/LM_dataset.py", line 97, in __init__
    self.load_files()
  File "/home/ubuntu/ggc/prove/DRNN/GroundHog-master/groundhog/datasets/LM_dataset.py", line 105, in load_files
    penn_data = numpy.load(self.path, mmap_mode=mmap_mode)
  File "/usr/local/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 370, in load
    fid = open(file, "rb")
IOError: [Errno 2] No such file or directory: '/data/lisa/data/PennTreebankCorpus/pentree_char_and_word.npz'

What is pentree_char_and_word.npz? And how to make DT_RNN_Tut.py work?
Looking forward to receiving your kind, helpful hints. Kind regards. Marco
— Reply to this email directly or view it on GitHub https://github.com/pascanur/GroundHog/issues/1.
Thanks for the script!
@tomsbergmanis If it's okay with you, can we (or you) add that script to tutorials/?
Thanks for the script from me as well.
time python generate.py
Constructing the vocabulary ..
Traceback (most recent call last):
File "generate.py", line 198, in
Sure. I think it was given by you or someone else from your group.
For this script you also need to supply your own training data. Look at the code: the filename variable is filled with a dummy value.
Sorry, maybe because I'm a bit tired I didn't get the whole thing. Do I have to download the training data from here? http://mattmahoney.net/dc/textdata.html
That's because I read at the end of generate.py, in get_parser(): "This script parses the wikipedia dataset from http://mattmahoney.net/dc/text.html, and generates more numpy friendly format of the dataset. Please use this friendly formats as temporary forms of the dataset (i.e. delete them after you're done)."
Right. The script assumes that you have three text files (train, valid and test). Given those files, the script will generate an npz file that can be used by groundhog/datasets/LM_dataset.py.
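For intuition, here is a minimal sketch of what such a generation script does: build a vocabulary from the training text, map every token to an integer id (with an out-of-vocabulary fallback), and save the arrays with numpy.savez. All names below (build_vocab, encode, max_size) are illustrative, not the actual generate.py:

```python
import numpy as np

def build_vocab(tokens, max_size=10000):
    """Count token frequencies and keep the most frequent ones."""
    freqs = {}
    for t in tokens:
        freqs[t] = freqs.get(t, 0) + 1
    # sort words by descending frequency, then shrink the vocabulary
    ranked = sorted(freqs, key=freqs.get, reverse=True)[:max_size]
    return {w: i for i, w in enumerate(ranked)}

def encode(tokens, vocab, oov_default):
    """Map tokens to integer ids, falling back to the OOV id."""
    return np.array([vocab.get(t, oov_default) for t in tokens], dtype=np.int64)

train_text = "the cat sat on the mat".split()
vocab = build_vocab(train_text)
oov_default = len(vocab)  # one id past the known words
train = encode(train_text + ["dog"], vocab, oov_default)  # "dog" is unseen
np.savez("tmp_data.npz", train_words=train, n_words=len(vocab), oov=oov_default)
```

The real script does the same for the valid and test splits and stores a few extra keys (frequencies, the vocabulary itself), but the shape of the pipeline is the same.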
time python generate.py
Constructing the vocabulary ..
.. sorting words
.. shrinking the vocabulary size
Traceback (most recent call last):
  File "generate.py", line 199, in <module>
    main(get_parser())
  File "generate.py", line 78, in main
    oov_default = vocab[""]
KeyError: ''

real 0m15.787s user 0m13.737s sys 0m2.048s
Looking forward to your helpful hints. Marco
Comment out this line:

oov_default = vocab["

and uncomment these:

if o.oov == '-1':
    oov_default = -1
else:
    oov_default = len(vocab)
Also, try to read the code and understand it, as then you won't have that many questions. The GroundHog code itself will be more difficult to comprehend.
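The intent of that uncommented block, in isolation, is roughly the following (pick_oov_default is a made-up wrapper for illustration; only the if/else body comes from the thread, and 'words' is a stand-in for any non-'-1' option value):

```python
# Sketch of the OOV fallback being discussed: if the option is '-1',
# out-of-vocabulary words map to -1; otherwise they get the id just
# past the known vocabulary.
def pick_oov_default(oov_option, vocab):
    if oov_option == '-1':
        return -1
    return len(vocab)

vocab = {"the": 0, "cat": 1}
oov_neg = pick_oov_default('-1', vocab)    # -1
oov_end = pick_oov_default('words', vocab) # 2, i.e. len(vocab)
ids = [vocab.get(w, oov_end) for w in ["the", "dog"]]  # [0, 2]
```

Either way, the point of the fix is that the OOV id no longer depends on looking up a special token in the vocabulary, which is what raised the KeyError.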
Sorry for asking you to help me debug once again, but I have come to a dead point, which prevents me from using your library (a true pity for all of us... isn't it?).
Here is the error message I got:
time python generate.py
Constructing the vocabulary ..
.. sorting words
.. shrinking the vocabulary size
EOL 0
Constructing train set
o.n_chains= 1
Constructing valid set
Constructing test set
Saving data
Killed

real 0m49.369s user 0m36.546s sys 0m8.769s
These are the lines of generate.py that could be linked to the problem:
print 'Saving data'
numpy.savez(o.dest,
            train_words=train,
            valid_words=valid,
            test_words=test,
            oov=oov_default,
            freqs=numpy.array(freqs),
            n_words=len(vocab),
            n_chars=0,  # I also ran generate.py after commenting out this line, but the saving is still killed
            vocabulary=vocab,
            freq_wd=freq_wd
            )
inv_map = {v: k for k, v in vocab.items()}
numpy.savez(o.dest + "_dict", unique_words=inv_map)
print '... Done'
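One way to sanity-check an archive written like this is to load it back and inspect the stored keys (a sketch with tiny illustrative data, not the real Penn Treebank arrays):

```python
import numpy as np

# Write a tiny archive with a subset of the keys used above (illustrative data).
np.savez("tmp_data.npz",
         train_words=np.array([0, 1, 2], dtype=np.int64),
         n_words=3,
         oov=3)

data = np.load("tmp_data.npz")
print(sorted(data.files))    # keys stored in the archive
print(data["train_words"])   # the encoded training stream
```

If the save was killed partway through, the resulting file will typically fail to load or be missing keys, which is worth checking before feeding it to LM_dataset.py.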
A file tmp_data.npz is produced. When running DT_RNN_Tut.py, the resulting error message is:
Traceback (most recent call last):
File "DT_RNN_Tut.py", line 431, in
Looking forward to your kind hints. Kind regards. Marco
I have pushed generate.py to the LISA fork of GroundHog (https://github.com/lisa-groundhog/GroundHog). See tutorials directory there.
Effectively, you can generate a data file, assuming that you have plain text files {path}/train, {path}/valid and {path}/test, by:
python generate.py --dest=data_chars --level=chars --oov-rate=5 --dtype=int64 {path}
python generate.py --dest=data_words --level=words --oov-rate=5 --dtype=int64 {path}
Obviously, afterward, you need to fix state['path'] and state['dictionary'] accordingly.
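For example (the state keys are the ones named above; the file names assume --dest=data_words as in the commands, and the _dict suffix follows from the numpy.savez(o.dest + "_dict", ...) call in generate.py, since numpy appends .npz):

```python
# Sketch: point the tutorial's state at the generated files.
state = {}
state['path'] = "data_words.npz"             # main data archive
state['dictionary'] = "data_words_dict.npz"  # inverse vocabulary map
```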