Closed jhoelzl closed 8 years ago
I have now removed some code in data.py and was able to call the function create_dev_test_train_split_and_vocabulary(path, True, TRAIN_FILE, DEV_FILE, TEST_FILE). It prints:
26.32% UNK-s in data/train
25.53% UNK-s in data/dev
25.53% UNK-s in data/test
My content is:
data/english.train.txt:
to be ,COMMA or not to be ,COMMA that is the question .PERIOD hello ,COMMA how are you .PERIOD hello ,COMMA how are you today ?QUESTIONMARK hello ,COMMA can i help you ?QUESTIONMARK i am fine ,COMMA thanks .PERIOD what about you ?QUESTIONMARK can i help you ?QUESTIONMARK how can i help you ?QUESTIONMARK do you need help ?QUESTIONMARK where are you ?QUESTIONMARK can you help me ?QUESTIONMARK how was your day ?QUESTIONMARK i like you .PERIOD and what about you ?QUESTIONMARK what is your name ?QUESTIONMARK okay ,COMMA i have to go .PERIOD i have to go now .PERIOD
data/english.dev.txt:
to be ,COMMA or not to be ,COMMA that is the question .PERIOD hello ,COMMA how are you .PERIOD hello ,COMMA how are you today ?QUESTIONMARK hello ,COMMA can i help you ?QUESTIONMARK i am fine ,COMMA thanks .PERIOD what about you ?QUESTIONMARK can i help you ?QUESTIONMARK how can i help you ?QUESTIONMARK do you need help ?QUESTIONMARK where are you ?QUESTIONMARK
data/english.test.txt:
to be or not to be that is the question hello how are you hello how are you today hello can i help you i am fine thanks what about you can i help you how can i help you do you need help where are you
I suppose the content or structure is not correct, because when I run python main.py english 256 0.02 I always get this error:
256 0.02 Model_english_h256_lr0.02.pcl
Building model...
Number of parameters is 2040580
WARNING (theano.tensor.blas): We did not found a dynamic library into the library_dir of the library we use for blas. If you use ATLAS, make sure to compile it with dynamics library.
Training...
Total number of training labels: 0
Total number of validation labels: 0
Traceback (most recent call last):
File "main.py", line 175, in
ppl = np.exp(total_neg_log_likelihood / total_num_output_samples)
ZeroDivisionError: division by zero
DATA_PATH = "../data" in data.py is the location of the final converted dataset (in the form of pickled arrays). If this directory already exists, the conversion is skipped (so after a failed conversion, the directory should be removed).
So these files:
...should be somewhere else (e.g. in ../raw_data). Then python data.py ../raw_data should create the ../data directory with the converted files. Also, if you want to experiment with such a small dataset, then you might want to reduce the minibatch size. Otherwise, when the number of samples is less than the minibatch size, the script will create empty files (this is a bug that should not be a problem in a realistic setting).
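A minimal sketch of the "remove the directory after a failed conversion" step described above. The helper name reset_converted_data is hypothetical and not part of the repository; it only deletes the output directory so that a later python data.py ../raw_data run does not skip the conversion.

```python
import os
import shutil


def reset_converted_data(data_path):
    """Delete a stale converted-data directory.

    Hypothetical helper, not part of the repository. Returns True if a
    directory was removed and False if there was nothing to remove.
    """
    if os.path.isdir(data_path):
        shutil.rmtree(data_path)
        return True
    return False
```

For the layout described in this thread it would be called as reset_converted_data("../data") before re-running the conversion. Note that it deletes everything inside that directory, so the raw txt files should already have been moved elsewhere.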
Thanks for the clarification about the data folders!
I tried minibatch sizes of 2, 4, 8, 32 and 64, but I still get the error. The problem is:
Total number of training labels: 0
Total number of validation labels: 0
So the variable total_num_output_samples always stays 0, and the division in line 175 therefore fails:
ppl = np.exp(total_neg_log_likelihood / total_num_output_samples)
ZeroDivisionError: division by zero
Do I need more training, dev or test samples?
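For illustration only, here is a sketch of how the failing computation could report the real cause instead of a bare ZeroDivisionError. The function perplexity is hypothetical and is not the code in main.py; it uses math.exp in place of np.exp to stay dependency-free.

```python
import math


def perplexity(total_neg_log_likelihood, total_num_output_samples):
    """Perplexity as exp(mean negative log-likelihood).

    Hypothetical helper: raises a descriptive error when no samples
    were seen, which is the situation reported in this thread.
    """
    if total_num_output_samples == 0:
        raise ValueError(
            "No output samples were processed; the converted dataset "
            "probably contains empty batches."
        )
    return math.exp(total_neg_log_likelihood / total_num_output_samples)
```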
Sorry, I forgot there's another factor that plays a role. To get non-empty batches, all sources (train, dev and test) should contain at least MAX_SEQUENCE_LEN * MINIBATCH_SIZE words. MAX_SEQUENCE_LEN is defined in data.py and is 200 by default.
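The size requirement above can be checked before running the conversion. This is a rough sketch under stated assumptions: the function names are hypothetical, the minibatch size is passed in by the caller (its default in the repository is not given in this thread), and "words" are counted as whitespace-separated tokens, which may differ from how data.py counts them (for example, punctuation tokens such as ,COMMA are counted here).

```python
MAX_SEQUENCE_LEN = 200  # default value stated in the thread


def min_words_required(minibatch_size, max_sequence_len=MAX_SEQUENCE_LEN):
    """Minimum number of words each source (train, dev, test) needs."""
    return max_sequence_len * minibatch_size


def count_tokens(path):
    """Count whitespace-separated tokens in a text file."""
    with open(path, encoding="utf-8") as f:
        return sum(len(line.split()) for line in f)


def has_enough_words(path, minibatch_size, max_sequence_len=MAX_SEQUENCE_LEN):
    """True if the file meets the MAX_SEQUENCE_LEN * MINIBATCH_SIZE minimum."""
    required = min_words_required(minibatch_size, max_sequence_len)
    return count_tokens(path) >= required
```

With the default MAX_SEQUENCE_LEN of 200, even a minibatch size of 2 needs at least 400 words per file, which is far more than the small example files posted above contain.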
Okay, I got it, thanks!
Hello,
I downloaded the repository and created a new folder data inside it. Here I put my files:
In the documentation it is written:
When I run python data.py data I always get "Data already exists", and without the argument I get "The path to stage1 source data directory with txt files is missing". Can you please tell me what the preferred file and folder structure for the data files is? Thanks!