ottokart / punctuator2

A bidirectional recurrent neural network model with attention mechanism for restoring missing punctuation in unsegmented text
http://bark.phon.ioc.ee/punctuator
MIT License

Problems with Initialization #1

Closed jhoelzl closed 8 years ago

jhoelzl commented 8 years ago

Hello,

I downloaded the repository and created a new folder named data inside it. I put my files there:

In the documentation it is written:

the conversion can be initiated with python data.py <data_dir>

When I run python data.py data, I always get "Data already exists". Without the argument, I get "The path to stage1 source data directory with txt files is missing". Can you please tell me the preferred file and folder structure for the data files?

Thanks!

jhoelzl commented 8 years ago

I have now removed some code in data.py so that I could call the function create_dev_test_train_split_and_vocabulary(path, True, TRAIN_FILE, DEV_FILE, TEST_FILE). It prints:

26.32% UNK-s in data/train
25.53% UNK-s in data/dev
25.53% UNK-s in data/test

My content is:

data/english.train.txt:

to be ,COMMA or not to be ,COMMA that is the question .PERIOD hello ,COMMA how are you .PERIOD hello ,COMMA how are you today ?QUESTIONMARK hello ,COMMA can i help you ?QUESTIONMARK i am fine ,COMMA thanks .PERIOD what about you ?QUESTIONMARK can i help you ?QUESTIONMARK how can i help you ?QUESTIONMARK do you need help ?QUESTIONMARK where are you ?QUESTIONMARK can you help me ?QUESTIONMARK how was your day ?QUESTIONMARK i like you .PERIOD and what about you ?QUESTIONMARK what is your name ?QUESTIONMARK okay ,COMMA i have to go .PERIOD i have to go now .PERIOD

data/english.dev.txt:

to be ,COMMA or not to be ,COMMA that is the question .PERIOD hello ,COMMA how are you .PERIOD hello ,COMMA how are you today ?QUESTIONMARK hello ,COMMA can i help you ?QUESTIONMARK i am fine ,COMMA thanks .PERIOD what about you ?QUESTIONMARK can i help you ?QUESTIONMARK how can i help you ?QUESTIONMARK do you need help ?QUESTIONMARK where are you ?QUESTIONMARK

data/english.test.txt:

to be or not to be that is the question hello how are you hello how are you today hello can i help you i am fine thanks what about you can i help you how can i help you do you need help where are you

I suppose the content or structure is not correct, because when I run python main.py english 256 0.02 I always get this error:

256 0.02 Model_english_h256_lr0.02.pcl
Building model...
Number of parameters is 2040580
WARNING (theano.tensor.blas): We did not found a dynamic library into the library_dir of the library we use for blas. If you use ATLAS, make sure to compile it with dynamics library.
Training...
Total number of training labels: 0
Total number of validation labels: 0
Traceback (most recent call last):
  File "main.py", line 175, in <module>
    ppl = np.exp(total_neg_log_likelihood / total_num_output_samples)
ZeroDivisionError: division by zero

ottokart commented 8 years ago

DATA_PATH = "../data" in data.py is the location of the final converted dataset (in the form of pickled arrays). If this directory already exists, the conversion is skipped. So after a failed conversion, the directory should be removed before running the conversion again.

So these files:

...should be somewhere else (e.g. in ../raw_data). Then python data.py ../raw_data should create the ../data directory with the converted files. Also, if you want to experiment with such a small dataset, then you might want to reduce the minibatch size. Otherwise, when the number of samples is less than the minibatch size, the script will create empty files (this is a bug that should not be a problem in a realistic setting).
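The cleanup step described above (removing the converted-data directory after a failed conversion) could be sketched as follows. This is only an illustration: the helper name reset_converted_data is hypothetical and is not part of punctuator2, and the "../data" path is taken from the DATA_PATH value quoted in this thread.

```python
import os
import shutil


def reset_converted_data(data_path):
    """Remove a stale converted-data directory so the conversion is not skipped.

    Hypothetical helper, not part of punctuator2.
    Returns True if a directory was removed, False if none existed.
    """
    if os.path.isdir(data_path):
        shutil.rmtree(data_path)
        return True
    return False


if __name__ == "__main__":
    # "../data" is the DATA_PATH value mentioned in this thread.
    removed = reset_converted_data("../data")
    print("removed stale directory" if removed else "nothing to remove")
```

After this, python data.py ../raw_data can be run again, with the raw text files kept in a directory other than ../data.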


jhoelzl commented 8 years ago

Thanks for clarification with the data folders!

I tried minibatch sizes of 2, 4, 8, 32 and 64, but I still get the error. The problem is:

Total number of training labels: 0
Total number of validation labels: 0

So the variable total_num_output_samples always stays 0, and therefore the division in line 175 results in

ppl = np.exp(total_neg_log_likelihood / total_num_output_samples)
ZeroDivisionError: division by zero

Do I need more training, dev or test samples?

jhoelzl commented 8 years ago

I also tried the raw_data provided in punctuator version 1, but I still get the same error.

In the function get_minibatch, there is something wrong with the dev dataset, because the loop

for subsequence in dataset:

is never entered and therefore total_num_output_samples in line 173 is zero.
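A minimal sketch of how the failing computation could be guarded, so that an empty dataset produces a descriptive error instead of a ZeroDivisionError. The function name perplexity is hypothetical and this is not the code in main.py. It uses math.exp rather than np.exp only to stay dependency-free.

```python
import math


def perplexity(total_neg_log_likelihood, total_num_output_samples):
    """Compute exp(mean negative log-likelihood), failing clearly on empty data.

    Hypothetical helper that mirrors the expression from the traceback.
    """
    if total_num_output_samples == 0:
        # No minibatch was produced, so there is nothing to average over.
        raise ValueError(
            "No output samples were produced; the dataset is probably too "
            "small to fill a single minibatch."
        )
    return math.exp(total_neg_log_likelihood / total_num_output_samples)
```

With a check like this, the message points at the dataset size rather than at the division itself.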

ottokart commented 8 years ago

Sorry, I forgot there's another factor that plays a role. To get non-empty batches, all sources (train, dev and test) should contain at least MAX_SEQUENCE_LEN * MINIBATCH_SIZE words. MAX_SEQUENCE_LEN is defined in data.py and is 200 by default.
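The size requirement above can be checked before training with a rough sketch like the one below. The helper names are hypothetical and not part of punctuator2. The default of 200 is the MAX_SEQUENCE_LEN value stated above, and the punctuation tokens are only the ones visible in the sample data in this thread, so the real conversion may count words differently.

```python
MAX_SEQUENCE_LEN = 200  # default value stated in this thread

# Punctuation tokens seen in the sample data in this thread (assumption:
# the actual vocabulary may contain more).
PUNCTUATION_TOKENS = {",COMMA", ".PERIOD", "?QUESTIONMARK"}


def count_words(text):
    """Count whitespace-separated tokens that are not punctuation tokens."""
    return sum(1 for token in text.split() if token not in PUNCTUATION_TOKENS)


def min_words_required(minibatch_size, max_sequence_len=MAX_SEQUENCE_LEN):
    """Minimum number of words each of train, dev and test should contain."""
    return max_sequence_len * minibatch_size


def has_enough_words(text, minibatch_size, max_sequence_len=MAX_SEQUENCE_LEN):
    """Return True if the text is long enough to fill one minibatch."""
    return count_words(text) >= min_words_required(minibatch_size, max_sequence_len)


if __name__ == "__main__":
    # Even with a minibatch size of 2, each file needs 200 * 2 = 400 words.
    print(min_words_required(2))
```

This would explain why reducing the minibatch size alone did not help: files of a few dozen words stay below the threshold even for the smallest sizes tried.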

jhoelzl commented 8 years ago

Okay, I got it, thanks!