ottokart / punctuator2

A bidirectional recurrent neural network model with attention mechanism for restoring missing punctuation in unsegmented text
http://bark.phon.ioc.ee/punctuator
MIT License

Memory Error #45

Closed anavc94 closed 5 years ago

anavc94 commented 5 years ago

Hello,

I have a dataset split into train, test, and dev sets, consisting of about 24M lines for train, 500k for test, and 500k for dev. The 24M training lines are split across 49 files of about 500k lines each, with a newline '\n' character between phrases.

When trying to execute the script data.py with this dataset, the following error appears for the TRAIN_FILES:

File "data.py", line 285, in <module>
  create_dev_test_train_split_and_vocabulary(path, True, TRAIN_FILE, DEV_FILE, TEST_FILE, PRETRAINED_EMBEDDINGS_PATH)
File "data.py", line 257, in create_dev_test_train_split_and_vocabulary
  write_processed_dataset(train_txt_files, train_output)
File "data.py", line 192, in write_processed_dataset
  data.append(subsequence)
MemoryError

However, when calling the function write_processed_dataset with the test and dev datasets, everything works fine. I have also already trained the punctuator successfully on smaller datasets.

Could you please suggest how to deal with this error?

Thank you!

ottokart commented 5 years ago

Hi!

You could change the data.py script to write each subsequence directly to the output file instead of holding the entire training set in memory - e.g., replace data.append(subsequence) with something like output_f.write("%s\n" % repr(subsequence)), and open output_f at line 122 with with open(output_file, 'w') as output_f:. Then you can remove the data variable and the dump(data, output_file) part.
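A minimal sketch of that streaming approach (the function and helper names here are illustrative, not the exact ones in data.py): each subsequence is written to disk as soon as it is built, so memory use stays constant regardless of the dataset size.

```python
import json

def build_subsequences(lines):
    # Stand-in for the real subsequence construction in data.py:
    # here we just tokenize each line. The point is that it yields
    # items one at a time instead of returning a big list.
    for line in lines:
        yield line.split()

def write_processed_dataset(input_lines, output_file):
    # Stream each subsequence straight to the output file instead of
    # appending to an in-memory list and dumping it at the end.
    with open(output_file, "w") as output_f:
        for subsequence in build_subsequences(input_lines):
            # One subsequence per line; json is easier to parse back
            # later than repr().
            output_f.write(json.dumps(subsequence) + "\n")
```

Using json (or any line-oriented serialization) instead of repr() makes it straightforward to read the file back one line at a time during training.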

But then you would still run into problems when training, because all the data is loaded into memory at that point anyway. You would need to rewrite get_minibatch in main.py to fix that (replace the in-memory shuffling with something that can read random lines from the training file quickly enough).
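One hedged sketch of how such a replacement could work (names are illustrative and not taken from main.py): index the byte offset of every line once, then seek to offsets in a shuffled order each epoch, so no more than one line is ever held in memory.

```python
import random

def index_lines(path):
    # Record the byte offset where each line starts. Binary mode
    # keeps offsets exact regardless of text encoding.
    offsets = []
    with open(path, "rb") as f:
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets

def iter_shuffled_lines(path, offsets, rng=random):
    # Yield lines in a random order by seeking to precomputed
    # offsets, instead of shuffling the whole dataset in memory.
    order = list(range(len(offsets)))
    rng.shuffle(order)
    with open(path, "rb") as f:
        for i in order:
            f.seek(offsets[i])
            yield f.readline().decode("utf-8").rstrip("\n")
```

The offset index costs a few bytes per line, which is far smaller than holding the subsequences themselves; whether random seeks are "quick enough" depends on the disk, as the comment above suggests.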

Much simpler would be to get more RAM :)

anavc94 commented 5 years ago

Thank you so much, @ottokart, that helped me a lot. I did what you suggested to prepare the data with data.py, and no errors appeared during training.

Ana