ottokart / punctuator2

A bidirectional recurrent neural network model with attention mechanism for restoring missing punctuation in unsegmented text
http://bark.phon.ioc.ee/punctuator
MIT License

MemoryError in data.py #36

Closed anavc94 closed 5 years ago

anavc94 commented 5 years ago

Hello,

first of all, congrats on the project! It has been really useful.

My problem is that when I try to run data.py on long training text files (about 720.000 KB, for example), a MemoryError appears. However, I was able to train the model with training text of about 200.000 KB. Is there some kind of limit? Should I change some parameters? I've monitored my RAM usage and it's only about 600 MB while the script is running, so I am not sure what the real problem is. I have 16 GB of RAM.

My output is:

```
Traceback (most recent call last):
  File "data.py", line 279, in <module>
    create_dev_test_train_split_and_vocabulary(path, True, TRAIN_FILE, DEV_FILE, TEST_FILE, PRETRAINED_EMBEDDINGS_PATH)
  File "data.py", line 228, in create_dev_test_train_split_and_vocabulary
    for line in text:
  File "E:\Python27\lib\codecs.py", line 699, in next
    return self.reader.next()
  File "E:\Python27\lib\codecs.py", line 630, in next
    line = self.readline()
  File "E:\Python27\lib\codecs.py", line 553, in readline
    line += data
MemoryError
```

Thanks in advance!

ottokart commented 5 years ago

Things have changed in the code since you posted this error. Do you still run into this problem? One reason I can imagine is that your file does not have line breaks (so it consists of a single very long line) and the codecs library reads everything into memory when it tries to read the first line.
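If that hypothesis is right, it's easy to check without loading the file at all: read it in fixed-size binary chunks and count newline bytes. A minimal sketch (the function name and chunk size are illustrative, not part of the project):

```python
def count_newlines(path, chunk_size=1 << 20):
    """Count '\n' bytes by streaming the file in 1 MB chunks,
    so no full line ever needs to fit in memory."""
    newlines = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            newlines += chunk.count(b"\n")
    return newlines
```

If this returns 0 (or a very small number) for a 700 MB file, the input really is one giant line, and `codecs`' `readline` will try to buffer the whole thing, which would explain the MemoryError.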