sherjilozair / char-rnn-tensorflow

Multi-layer Recurrent Neural Networks (LSTM, RNN) for character-level language models in Python using TensorFlow
MIT License

MemoryError #30

Open ghost opened 8 years ago

ghost commented 8 years ago

When training on large files, I get a MemoryError despite having more than enough memory to hold the file:

reading text file
Traceback (most recent call last):
  File "train.py", line 111, in <module>
    main()
  File "train.py", line 48, in main
    train(args)
  File "train.py", line 51, in train
    data_loader = TextLoader(args.data_dir, args.batch_size, args.seq_length)
  File "/home/ren/Projects/char-rnn-tensorflow/utils.py", line 18, in __init__
    self.preprocess(input_file, vocab_file, tensor_file)
  File "/home/ren/Projects/char-rnn-tensorflow/utils.py", line 35, in preprocess
    self.tensor = np.array(list(map(self.vocab.get, data)))
MemoryError
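For context: the failing line builds a full Python list of per-character ints before NumPy ever sees it, so peak memory far exceeds the final array. A minimal sketch of a leaner encoding step (the `encode` helper is illustrative, not part of the repo; it assumes `vocab` is the char-to-index dict that `preprocess` builds):

```python
import numpy as np

def encode(data, vocab):
    # Stream vocab indices straight into a pre-allocated array:
    # np.fromiter never materializes the intermediate Python list,
    # and uint16 (2 bytes/char) covers any character-level vocabulary.
    return np.fromiter((vocab[c] for c in data),
                       dtype=np.uint16, count=len(data))
```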

izqui commented 8 years ago

Happens to me too. The current implementation needs about 20 times as much RAM as the size of the input file: 500MB of input trains fine using something in the neighborhood of 10GB of RAM.
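That multiplier is roughly what a back-of-envelope accounting of `preprocess` predicts, assuming 64-bit CPython, an ASCII corpus, and NumPy's default int64 dtype (all approximate):

```python
n = 500 * 1024**2              # input size in characters (1 byte each on disk)
text   = 1 * n                 # the decoded str held in memory (compact ASCII)
lst    = 8 * n                 # list(map(vocab.get, data)): one 8-byte pointer
                               # per char (the small int objects are cached)
tensor = 8 * n                 # np.array(...) defaults to int64, 8 bytes/char
print((text + lst + tensor) / n)   # 17.0 -- before transient copies push it up
```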

sherjilozair commented 8 years ago

Thanks for the report, @Alicemargatroid, @izqui. I need some help figuring out the right way to fix this problem.

How big is the data.npy file? Is it 20 times as large as well?

Should we optimize the data structure or switch to a streaming loader?

ghost commented 8 years ago

@sherjilozair I think a streaming loader would be best.
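For what it's worth, a streaming loader could keep the one-time preprocessing pass but serve batches through a memory map, so resident memory stays near one batch regardless of corpus size. A sketch under that assumption (`stream_batches` is hypothetical, not the repo's API):

```python
import numpy as np

def stream_batches(tensor_file, batch_size, seq_length):
    # mmap_mode="r" pages data.npy in from disk on demand instead of
    # loading the whole tensor; slicing copies only one batch at a time.
    data = np.load(tensor_file, mmap_mode="r")
    chunk = batch_size * seq_length
    for start in range(0, len(data) - chunk - 1, chunk):
        x = np.asarray(data[start : start + chunk])
        y = np.asarray(data[start + 1 : start + chunk + 1])  # next-char targets
        yield (x.reshape(batch_size, seq_length),
               y.reshape(batch_size, seq_length))
```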

izqui commented 8 years ago

@sherjilozair Right now char-rnn is using 13.54 GB of RAM, and these are the sizes of the data files:

-rw-r--r-- 1 root root 6254212000 Jun  6 16:50 data.npy
-rw-r--r-- 1 root root  781776490 Jun  6 16:22 input.txt
-rw-r--r-- 1 root root       1357 Jun  6 16:47 vocab.pkl
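Note that data.npy is almost exactly 8x input.txt, which is just NumPy's default int64 (8 bytes per character) plus the .npy header. So the on-disk tensor is 8x, not 20x; the rest of the peak RAM comes from the Python-side list built during preprocessing. A quick check against the sizes above:

```python
>>> round(6254212000 / 781776490, 6)   # data.npy vs. input.txt
8.0
>>> 6254212000 - 8 * 781776490         # leftover is just the .npy header
80
```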