keras-team / keras-applications

Reference implementations of popular deep learning models.

keras lstm text generation, dealing with a large dataset possibly using batches #75

Closed sam-thecoder closed 5 years ago

sam-thecoder commented 5 years ago

I'm making my own version of the Keras text generator based on this example: https://github.com/keras-team/keras/blob/master/examples/lstm_text_generation.py. My code is essentially the same (https://bitbucket.org/muiruri_samuel/rap-generator/src/master/char_lstm.py), only working with a different dataset: some rap lyrics.

At first my text file was about 0.5 MB, with about 100k lines of text. However, I found a Kaggle dataset with rap lyrics from multiple artists, and after populating my text file with that content the file size bloated to 350 MB. From the print statements when the script starts, I believe it holds over 300 million characters of text.

The big disadvantage is that it needs more RAM to run. I was willing to train in the cloud, but even on a server with 120 GB of RAM it crashes long before it exhausts the available memory.

I know this sounds weird, so I also recorded a run with htop open to show that it crashes with a MemoryError before it even uses 40 GB of RAM.

    muiruri_samuel@instance-1:~/rap-generator$ python char_lstm_new.py
    Using TensorFlow backend.
    corpus length: 306514394
    total chars: 359
    nb sequences: 102171418
    Vectorization...
    Traceback (most recent call last):
      File "char_lstm_new.py", line 44, in <module>
        x = np.zeros((len(sentences), maxlen, len(chars)), dtype=np.bool)
    MemoryError
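A rough size estimate helps explain why the crash happens immediately rather than after RAM fills up: np.zeros has to reserve the whole array up front, and a NumPy boolean takes one byte per element. The sketch below uses the sequence count and alphabet size from the output above. The value maxlen = 40 is an assumption taken from the referenced Keras example (the script in question may use a different value), and the helper name dense_onehot_bytes is made up for illustration.

```python
def dense_onehot_bytes(n_sequences, maxlen, n_chars, bytes_per_element=1):
    """Bytes needed for a dense (n_sequences, maxlen, n_chars) array.

    A NumPy boolean array stores one byte per element, hence the default.
    """
    return n_sequences * maxlen * n_chars * bytes_per_element


# Values reported by the script: 102171418 sequences, 359 distinct characters.
# maxlen=40 is an ASSUMPTION borrowed from the Keras example, not from the log.
needed = dense_onehot_bytes(n_sequences=102171418, maxlen=40, n_chars=359)
print(f"x alone would need about {needed / 1e12:.2f} TB")
```

Under that assumption the x array alone comes to roughly 1.47 TB, more than ten times the 120 GB available, so the allocation is refused before any significant memory shows up in htop.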

The screen record here: https://youtu.be/i7nWeJEYavY

I believe the best way around this would be to use batches. Also, this doesn't include a testing set. I know it's asking a lot, but as I'm learning Keras this is my working project and I like to learn practically.

taehoonlee commented 5 years ago

@Six-wars, It is hard for me to follow your points. Do you want to make a PR for replacing model.fit with model.fit_generator? Otherwise, are you asking us to replace them? We don't have plans to revise the example, and would be very pleased if you could contribute to Keras.

sam-thecoder commented 5 years ago

Okay, I'll work on my edit and push it later. I assumed you might have a guide on this. Anyway, I'll start on the first hurdle, which is the x and y population.
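For anyone following along, here is a minimal sketch of what batch-wise x and y population could look like: instead of one-hot encoding every sequence up front, a Python generator builds one small batch at a time. This is not the code from the repository linked above. The function name batch_generator, the defaults maxlen=40, step=3 and batch_size=128, and the toy text are all assumptions for illustration.

```python
import numpy as np


def batch_generator(text, char_indices, maxlen=40, step=3, batch_size=128):
    """Yield (x, y) one-hot batches forever without materialising the full dataset."""
    n_chars = len(char_indices)
    # Start offsets of every training window; a range object uses almost no memory.
    starts = range(0, len(text) - maxlen, step)
    while True:  # loop indefinitely; the caller decides how many batches form an epoch
        for b in range(0, len(starts), batch_size):
            batch_starts = starts[b:b + batch_size]
            x = np.zeros((len(batch_starts), maxlen, n_chars), dtype=bool)
            y = np.zeros((len(batch_starts), n_chars), dtype=bool)
            for i, s in enumerate(batch_starts):
                for t, ch in enumerate(text[s:s + maxlen]):
                    x[i, t, char_indices[ch]] = True
                # The target is the character that follows the window.
                y[i, char_indices[text[s + maxlen]]] = True
            yield x, y


# Toy usage with a hypothetical text; a real run would read the lyrics file instead.
toy_text = "hello world, hello keras"
toy_chars = sorted(set(toy_text))
toy_indices = {c: i for i, c in enumerate(toy_chars)}
gen = batch_generator(toy_text, toy_indices, maxlen=5, step=3, batch_size=4)
x_batch, y_batch = next(gen)
print(x_batch.shape, y_batch.shape)
```

With the defaults above, each batch of x holds only 128 × 40 × 359 booleans, under 2 MB. A generator like this could then be passed to model.fit_generator as suggested earlier in the thread, with steps_per_epoch set to the number of batches per pass over the text.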


taehoonlee commented 5 years ago

PRs are always welcome. Also, you'd better open a thread on Keras, not here on Keras-applications.

ScarletMcLearn commented 3 years ago

@sam-thecoder, can you please tell us how you dealt with the memory issue when working with the large dataset?

Thank you.