sherjilozair / char-rnn-tensorflow

Multi-layer Recurrent Neural Networks (LSTM, RNN) for character-level language models in Python using Tensorflow
MIT License

Interface to save checkpoint? #59

Closed wrapperband closed 7 years ago

wrapperband commented 7 years ago

The interface has changed from char-rnn, so I didn't set the checkpoint save interval correctly. I've started a run with a larger rnn_size in char-rnn-tensorflow, and it looks like it won't save a checkpoint until epoch 50.

It's currently on epoch 36 and is getting through about 2 epochs per day (250 buffer), i.e. another week to go.

Looking at the training curve, it would have been great to be able to save a checkpoint "on demand", i.e. via a keyboard command at certain points; in this case 18 epochs was where the training leveled off.
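For illustration, here is a minimal sketch of one way on-demand saving could be wired into a TF 1.x training loop like the one in train.py. Everything here is hypothetical (the `save_now` sentinel file and the `maybe_save_on_demand` helper are not part of the repo); it just shows the idea of triggering `saver.save` from outside the process without a keyboard listener:

```python
# Hypothetical sketch, not part of char-rnn-tensorflow: request a checkpoint
# from another terminal by touching a sentinel file (e.g. `touch save_now`).
import os

SAVE_FLAG = 'save_now'  # hypothetical sentinel file name

def maybe_save_on_demand(sess, saver, save_dir, global_step):
    """If the sentinel file exists, write a checkpoint and remove the flag.

    `sess` is the active tf.Session and `saver` a tf.train.Saver, both of
    which train.py already has in scope inside its batch loop.
    """
    if os.path.exists(SAVE_FLAG):
        path = saver.save(sess, os.path.join(save_dir, 'model.ckpt'),
                          global_step=global_step)
        os.remove(SAVE_FLAG)
        print('on-demand checkpoint saved to', path)
```

Calling this once per batch inside the training loop would give checkpoint-on-demand at the cost of one `os.path.exists` check per step.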

julien-c commented 7 years ago

You're either training on a massive input dataset, or on a device with very little compute :)

ubergarm commented 7 years ago

Unfortunately for your situation, it would have been nice if you had started training with --save_every 10 or similar. That does seem slow.

Since there is no way to change this after training has started, I suggest doing a very short "throwaway" run to estimate the speed, then starting over with the desired number of epochs and save interval (something like the commands sketched below).
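For example, a hypothetical pair of invocations (the dataset path is made up): a one-epoch trial to gauge speed, then the real run with a tighter save interval. As I understand it, --save_every in this repo counts training steps (batches), not epochs:

```
python train.py --data_dir=./data/mycorpus --num_epochs=1
python train.py --data_dir=./data/mycorpus --num_epochs=50 --save_every=500
```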

You can now also use TensorBoard to visualize training and compare speed across runs, to better tune performance.
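Assuming you kept the default log directory that train.py writes its summaries to, that would be something like:

```
tensorboard --logdir=./logs
```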

Thanks!

wrapperband commented 7 years ago

Re: the time. It was a 1400-neuron net on 200,000 lines of data (2,200 "pages", slightly less than Lord of the Rings), running on an R9 290.