sherjilozair / char-rnn-tensorflow

Multi-layer Recurrent Neural Networks (LSTM, RNN) for character-level language models in Python using Tensorflow
MIT License

No progress in learning #84

Closed prontiol closed 7 years ago

prontiol commented 7 years ago

I am trying to train a network on a 21 MB text file, but whatever I do, it gets stuck at a train_loss of ~1.6 and does not improve any further. I tried changing various hyperparameters, but nothing helps: the network always stops learning and gets stuck at about 1.6-1.7 train_loss. How can I diagnose the problem? Can someone advise?

ubergarm commented 7 years ago

I've been playing with this repo for half a day and have had good luck training on datasets from 500 KiB (Donald Trump tweets) to 11 MiB (KJV Bible). With default settings the train_loss drops to around 1.2 quite quickly (10 minutes on a single GPU). Going below that takes longer, with mixed subjective quality gains.

I haven't calculated the theoretical minimum; perhaps around 1? I need to read up more and look into perplexity as well.
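For a rough sense of scale (my own back-of-the-envelope math, not anything from this repo): the loss here is cross-entropy in nats per character, so perplexity is just exp(loss), and an untrained model over a vocabulary of V characters should start near ln(V):

```python
# Quick sketch relating train_loss (nats/char) to perplexity.
# Assumes a vocabulary of ~65 characters, typical for plain-English corpora.
import math

vocab_size = 65
print(math.log(vocab_size))  # ~4.17: roughly the loss of an untrained model
print(math.exp(1.6))         # ~4.95: perplexity at your reported plateau
print(math.exp(1.2))         # ~3.32: perplexity at the ~1.2 loss I'm seeing
```

So a loss of 1.6 means the model is effectively hesitating between about 5 equally likely next characters at each step.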

But directly to your question: how does the sample output look after training for an hour or so? Is it beginning to work at least, or is it just noise?
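If you haven't tried it yet, something like this should spot-check generation (a sketch, assuming the repo's sample.py defaults and a checkpoint already written to ./save by train.py):

```python
# Minimal sketch: invoke the repo's sample.py to eyeball generated text.
# Assumes train.py has saved a checkpoint to ./save and that sample.py
# accepts --save_dir and -n (number of characters) as in the current repo.
import subprocess

subprocess.run(
    ["python", "sample.py", "--save_dir", "save", "-n", "500"],
    check=True,
)
```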

A few thoughts:

  1. Try training on just a 1 MB subsection of the training data with num_layers of 1 or 2 to start (see the sketch after this list).
  2. Larger values for rnn_size, num_layers, and seq_length take longer to converge to low train_loss values, in my limited experience.
  3. If those values are too low, you'll converge quickly on a non-optimal solution.
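For point 1, carving out a subsection is a few lines of Python (a sketch; the paths are placeholders, and I'm assuming the usual layout where data_dir contains an input.txt):

```python
# Sketch: write the first 1 MiB of the training corpus to a new data dir,
# which you can then pass to train.py via --data_dir.
import os

src = "data/my_corpus/input.txt"   # placeholder: your full 21 MB corpus
dst_dir = "data/my_corpus_1mb"     # placeholder: the trimmed-down data dir

os.makedirs(dst_dir, exist_ok=True)
with open(src, "rb") as f:
    head = f.read(1024 * 1024)     # first 1 MiB
with open(os.path.join(dst_dir, "input.txt"), "wb") as f:
    f.write(head)
```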

Good luck! Hope you've had success!

ubergarm commented 7 years ago

I hope you've had better luck; we also just merged a number of PRs this weekend. Keep an eye out for a better README with descriptions of the parameters.

If you do have some luck, feel free to post back here with your settings, datasets, and results!

Thanks