sherjilozair / char-rnn-tensorflow

Multi-layer Recurrent Neural Networks (LSTM, RNN) for character-level language models in Python using Tensorflow
MIT License
2.64k stars 960 forks

Results quite unsatisfactory #91

Open STNBRT opened 7 years ago

STNBRT commented 7 years ago

First and foremost thanks to everybody involved in this. I really appreciate the work you are putting into this.

Previously I was using Karpathy's char-rnn, but I couldn't get Torch running with my GPU after updating my hardware, so I was looking for a different solution and that brought me here. Using Karpathy's rnn I was getting beautiful results with even very small datasets (around 1MB). With your TensorFlow implementation the results are not as good, and I wonder why. I tried fiddling around with the parameters (rnn_size, num_layers, etc.), but the improvements were small or nonexistent.

It would be really cool if you could add some explanatory comments to the different parameters, i.e. how each one affects the result. For someone like me who is relatively new to NNs, this would help a lot in getting better results.

Thanks again for your efforts!

ubergarm commented 7 years ago

@virxxiii thanks for your feedback. I too am interested in getting good results out of this project and comparing it to other solutions.

Just merged PR #73, which gives you the default values for all of the parameters; that's a start.

Hopefully I'll start a table in the README to give a brief description of each parameter and a link to a journal article for more details etc.

I also hope to add a perplexity calculation, which might be useful for comparing results across various architectures / models.
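
For what it's worth, perplexity is just the exponential of the average per-character cross-entropy, so assuming the train_loss reported here is that average (in nats), a minimal sketch of the calculation would be:

import math

def perplexity(mean_cross_entropy):
    # Perplexity = exp(average per-character cross-entropy in nats).
    return math.exp(mean_cross_entropy)

# e.g. a train_loss of 1.0 corresponds to a perplexity of ~2.72 -- roughly as
# uncertain as picking uniformly among ~2.7 characters at each step.
print(perplexity(1.0))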

Anecdotally, I've had good results with this system training on a baby-names dataset, using something like:

python train.py --data_dir data/babynames/ \
                         --num_layers 4 \
                         --model nas \
                         --seq_length 12 \
                         --num_epochs 200

I then let it train for about 12 hours on my GeForce GTX 1070 until the train_loss was below 1.0.

You can pick the number of epochs based on the speed of training to essentially "train for X hours".
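
A rough back-of-the-envelope for that, assuming you read the seconds-per-batch and batches-per-epoch figures off your own training log (the numbers below are placeholders, not from a real run):

# Placeholder numbers -- substitute whatever your training log reports.
hours_available = 12
sec_per_batch = 0.05       # e.g. the time-per-batch figure in the log
batches_per_epoch = 2000   # total batches in one pass over your data

num_epochs = int(hours_available * 3600 / (sec_per_batch * batches_per_epoch))
print(num_epochs)          # pass this as --num_epochs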

Finally, we just added dropout so you could play around with adding:

--input_keep_prob 0.8 --output_keep_prob 0.5

Though it's not been tested much, so YMMV.
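
For the curious, those keep_prob flags follow the standard TF 1.x dropout-wrapper idea; this is just a minimal sketch of the concept, not necessarily this repo's exact code:

import tensorflow as tf  # assumes a TF 1.x-era API (tf.contrib)

def lstm_cell_with_dropout(rnn_size, input_keep_prob=1.0, output_keep_prob=1.0):
    # keep_prob of 1.0 means no dropout; 0.8 keeps 80% of activations, etc.
    cell = tf.contrib.rnn.BasicLSTMCell(rnn_size)
    return tf.contrib.rnn.DropoutWrapper(cell,
                                         input_keep_prob=input_keep_prob,
                                         output_keep_prob=output_keep_prob)

# --input_keep_prob 0.8 --output_keep_prob 0.5 would correspond to:
cell = lstm_cell_with_dropout(128, input_keep_prob=0.8, output_keep_prob=0.5)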

It'd be great to start collecting some real results / Tensorboard graphs / example CLI parameters too.

Just some thoughts, thanks!

hoomanNo5 commented 7 years ago

Hi, I'm struggling to get good results as well. This is my first time using an LSTM RNN and I'm relatively new (cliché response, I know), but I've gotten human-readable output after training on my dataset with other RNNs; here I get only garbage. A couple of things I've noticed are definitely bad, but I don't know why yet:

Related: ItC Spitlight Map Code EDditional Citi Idgy of 4014 wetween from all expressways 0,06-13011-163 Q.72 Goidlas 8200(e. Sloade. Ective Stall 2 8 Manofacture hearning? and all value
(Fon-Plapeceat.

Remark:
Fir the patential and trendst the complete stringly describe. This are set to as emengage, ot accessible to exphent uncrades, the path learning technologies cor hapdlire and traditional system of mears behaviorarizing, as any crissify-area ipportunity of antimative suarced available, are noted in the retail. They lomely as empreesly, fulls of Di.

Ringe is bith clifk-some actian. The Uthernet of the productivity is to simply approval the law state the around matore, incormation/performance mas l'



I think it's converging too fast as you said in previous posts. How would you suggest I get better output?

Thanks.

ubergarm commented 7 years ago

@hoomanNo5

Your setup sounds good (I use the same hardware / TF-GPU setup)...

Here are my suggestions:

  1. Your rnn_size and number of layers are kind of big for the small amount of input data.
    Dataset sizes: Note that if your data is too small (1MB is already considered very small) the RNN won't learn very effectively. Remember that it has to learn everything completely from scratch. Conversely if your data is large (more than about 2MB), feel confident to increase rnn_size and train a bigger model (see details of training below). It will work significantly better. For example with 6MB you can easily go up to rnn_size 300 or even more. 

Also, the dropout feature is new, and from what I can tell it is mainly useful when you're trying to avoid overfitting your data, which isn't the current problem.

So I'd suggest starting with the basic settings and seeing how that goes, e.g.:

python3.5 train.py --data_dir=./data/Nav/

The defaults are pretty good for your amount of data. ( --rnn_size 128 and --num_layers 2 are default. )

The next thing I'd consider increasing is --seq_length, perhaps to 100 from the default of 50. That will help the short-term memory effect persist across a longer string of characters.

Please report back with your findings!

ubergarm commented 7 years ago

Also, people have had good success with this specific repo; here is an example of generating music in ABC notation and converting it to MIDI: https://maraoz.com/2016/02/02/abc-rnn/ I re-created this musical experiment myself on the latest version, and while the songs aren't great, they were all playable as MIDI files without hand-tweaking!

Tuning your models is kind of a "dark art" at this point. My best advice in general is:

  1. Start with as much clean input.txt as possible e.g. 50MiB
  2. Start by establishing a baseline using the default settings.
  3. Use tensorboard to compare all of your runs visually to aid in experimenting.
  4. Tweak --rnn_size up somewhat if you have a lot of input data.
  5. Tweak --num_layers from 2 to 3 but no higher unless you have experience.
  6. Tweak --seq_length up based on the length of a valid input string (e.g. names are <= 12 characters, sentences may be up to 64 characters, etc.). An LSTM cell will "remember" for durations longer than this sequence, but the effect falls off over longer character distances.
  7. Finally, once you've done all that, only then would I suggest adding some dropout. Start with --output_keep_prob 0.8 and maybe end up with both --input_keep_prob 0.8 --output_keep_prob 0.5, but only after exhausting all the above values (a combined example command follows this list).
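
For example, pulling several of those knobs together into one run (the data directory and the specific numbers here are just placeholders, not a recommendation for any particular dataset):

python train.py --data_dir data/my_corpus/ \
                --rnn_size 256 \
                --num_layers 3 \
                --seq_length 64 \
                --output_keep_prob 0.8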

Happy searching!

hoomanNo5 commented 7 years ago

You're right about the size. I actually had 4.3MB in my corpus but I removed a big chunk of it because I thought it was too different from the rest of my corpus.

I will remove the dropout to see if it gets better results, and I will definitely increase the sequence length. Will that drastically increase the time per iteration?

One nagging question in the back of my head is: what counts as clean data for this LSTM? Should all the text files be in the same format, or can it vary? E.g., can movie scripts be mixed with whitepapers or technical specifications?

ubergarm commented 7 years ago

@hoomanNo5 another interesting post I read about mixing input from various authors:

https://www.gwern.net/RNN%20metadata

Technically you can mix anything you want into one big input.txt. It really depends on what you're trying to get out, because whatever you put into the system is what you'll get out of it.

If you use a lot of different formats in one training run, the model may "waste some space" learning to mimic each of the different formatting styles. So the most "efficient" setup has a consistent style of input, assuming you want a consistent style of output.
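
One trick from that post is to prefix each source with a short metadata tag before concatenating everything into input.txt, so the model can learn to condition on style. A hypothetical sketch (the file names and the "|TAG|" convention are made up for illustration, not part of this repo):

# Concatenate several sources into one input.txt, tagging each line
# with where it came from so style differences stay distinguishable.
sources = {
    "|SCRIPT|": "movie_scripts.txt",   # hypothetical paths
    "|PAPER|":  "whitepapers.txt",
}

with open("data/mixed/input.txt", "w") as out:
    for tag, path in sources.items():
        with open(path) as src:
            for line in src:
                out.write(tag + " " + line)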

STNBRT commented 7 years ago

Thanks everybody for your input!

I have finally managed to get waaaay better (actually really good) results by reducing num_layers to 3. It seems to me that anything above 3 leads to drastically worse output. I don't know why this is; I'm just telling you what I observed, in case it helps.

As for the dropout: I'd really appreciate it if you could give a short explanation of how it affects training.

hoomanNo5 commented 7 years ago

@ubergarm So here is what I did:

b' autonomous driving mode, price-radiring, IP agreement, division of a point control of future rationalization and metep players for the human body of the potential grade, analyzing share data via the entry of datasets. We describe the Toyota tomer an arrows of "mobile words". Unlike HAV systems complete with the state-of-the-attery datasets for faster file approaches, mapping steing object, and the three dipensic business of codes being shown.

  1. Continental as queries and evaluation of passengers are the models where sure used the acquisition of 4114th. For example, Technology and Telematics Standard 7 processors training the number of fixed science Solution ? as all this process move to these approaches to be Jinimistrat, KCCA-based models, the 3M particular growing mechanism. A numbered image is decreasing by one way tested affered threats of (9) consumers by wall with modern multi-domain staking estimation headunits (e.g. Chinese driving for inerrut products) regarding the buttle a reasonable smart level.

3.3 Summal Amphential Tracking:

  • Furthermore and Time Slow: Tow rest pimpline of an expressing the sensitive volume at 20%-distance Applications such as predictions and a surveyted cost estimation rather than in 2015.
    The First 8Guech zone cars includes a rear-seat fame, you avoid CO3.
  • The second largest capability of autonomous technologies can be revising ADAS belief a scale to driving response toward the network estimate availability to wet 1 (0) Time the filter of the network message of traffic states. You have a primary, or Japans which hun to ensure that, in the data that your in-vehicle opong on sharing at this process. IEEE Auto7 (SAAR 2014-2026-2024), the NLTSA solution would end-to-reach an image wave. Organization is extremely being responsible to enter growth rates. The market are you now not yet intended to identify this before not evolve, such as data much aligned within the just over-the vehicle and that leverages the auto-tuning bo'

A couple of questions come to mind:

  1. the " ? " sequence is something that I thought I cleaned out of my input.txt. I did leave the old versions of input.txt in the same folder but under different names. I looked through your code, but I can only find references to "input.txt." Is there logic that uses the entire folder?
  2. I left data.npy and vocab.pkl from previous training sessions with old corpora. I should have removed those, right? I removed them for my next training session so that they get regenerated from the latest corpus (see the cleanup one-liner after this list).
  3. My corpus contains 1 gigantic chunk of dialogue, 13 whitepapers, 3 requirements docs, and 19 analyst reports. After reading your link from gwern, I realize I am making a huge mistake with my mixed corpus: I need to balance it or segment it. Right now every sentence ends with a newline, which would prevent the RNN from learning the differences in style adequately, because it is trying to average everything, and that isn't right. Using a large sequence length doesn't help either if a sentence is less than 20 characters long. Man... I know so little about this. No question here, just thinking out loud. Thank you for responding!
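
For reference, clearing that cached preprocessing before retraining is just a matter of deleting the two generated files from the data directory (the path below is a placeholder):

rm data/my_corpus/data.npy data/my_corpus/vocab.pkl
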
gwern commented 7 years ago

I have finally managed to get waaaay better (actually really good) results by reducing the num_layers to 3. It seems to me that anything above 3 results in drastically worse results. I don't know why this is,

I was just reading this and was about to point it out: most people strongly recommend against using more than 3 layers in RNNs, and even 1 layer often works well, because, as you discovered, deeper stacks work horribly. (For example, I think Karpathy mentions this in his RNN post.) It's either too much computation per character or it's a depth problem, the same way you can't train regular feedforward or CNN NNs deeper than about 20 layers without them totally diverging on you, unless you have special tricks like residual layers. (An RNN can be seen as an extremely deep feedforward NN, after all.) For RNNs this problem hasn't been solved yet AFAIK; presumably someone will eventually figure out how to reuse residual layers or batchnorm or initialization in just the right way to fix it, but not yet.
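
For anyone who wants to poke at the residual idea anyway, TF 1.x does ship a residual wrapper for RNN cells; a minimal sketch of stacking LSTM layers with skip connections (not something this repo does out of the box) might look like:

import tensorflow as tf  # assumes a TF 1.x-era API (tf.contrib)

def residual_lstm_stack(rnn_size, num_layers):
    # Wrap every layer after the first in a ResidualWrapper so each layer
    # learns a delta on top of its input, giving gradients a shortcut.
    cells = []
    for i in range(num_layers):
        cell = tf.contrib.rnn.BasicLSTMCell(rnn_size)
        if i > 0:  # input/output sizes match, so the residual addition is valid
            cell = tf.contrib.rnn.ResidualWrapper(cell)
        cells.append(cell)
    return tf.contrib.rnn.MultiRNNCell(cells)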

vibss2397 commented 6 years ago

Guys, I too am having unsatisfactory results using an LSTM. I did not use this code, but I looked through it and Andrej's code before trying my own implementation. My config is: BATCH_SIZE=25 LAYER_NUM=2 SEQ_LENGTH=75 HIDDEN_DIM=128 GENERATE_LEN=500 LEARNING_RATE=2e-3

My loss is down to 4, but it's still not constructing proper words or sentences. The dataset I am using is the mini Shakespeare one for now.


epoch 582 iteration 100 loss 4.223689

n nd t thedathearot ould the, ooonth, ife! it cath oubu is ands, TI t w: Une d, O: Boouth A: fof. buld LI cther nichiet, ad: Asithimownuch the h f poupen MOLI tharorilel, f o'de oulant, ses ousth, Thik

So I wanted to ask: should I wait for the loss to go below 1, should I change some parameters, or is there a problem with my code?
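
For calibrating that kind of number, here is a rough sanity check, assuming a character vocabulary of about 65 symbols (roughly what the tiny Shakespeare set has):

import math

# Cross-entropy of a uniform random guesser over the vocabulary, in nats.
vocab_size = 65
random_baseline = math.log(vocab_size)
print(random_baseline)   # ~4.17 nats

# So a loss near 4.2 is still around the uniform-guessing baseline, while the
# ~1.0 figure mentioned earlier in the thread corresponds to perplexity e ~ 2.7.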