tensorflow / nmt

TensorFlow Neural Machine Translation Tutorial
Apache License 2.0

Why is perplexity increasing when using GRU units? #164

Open · ghost opened this issue 6 years ago

ghost commented 6 years ago

Hi,

I tried training my system with GRU units instead of LSTM. Surprisingly, the ppl increases with every step. Check the log here:

  global step 100 lr 1 step-time 0.53s wps 7.34K ppl 2424133542.95 bleu 0.00
  global step 200 lr 1 step-time 0.32s wps 12.28K ppl 172575657.01 bleu 0.00
  global step 300 lr 1 step-time 0.32s wps 12.23K ppl 2312801.91 bleu 0.00
  global step 400 lr 1 step-time 0.32s wps 12.28K ppl 56363211.42 bleu 0.00
  global step 500 lr 1 step-time 0.32s wps 12.20K ppl 63951444.31 bleu 0.00
  global step 600 lr 1 step-time 0.32s wps 12.19K ppl 5311197239.04 bleu 0.00
  global step 700 lr 1 step-time 0.32s wps 12.12K ppl 258516364796.08 bleu 0.00
  global step 800 lr 1 step-time 0.32s wps 12.20K ppl 65653443631.91 bleu 0.00
  global step 900 lr 1 step-time 0.32s wps 12.19K ppl 175912961198871744.00 bleu 0.00
  global step 1000 lr 1 step-time 0.32s wps 12.25K ppl 122399524210719520.00 bleu 0.00
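
For context: in most seq2seq codebases, the reported ppl is the exponential of the average per-word cross-entropy, so values in the billions mean the training loss itself is diverging, not that the metric is miscomputed. A minimal sketch of the computation, with illustrative names:

  import math

  def perplexity(total_cross_entropy, total_predict_count):
      # Perplexity = exp(average per-word cross-entropy).
      avg_loss = total_cross_entropy / total_predict_count
      # Cap the exponent so a diverging loss yields a huge-but-finite
      # ppl (like the values in the log above) instead of overflowing.
      return math.exp(min(avg_loss, 100.0))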

The major parameters used are (see the note after the list):

  attention=scaled_luong
  attention_architecture=standard
  batch_size=128
  beam_width=10
  decay_factor=0.9
  decay_steps=3000
  dropout=0.2
  encoder_type=uni
  forget_bias=1.0
  infer_batch_size=32
  init_op=uniform
  init_weight=0.1
  learning_rate=1.0
  log_device_placement=False
  max_gradient_norm=5.0
  max_train=0
  metrics=['bleu']
  num_buckets=5
  num_gpus=1
  num_layers=2
  num_residual_layers=0
  num_train_steps=25000
  num_translations_per_input=1
  num_units=512
  optimizer=sgd
  override_loaded_hparams=False
  pass_hidden_state=True
  random_seed=None
  residual=False
  share_vocab=False
  sos=<s>
  source_reverse=False
  src_vocab_size=78052
  start_decay_step=12000
  tgt_vocab_size=55534
  time_major=True
  unit_type=gru
  warmup_scheme=t2t
  warmup_steps=0
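
One thing worth noting in these hparams: forget_bias=1.0 only applies to LSTM cells. A GRU has no forget-gate bias, so switching unit_type to gru also drops one of the defaults that stabilizes early LSTM training. A simplified, illustrative sketch of how such a flag typically maps to a TF 1.x cell (not the repo's exact code):

  import tensorflow as tf  # TF 1.x API

  def make_cell(unit_type, num_units, forget_bias=1.0, dropout=0.0):
      # Illustrative cell factory; note forget_bias only exists for LSTM.
      if unit_type == "lstm":
          cell = tf.nn.rnn_cell.BasicLSTMCell(num_units, forget_bias=forget_bias)
      elif unit_type == "gru":
          # GRUCell takes no forget_bias: there is no forget-gate bias
          # to nudge the cell toward remembering early in training.
          cell = tf.nn.rnn_cell.GRUCell(num_units)
      else:
          raise ValueError("Unknown unit_type: %s" % unit_type)
      if dropout > 0.0:
          cell = tf.nn.rnn_cell.DropoutWrapper(cell, input_keep_prob=1.0 - dropout)
      return cell
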
ryangmolina commented 6 years ago

I also have this problem.

luckcul commented 6 years ago

I have this problem too. The perplexity becomes NaN after about 10,000 steps.

guillaumekln commented 6 years ago

Did you try with a lower learning rate, like 0.1?
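
In TF 1.x terms the suggestion is a one-line change to the optimizer setup. A minimal sketch, assuming a `loss` tensor from the seq2seq graph (the clipping value matches max_gradient_norm=5.0 in the hparams above):

  import tensorflow as tf  # TF 1.x API

  params = tf.trainable_variables()
  gradients = tf.gradients(loss, params)  # `loss` comes from the model graph
  # Clip to max_gradient_norm=5.0, matching the hparams above.
  clipped_grads, _ = tf.clip_by_global_norm(gradients, 5.0)
  # Plain SGD, but with the learning rate lowered from 1.0 to 0.1.
  optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
  train_op = optimizer.apply_gradients(zip(clipped_grads, params))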

ryangmolina commented 6 years ago

Lowering the learning rate fixed the problem. I think Adam with a learning rate of 0.001 is a better fit for GRU.
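
For reference, relative to the previous sketch this is a drop-in swap; 0.001 is also tf.train.AdamOptimizer's default step size:

  import tensorflow as tf  # TF 1.x API

  # Drop-in replacement for GradientDescentOptimizer in the sketch above.
  # Adam's per-parameter adaptive step sizes make it far less sensitive
  # to the global learning rate than plain SGD.
  optimizer = tf.train.AdamOptimizer(learning_rate=0.001)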

HongHaiPV commented 6 years ago

@RyanMolina But the optimizer in the run above is SGD. So is the problem that a GRU can't be trained with SGD as the optimizer, or is it another bug?