No progress in English to Hindi Translation model - help identify the mistake!

Description

I have put together an English-Hindi parallel corpus of 7 million lines. I have tried the transformer_base, transformer_big, and universal_transformer models. All three are stuck at a similar loss (0.3 - 0.4) and an approximate BLEU score of 0.588, even after more than 1 million steps. The decoder output is just a repetition of one or more subwords.
Where is the mistake?
Note: When I trained the transformer_base model on only about 35,000 lines from the corpus, using a similar setup and command, the model did a pretty good job of translating sentences similar to the ones it was trained on.
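
Since the 35,000-line subset trained fine, one thing that may be worth ruling out is a problem in the full corpus itself, such as empty lines, untranslated copies, or misaligned pairs. The following is only a minimal sketch and not part of the original setup: `alignment_report` and `max_ratio` are hypothetical names, and it assumes the corpus can be loaded as two line-aligned lists of sentences.

```python
def alignment_report(src_lines, tgt_lines, max_ratio=3.0):
    """Count common problems in a line-aligned parallel corpus.

    max_ratio is an assumed threshold: a pair is flagged when one side
    has more than max_ratio times as many whitespace tokens as the other.
    """
    report = {
        "src_lines": len(src_lines),
        "tgt_lines": len(tgt_lines),
        "empty_side": 0,
        "identical_pair": 0,
        "length_ratio_outlier": 0,
        "duplicate_pair": 0,
    }
    seen = set()
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            report["empty_side"] += 1
            continue
        if src == tgt:
            # Source copied to target usually means the line was never translated.
            report["identical_pair"] += 1
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            report["length_ratio_outlier"] += 1
        if (src, tgt) in seen:
            report["duplicate_pair"] += 1
        seen.add((src, tgt))
    return report


# Tiny illustrative sample; the real corpus would be read from its own files.
sample_en = ["This is a book.", "Hello", "", "Hello", "OK", "Yes"]
sample_hi = ["यह एक किताब है।", "नमस्ते", "खाली", "नमस्ते", "OK", "हाँ मैं कल सुबह आऊँगा"]
report = alignment_report(sample_en, sample_hi)
print(report)
```

A mismatch between `src_lines` and `tgt_lines`, or a large share of flagged pairs, would point at the data rather than the model. For 7 million lines the `seen` set may need a lot of memory, so storing a hash of each pair instead is one option.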
Environment information

https://github.com/phildani7/my_attempts/blob/master/big_en_hi_small.ipynb

OS: Ubuntu 18.04.1 LTS

$ pip freeze | grep tensor
tensor2tensor==1.9.0
tensorboard==1.10.0
tensorflow-gpu==1.10.1

$ python -V
Python 3.6.5 :: Anaconda, Inc.

### For bugs: reproduction and error logs

# Steps to reproduce:
I am running "t2t-trainer" in a terminal so that I can watch the output.

phil@philc:~$ t2t-trainer   --data_dir=$DATA_DIR   --t2t_usr_dir=./big_en_hi_small/trainer   --problem=big_en_hi_small   --model=transformer   --hparams_set=transformer_base   --output_dir=$OUTDIR   --worker_gpu=2   --train_steps=10000000
/home/phil/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
INFO:tensorflow:Importing user module trainer from path /home/phil/big_en_hi_small
WARNING:tensorflow:From /home/phil/anaconda3/lib/python3.6/site-packages/tensor2tensor/utils/trainer_lib.py:198: RunConfig.__init__ (from tensorflow.contrib.learn.python.learn.estimators.run_config) is deprecated and will be removed in a future version.
Instructions for updating:
When switching to tf.estimator.Estimator, use tf.estimator.RunConfig instead.
INFO:tensorflow:schedule=continuous_train_and_eval
INFO:tensorflow:worker_gpu=2
INFO:tensorflow:sync=False
WARNING:tensorflow:Schedule=continuous_train_and_eval. Assuming that training is running on a single machine.
INFO:tensorflow:datashard_devices: ['gpu:0', 'gpu:1']
INFO:tensorflow:caching_devices: None
INFO:tensorflow:ps_devices: ['gpu:0', 'gpu:1']
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f7008771320>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': None, '_log_step_count_steps': 100, '_session_config': gpu_options {
  per_process_gpu_memory_fraction: 0.95
}
allow_soft_placement: true
graph_options {
  optimizer_options {
  }
}
, '_save_checkpoints_steps': 1000, '_keep_checkpoint_max': 20, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/home/phil/big_en_hi_small/trained_model', 'use_tpu': False, 't2t_device_info': {'num_async_replicas': 1}, 'data_parallelism': <tensor2tensor.utils.expert_utils.Parallelism object at 0x7f7008771518>}
WARNING:tensorflow:Estimator's model_fn (<function T2TModel.make_estimator_model_fn.<locals>.wrapping_model_fn at 0x7f7007ef5f28>) includes params argument, but params are not passed to Estimator.
WARNING:tensorflow:ValidationMonitor only works with --schedule=train_and_evaluate
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps 1000 or save_checkpoints_secs None.
INFO:tensorflow:Reading data files from /home/phil/t2t_data/big_en_hi_small-train*
INFO:tensorflow:partition: 0 num_data_files: 90
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Setting T2TModel mode to 'train'
INFO:tensorflow:Using variable initializer: uniform_unit_scaling
INFO:tensorflow:Transforming feature 'inputs' with symbol_modality_8268_512.bottom
WARNING:tensorflow:From /home/phil/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/function.py:986: calling Graph.create_op (from tensorflow.python.framework.ops) with compute_shapes is deprecated and will be removed in a future version.
Instructions for updating:
Shapes are always computed; don't use the compute_shapes as it has no effect.
INFO:tensorflow:Transforming 'targets' with symbol_modality_8268_512.targets_bottom
INFO:tensorflow:Building model body
INFO:tensorflow:Transforming body output with symbol_modality_8268_512.top
INFO:tensorflow:Transforming feature 'inputs' with symbol_modality_8268_512.bottom
INFO:tensorflow:Transforming 'targets' with symbol_modality_8268_512.targets_bottom
INFO:tensorflow:Building model body
INFO:tensorflow:Transforming body output with symbol_modality_8268_512.top
INFO:tensorflow:Base learning rate: 2.000000
INFO:tensorflow:Trainable Variables Total size: 48353280
INFO:tensorflow:Using optimizer Adam
/home/phil/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py:108: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2018-10-12 05:05:27.002156: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-10-12 05:05:27.189582: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-10-12 05:05:27.190085: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties: 
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:02:00.0
totalMemory: 11.91GiB freeMemory: 11.74GiB
2018-10-12 05:05:27.263350: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-10-12 05:05:27.263801: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 1 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:01:00.0
totalMemory: 10.92GiB freeMemory: 10.57GiB
2018-10-12 05:05:27.265215: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0, 1
2018-10-12 05:05:28.795640: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-12 05:05:28.795666: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0 1 
2018-10-12 05:05:28.795685: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N Y 
2018-10-12 05:05:28.795690: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1:   Y N 
2018-10-12 05:05:28.796434: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11586 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:02:00.0, compute capability: 6.1)
2018-10-12 05:05:28.881219: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10619 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from /home/phil/big_en_hi_small/trained_model/model.ckpt-1068000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1068000 into /home/phil/big_en_hi_small/trained_model/model.ckpt.
INFO:tensorflow:loss = 0.37076592, step = 1068000
INFO:tensorflow:global_step/sec: 2.82033
INFO:tensorflow:loss = 0.45989177, step = 1068100 (35.457 sec)
INFO:tensorflow:global_step/sec: 3.89435
INFO:tensorflow:loss = 0.6136617, step = 1068200 (25.678 sec)
INFO:tensorflow:global_step/sec: 3.84404
INFO:tensorflow:loss = 0.40534708, step = 1068300 (26.014 sec)
INFO:tensorflow:global_step/sec: 3.82018
INFO:tensorflow:loss = 0.42608535, step = 1068400 (26.177 sec)
INFO:tensorflow:global_step/sec: 3.83804
INFO:tensorflow:loss = 0.46848035, step = 1068500 (26.055 sec)
INFO:tensorflow:global_step/sec: 3.83651
INFO:tensorflow:loss = 0.4674313, step = 1068600 (26.066 sec)
INFO:tensorflow:global_step/sec: 3.81488
INFO:tensorflow:loss = 0.45866266, step = 1068700 (26.213 sec)
INFO:tensorflow:global_step/sec: 3.83821
INFO:tensorflow:loss = 0.41043162, step = 1068800 (26.054 sec)
INFO:tensorflow:global_step/sec: 3.80485
INFO:tensorflow:loss = 0.42918268, step = 1068900 (26.282 sec)
INFO:tensorflow:Saving checkpoints for 1069000 into /home/phil/big_en_hi_small/trained_model/model.ckpt.
INFO:tensorflow:Reading data files from /home/phil/t2t_data/big_en_hi_small-dev*
INFO:tensorflow:partition: 0 num_data_files: 10
WARNING:tensorflow:Padding the batch to ensure that remainder eval batches have a batch size divisible by the number of data shards. This may lead to incorrect metrics for non-zero-padded features, e.g. images. Use a single datashard (i.e. 1 GPU) in that case.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Setting T2TModel mode to 'eval'
INFO:tensorflow:Setting hparams.dropout to 0.0
INFO:tensorflow:Setting hparams.label_smoothing to 0.0
INFO:tensorflow:Setting hparams.layer_prepostprocess_dropout to 0.0
INFO:tensorflow:Setting hparams.symbol_dropout to 0.0
INFO:tensorflow:Setting hparams.attention_dropout to 0.0
INFO:tensorflow:Setting hparams.relu_dropout to 0.0
INFO:tensorflow:Using variable initializer: uniform_unit_scaling
INFO:tensorflow:Transforming feature 'inputs' with symbol_modality_8268_512.bottom
INFO:tensorflow:Transforming 'targets' with symbol_modality_8268_512.targets_bottom
INFO:tensorflow:Building model body
INFO:tensorflow:Transforming body output with symbol_modality_8268_512.top
INFO:tensorflow:Transforming feature 'inputs' with symbol_modality_8268_512.bottom
INFO:tensorflow:Transforming 'targets' with symbol_modality_8268_512.targets_bottom
INFO:tensorflow:Building model body
INFO:tensorflow:Transforming body output with symbol_modality_8268_512.top
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-10-11-23:40:40
INFO:tensorflow:Graph was finalized.
2018-10-12 05:10:40.881060: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0, 1
2018-10-12 05:10:40.881158: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-12 05:10:40.881165: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0 1 
2018-10-12 05:10:40.881182: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N Y 
2018-10-12 05:10:40.881185: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1:   Y N 
2018-10-12 05:10:40.881359: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11586 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:02:00.0, compute capability: 6.1)
2018-10-12 05:10:40.881481: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10619 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from /home/phil/big_en_hi_small/trained_model/model.ckpt-1069000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [10/100]
INFO:tensorflow:Evaluation [20/100]
INFO:tensorflow:Evaluation [30/100]
INFO:tensorflow:Evaluation [40/100]
INFO:tensorflow:Evaluation [50/100]
INFO:tensorflow:Evaluation [60/100]
INFO:tensorflow:Evaluation [70/100]
INFO:tensorflow:Evaluation [80/100]
INFO:tensorflow:Evaluation [90/100]
INFO:tensorflow:Evaluation [100/100]
INFO:tensorflow:Finished evaluation at 2018-10-11-23:42:24
INFO:tensorflow:Saving dict for global step 1069000: global_step = 1069000, loss = 0.9450921, metrics-big_en_hi_small/targets/accuracy = 0.7974321, metrics-big_en_hi_small/targets/accuracy_per_sequence = 0.068794124, metrics-big_en_hi_small/targets/accuracy_top5 = 0.9366053, metrics-big_en_hi_small/targets/approx_bleu_score = 0.58855116, metrics-big_en_hi_small/targets/neg_log_perplexity = -0.9441027, metrics-big_en_hi_small/targets/rouge_2_fscore = 0.6465996, metrics-big_en_hi_small/targets/rouge_L_fscore = 0.75816894
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 1069000: 
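
To make the evaluation line above easier to read, the metrics can be pulled into a dictionary and the perplexity derived from `neg_log_perplexity`. This is a small sketch: `parse_metrics` is a hypothetical helper, the regular expression assumes the `key = value` layout seen in this log, and the line is abbreviated to a few of the logged metrics.

```python
import math
import re

# Abbreviated copy of the "Saving dict" line from the log above.
EVAL_LINE = (
    "INFO:tensorflow:Saving dict for global step 1069000: "
    "global_step = 1069000, loss = 0.9450921, "
    "metrics-big_en_hi_small/targets/accuracy = 0.7974321, "
    "metrics-big_en_hi_small/targets/accuracy_per_sequence = 0.068794124, "
    "metrics-big_en_hi_small/targets/approx_bleu_score = 0.58855116, "
    "metrics-big_en_hi_small/targets/neg_log_perplexity = -0.9441027"
)


def parse_metrics(line):
    """Return {short_metric_name: value} for every 'key = number' pair."""
    pairs = re.findall(r"([\w/-]+) = (-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)", line)
    # Keep only the part after the last '/', e.g. 'accuracy'.
    return {key.rsplit("/", 1)[-1]: float(value) for key, value in pairs}


metrics = parse_metrics(EVAL_LINE)
# neg_log_perplexity is the negated per-token log-perplexity,
# so perplexity = exp(-neg_log_perplexity).
perplexity = math.exp(-metrics["neg_log_perplexity"])

print(f"eval loss:             {metrics['loss']:.3f}")
print(f"token accuracy:        {metrics['accuracy']:.3f}")
print(f"per-sequence accuracy: {metrics['accuracy_per_sequence']:.3f}")
print(f"approx BLEU:           {metrics['approx_bleu_score']:.3f}")
print(f"perplexity:            {perplexity:.2f}")
```

Read this way, the checkpoint at step 1069000 has a per-token accuracy of about 0.80 but a per-sequence accuracy of only about 0.07, an eval loss of about 0.95 against the training loss of 0.37 to 0.61 logged above, and a perplexity of roughly 2.57.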

# Error logs: