tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

No progress in English to Hindi Translation model - help identify the mistake! #1136

Open phildani7 opened 6 years ago

phildani7 commented 6 years ago

Description

I have put together an English-Hindi corpus of 7 million lines. I have tried running the transformer_base, transformer_big and universal_transformer models. All three are stuck at a similar loss (0.3 - 0.4) and an approximate BLEU score of 0.588, even after more than 1 million steps. The decoder output is just one or more subwords repeated over and over.
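One way to put a number on the repetition described above is to measure how many tokens in each decoded line simply repeat an earlier token. The following is a minimal sketch, not part of tensor2tensor: the function names, the whitespace tokenization and the 0.5 threshold are all assumptions chosen for illustration.

```python
from collections import Counter


def repetition_ratio(tokens, n=1):
    """Return the fraction of n-grams in `tokens` that repeat an earlier n-gram."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence after the first one counts as a repeat.
    repeated = sum(count - 1 for count in counts.values())
    return repeated / len(ngrams)


def is_degenerate(line, threshold=0.5):
    """Flag a decoded line whose whitespace tokens are mostly repeats.

    The threshold is an arbitrary assumption, not a tensor2tensor setting.
    """
    return repetition_ratio(line.split()) >= threshold


healthy = "the cat sat on a mat"
looping = "का का का का का का"  # hypothetical example of a stuck decoder

print(repetition_ratio(healthy.split()))
print(repetition_ratio(looping.split()))
```

Running this over a file of decoder output and counting the flagged lines would show whether every sentence collapses or only some of them, which may help narrow down where the problem lies.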

Where is the mistake?

Note: When I ran the transformer_base model on only about 35,000 lines from the corpus, the model did a pretty good job of translating sentences similar to the ones fed in. I used a similar setup and command.
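The note does not say how the 35,000-line subset was drawn. As a rough sketch of one way to draw such a subset while keeping source and target lines paired, assuming the corpus is held as two line-aligned lists (the function name `sample_parallel` and the toy data are hypothetical):

```python
import random


def sample_parallel(src_lines, tgt_lines, k, seed=0):
    """Draw k aligned (source, target) pairs without breaking the pairing."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            "corpus is misaligned: %d source lines vs %d target lines"
            % (len(src_lines), len(tgt_lines))
        )
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    k = min(k, len(src_lines))
    # Sample the indices once and apply them to both sides.
    indices = sorted(rng.sample(range(len(src_lines)), k))
    return [src_lines[i] for i in indices], [tgt_lines[i] for i in indices]


# Toy stand-in for the real corpus; real data would be read from files.
english = ["en-%d" % i for i in range(100)]
hindi = ["hi-%d" % i for i in range(100)]

sub_en, sub_hi = sample_parallel(english, hindi, k=10)
for en, hi in zip(sub_en[:3], sub_hi[:3]):
    print(en, "|", hi)
```

The length check at the top doubles as a cheap sanity test: if the two sides of the full 7-million-line corpus differ in line count, the pairs are shifted somewhere and the model would be trained on mismatched sentences.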

Environment information

https://github.com/phildani7/my_attempts/blob/master/big_en_hi_small.ipynb


OS: Ubuntu 18.04.1 LTS

$ pip freeze | grep tensor
tensor2tensor==1.9.0
tensorboard==1.10.0
tensorflow-gpu==1.10.1

$ python -V
Python 3.6.5 :: Anaconda, Inc.

### For bugs: reproduction and error logs

# Steps to reproduce:
I am running "t2t-trainer" on a terminal so as to watch the output.

phil@philc:~$ t2t-trainer   --data_dir=$DATA_DIR   --t2t_usr_dir=./big_en_hi_small/trainer   --problem=big_en_hi_small   --model=transformer   --hparams_set=transformer_base   --output_dir=$OUTDIR   --worker_gpu=2   --train_steps=10000000
/home/phil/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
INFO:tensorflow:Importing user module trainer from path /home/phil/big_en_hi_small
WARNING:tensorflow:From /home/phil/anaconda3/lib/python3.6/site-packages/tensor2tensor/utils/trainer_lib.py:198: RunConfig.__init__ (from tensorflow.contrib.learn.python.learn.estimators.run_config) is deprecated and will be removed in a future version.
Instructions for updating:
When switching to tf.estimator.Estimator, use tf.estimator.RunConfig instead.
INFO:tensorflow:schedule=continuous_train_and_eval
INFO:tensorflow:worker_gpu=2
INFO:tensorflow:sync=False
WARNING:tensorflow:Schedule=continuous_train_and_eval. Assuming that training is running on a single machine.
INFO:tensorflow:datashard_devices: ['gpu:0', 'gpu:1']
INFO:tensorflow:caching_devices: None
INFO:tensorflow:ps_devices: ['gpu:0', 'gpu:1']
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f7008771320>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': None, '_log_step_count_steps': 100, '_session_config': gpu_options {
  per_process_gpu_memory_fraction: 0.95
}
allow_soft_placement: true
graph_options {
  optimizer_options {
  }
}
, '_save_checkpoints_steps': 1000, '_keep_checkpoint_max': 20, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/home/phil/big_en_hi_small/trained_model', 'use_tpu': False, 't2t_device_info': {'num_async_replicas': 1}, 'data_parallelism': <tensor2tensor.utils.expert_utils.Parallelism object at 0x7f7008771518>}
WARNING:tensorflow:Estimator's model_fn (<function T2TModel.make_estimator_model_fn.<locals>.wrapping_model_fn at 0x7f7007ef5f28>) includes params argument, but params are not passed to Estimator.
WARNING:tensorflow:ValidationMonitor only works with --schedule=train_and_evaluate
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps 1000 or save_checkpoints_secs None.
INFO:tensorflow:Reading data files from /home/phil/t2t_data/big_en_hi_small-train*
INFO:tensorflow:partition: 0 num_data_files: 90
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Setting T2TModel mode to 'train'
INFO:tensorflow:Using variable initializer: uniform_unit_scaling
INFO:tensorflow:Transforming feature 'inputs' with symbol_modality_8268_512.bottom
WARNING:tensorflow:From /home/phil/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/function.py:986: calling Graph.create_op (from tensorflow.python.framework.ops) with compute_shapes is deprecated and will be removed in a future version.
Instructions for updating:
Shapes are always computed; don't use the compute_shapes as it has no effect.
INFO:tensorflow:Transforming 'targets' with symbol_modality_8268_512.targets_bottom
INFO:tensorflow:Building model body
INFO:tensorflow:Transforming body output with symbol_modality_8268_512.top
INFO:tensorflow:Transforming feature 'inputs' with symbol_modality_8268_512.bottom
INFO:tensorflow:Transforming 'targets' with symbol_modality_8268_512.targets_bottom
INFO:tensorflow:Building model body
INFO:tensorflow:Transforming body output with symbol_modality_8268_512.top
INFO:tensorflow:Base learning rate: 2.000000
INFO:tensorflow:Trainable Variables Total size: 48353280
INFO:tensorflow:Using optimizer Adam
/home/phil/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py:108: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2018-10-12 05:05:27.002156: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-10-12 05:05:27.189582: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-10-12 05:05:27.190085: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties: 
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:02:00.0
totalMemory: 11.91GiB freeMemory: 11.74GiB
2018-10-12 05:05:27.263350: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-10-12 05:05:27.263801: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 1 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:01:00.0
totalMemory: 10.92GiB freeMemory: 10.57GiB
2018-10-12 05:05:27.265215: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0, 1
2018-10-12 05:05:28.795640: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-12 05:05:28.795666: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0 1 
2018-10-12 05:05:28.795685: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N Y 
2018-10-12 05:05:28.795690: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1:   Y N 
2018-10-12 05:05:28.796434: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11586 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:02:00.0, compute capability: 6.1)
2018-10-12 05:05:28.881219: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10619 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from /home/phil/big_en_hi_small/trained_model/model.ckpt-1068000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1068000 into /home/phil/big_en_hi_small/trained_model/model.ckpt.
INFO:tensorflow:loss = 0.37076592, step = 1068000
INFO:tensorflow:global_step/sec: 2.82033
INFO:tensorflow:loss = 0.45989177, step = 1068100 (35.457 sec)
INFO:tensorflow:global_step/sec: 3.89435
INFO:tensorflow:loss = 0.6136617, step = 1068200 (25.678 sec)
INFO:tensorflow:global_step/sec: 3.84404
INFO:tensorflow:loss = 0.40534708, step = 1068300 (26.014 sec)
INFO:tensorflow:global_step/sec: 3.82018
INFO:tensorflow:loss = 0.42608535, step = 1068400 (26.177 sec)
INFO:tensorflow:global_step/sec: 3.83804
INFO:tensorflow:loss = 0.46848035, step = 1068500 (26.055 sec)
INFO:tensorflow:global_step/sec: 3.83651
INFO:tensorflow:loss = 0.4674313, step = 1068600 (26.066 sec)
INFO:tensorflow:global_step/sec: 3.81488
INFO:tensorflow:loss = 0.45866266, step = 1068700 (26.213 sec)
INFO:tensorflow:global_step/sec: 3.83821
INFO:tensorflow:loss = 0.41043162, step = 1068800 (26.054 sec)
INFO:tensorflow:global_step/sec: 3.80485
INFO:tensorflow:loss = 0.42918268, step = 1068900 (26.282 sec)
INFO:tensorflow:Saving checkpoints for 1069000 into /home/phil/big_en_hi_small/trained_model/model.ckpt.
INFO:tensorflow:Reading data files from /home/phil/t2t_data/big_en_hi_small-dev*
INFO:tensorflow:partition: 0 num_data_files: 10
WARNING:tensorflow:Padding the batch to ensure that remainder eval batches have a batch size divisible by the number of data shards. This may lead to incorrect metrics for non-zero-padded features, e.g. images. Use a single datashard (i.e. 1 GPU) in that case.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Setting T2TModel mode to 'eval'
INFO:tensorflow:Setting hparams.dropout to 0.0
INFO:tensorflow:Setting hparams.label_smoothing to 0.0
INFO:tensorflow:Setting hparams.layer_prepostprocess_dropout to 0.0
INFO:tensorflow:Setting hparams.symbol_dropout to 0.0
INFO:tensorflow:Setting hparams.attention_dropout to 0.0
INFO:tensorflow:Setting hparams.relu_dropout to 0.0
INFO:tensorflow:Using variable initializer: uniform_unit_scaling
INFO:tensorflow:Transforming feature 'inputs' with symbol_modality_8268_512.bottom
INFO:tensorflow:Transforming 'targets' with symbol_modality_8268_512.targets_bottom
INFO:tensorflow:Building model body
INFO:tensorflow:Transforming body output with symbol_modality_8268_512.top
INFO:tensorflow:Transforming feature 'inputs' with symbol_modality_8268_512.bottom
INFO:tensorflow:Transforming 'targets' with symbol_modality_8268_512.targets_bottom
INFO:tensorflow:Building model body
INFO:tensorflow:Transforming body output with symbol_modality_8268_512.top
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-10-11-23:40:40
INFO:tensorflow:Graph was finalized.
2018-10-12 05:10:40.881060: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0, 1
2018-10-12 05:10:40.881158: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-12 05:10:40.881165: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0 1 
2018-10-12 05:10:40.881182: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N Y 
2018-10-12 05:10:40.881185: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1:   Y N 
2018-10-12 05:10:40.881359: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11586 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:02:00.0, compute capability: 6.1)
2018-10-12 05:10:40.881481: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10619 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from /home/phil/big_en_hi_small/trained_model/model.ckpt-1069000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [10/100]
INFO:tensorflow:Evaluation [20/100]
INFO:tensorflow:Evaluation [30/100]
INFO:tensorflow:Evaluation [40/100]
INFO:tensorflow:Evaluation [50/100]
INFO:tensorflow:Evaluation [60/100]
INFO:tensorflow:Evaluation [70/100]
INFO:tensorflow:Evaluation [80/100]
INFO:tensorflow:Evaluation [90/100]
INFO:tensorflow:Evaluation [100/100]
INFO:tensorflow:Finished evaluation at 2018-10-11-23:42:24
INFO:tensorflow:Saving dict for global step 1069000: global_step = 1069000, loss = 0.9450921, metrics-big_en_hi_small/targets/accuracy = 0.7974321, metrics-big_en_hi_small/targets/accuracy_per_sequence = 0.068794124, metrics-big_en_hi_small/targets/accuracy_top5 = 0.9366053, metrics-big_en_hi_small/targets/approx_bleu_score = 0.58855116, metrics-big_en_hi_small/targets/neg_log_perplexity = -0.9441027, metrics-big_en_hi_small/targets/rouge_2_fscore = 0.6465996, metrics-big_en_hi_small/targets/rouge_L_fscore = 0.75816894
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 1069000: 
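
For reference, the numbers in the log above can be pulled out and compared programmatically. The sketch below is illustrative only: the helper names are hypothetical, and it assumes that `neg_log_perplexity` is an average per-token log-likelihood in nats, so that perplexity is its negated exponential. Under that assumption the eval value of -0.9441027 corresponds to a perplexity of roughly 2.57.

```python
import math
import re

# Matches training-loop lines such as
# "INFO:tensorflow:loss = 0.45989177, step = 1068100 (35.457 sec)".
TRAIN_LOSS_RE = re.compile(r"loss = ([0-9.]+), step = (\d+)")


def training_losses(log_text):
    """Return (step, loss) pairs for every training-loss line in the log."""
    return [(int(step), float(loss)) for loss, step in TRAIN_LOSS_RE.findall(log_text)]


def perplexity(neg_log_perplexity):
    """Convert an average log-likelihood (assumed to be in nats) to perplexity."""
    return math.exp(-neg_log_perplexity)


# A few lines copied from the log above.
log_excerpt = """\
INFO:tensorflow:loss = 0.37076592, step = 1068000
INFO:tensorflow:global_step/sec: 2.82033
INFO:tensorflow:loss = 0.45989177, step = 1068100 (35.457 sec)
INFO:tensorflow:global_step/sec: 3.89435
INFO:tensorflow:loss = 0.6136617, step = 1068200 (25.678 sec)
"""

pairs = training_losses(log_excerpt)
mean_train_loss = sum(loss for _, loss in pairs) / len(pairs)
eval_perplexity = perplexity(-0.9441027)  # value taken from the eval line above

print("training losses:", pairs)
print("mean training loss: %.4f" % mean_train_loss)
print("eval perplexity: %.2f" % eval_perplexity)
```

Plotting the extracted (step, loss) pairs over a longer stretch of the log would show whether the loss is still drifting downward or has been flat for the whole run.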

# Error logs:

AsmaFaraji commented 5 years ago

I am training an English-to-Persian model and I have the same problem. Were you able to overcome it?