tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

Question: Poor performance for librispeech task #1022

Open zh794390558 opened 6 years ago

zh794390558 commented 6 years ago

Description

I have used T2T to test LibriSpeech performance before, and I remember the WER on test-clean was about 7%, but now the WER is very poor. The decoding output is also the same sentence for every utterance. Can anybody give some advice?

Environment information

OS: centos7

$ pip freeze | grep tensor
tensor2tensor==1.8.0
tensorboard==1.9.0
tensorflow-gpu==1.9.0

$ python -V
3.5.0

I used these options for the test, but the result was also poor.

#PROBLEM=librispeech
PROBLEM=librispeech_train_full_test_clean
MODEL=transformer
#HPARAMS=transformer_librispeech
HPARAMS=transformer_librispeech_tpu

. 02config.sh

CUDA_VISIBLE_DEVICES=1 t2t-trainer \
    --model=$MODEL \
    --hparams_set=$HPARAMS \
    --problem=$PROBLEM \
    --train_steps=120000 \
    --eval_steps=3 \
    --local_eval_frequency=100 \
    --data_dir=$DATA_DIR \
    --export_dir=$EXPORT_DIR \
    --output_dir=$OUT_DIR

Training output:

INFO:tensorflow:Saving dict for global step 120000: global_step = 120000, loss = 1.373911, metrics-librispeech_train_full_test_clean/targets/accuracy = 0.6142857, metrics-librispeech_train_full_test_clean/targets/accuracy_per_sequence = 0.0, metrics-librispeech_train_full_test_clean/targets/accuracy_top5 = 0.9, metrics-librispeech_train_full_test_clean/targets/edit_distance = 0.3857143, metrics-librispeech_train_full_test_clean/targets/neg_log_perplexity = -1.3728632
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 120000: /nfs/project/t2t/libri/train/train_full_test_clean/model.ckpt-120000

Decoding output:

+ CUDA_VISIBLE_DEVICES=1
+ t2t-decoder --data_dir=/nfs/project/zhanghui/t2t/libri/data/dataset/librispeech/ --problem=librispeech_train_full_test_clean --model=transformer --hparams_set=transformer_librispeech_tpu --output_dir=/nfs/project/zhanghui/t2t/libri/train/train_full_test_clean --eval_use_test_set=True --decode_to_file=/nfs/project/zhanghui/t2t/libri/train/train_full_test_clean/infer
WARNING:tensorflow:From /nfs/project/tools/env/tf1.9_py3.5/lib/python3.5/site-packages/tensor2tensor/utils/trainer_lib.py:198: RunConfig.__init__ (from tensorflow.contrib.learn.python.learn.estimators.run_config) is deprecated and will be removed in a future version.
Instructions for updating:
When switching to tf.estimator.Estimator, use tf.estimator.RunConfig instead.
INFO:tensorflow:schedule=continuous_train_and_eval
INFO:tensorflow:worker_gpu=1
INFO:tensorflow:sync=False
WARNING:tensorflow:Schedule=continuous_train_and_eval. Assuming that training is running on a single machine.
INFO:tensorflow:datashard_devices: ['gpu:0']
INFO:tensorflow:caching_devices: None
INFO:tensorflow:ps_devices: ['gpu:0']
INFO:tensorflow:Using config: {'_evaluation_master': '', '_num_ps_replicas': 0, '_train_distribute': None, '_model_dir': '/nfs/project/zhanghui/t2t/libri/train/train_full_test_clean', 't2t_device_info': {'num_async_replicas': 1}, 'use_tpu': False, '_environment': 'local', '_num_worker_replicas': 0, '_log_step_count_steps': 100, '_session_config': gpu_options {
  per_process_gpu_memory_fraction: 0.95
}
allow_soft_placement: true
graph_options {
  optimizer_options {
  }
}
, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_keep_checkpoint_max': 20, '_is_chief': True, '_save_summary_steps': 100, '_master': '', '_keep_checkpoint_every_n_hours': 10000, '_task_type': None, '_save_checkpoints_steps': 1000, 'data_parallelism': <tensor2tensor.utils.expert_utils.Parallelism object at 0x7fe34d2e69e8>, '_tf_random_seed': None, '_device_fn': None, '_task_id': 0, '_save_checkpoints_secs': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fe34d2e6ba8>}
WARNING:tensorflow:Estimator's model_fn (<function T2TModel.make_estimator_model_fn.<locals>.wrapping_model_fn at 0x7fe34ca57598>) includes params argument, but params are not passed to Estimator.
INFO:tensorflow:Performing local inference from dataset for librispeech_train_full_test_clean.
INFO:tensorflow:Decoding 0
INFO:tensorflow:Reading data files from /nfs/project/zhanghui/t2t/libri/data/dataset/librispeech/librispeech_clean-test*
INFO:tensorflow:partition: 0 num_data_files: 1
WARNING:tensorflow:Shapes are not fully defined. Assuming batch_size means tokens.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Unsetting shared_embedding_and_softmax_weights.
INFO:tensorflow:Setting T2TModel mode to 'infer'
INFO:tensorflow:Setting hparams.relu_dropout to 0.0
INFO:tensorflow:Setting hparams.label_smoothing to 0.0
INFO:tensorflow:Setting hparams.symbol_dropout to 0.0
INFO:tensorflow:Setting hparams.layer_prepostprocess_dropout to 0.0
INFO:tensorflow:Setting hparams.dropout to 0.0
INFO:tensorflow:Setting hparams.attention_dropout to 0.0
INFO:tensorflow:Beam Decoding with beam size 4
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
2018-08-28 10:24:56.081642: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-08-28 10:24:56.353688: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1392] Found device 0 with properties:
name: Tesla P40 major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 0000:03:00.0
totalMemory: 22.38GiB freeMemory: 22.21GiB
2018-08-28 10:24:56.354271: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1471] Adding visible gpu devices: 0
2018-08-28 10:24:56.710902: I tensorflow/core/common_runtime/gpu/gpu_device.cc:952] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-28 10:24:56.711607: I tensorflow/core/common_runtime/gpu/gpu_device.cc:958]      0
2018-08-28 10:24:56.711914: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   N
2018-08-28 10:24:56.712764: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 21767 MB memory) -> physical GPU (device: 0, name: Tesla P40, pci bus id: 0000:03:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from /nfs/project/zhanghui/t2t/libri/train/train_full_test_clean/model.ckpt-120000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Inference results OUTPUT: AND THEN THEY WERE ALL THEY HAD BEEN TOLD THEM THEY WOULD HAVE BEEN THEREFORE THEY HAD BEEN TOLD THEM THEY WOULD HAVE BEEN THEREFORE THEY ARE
INFO:tensorflow:Inference results TARGET: HE HOPED THERE WOULD BE STEW FOR DINNER TURNIPS AND CARROTS AND BRUISED POTATOES AND FAT MUTTON PIECES TO BE LADLED OUT IN THICK PEPPERED FLOUR FATTENED SAUCE
INFO:tensorflow:Inference results OUTPUT: AND THEN THEY HAD BEEN TOLD THEM THAT THEY WERE ALL THEY HAD
INFO:tensorflow:Inference results TARGET: STUFF IT INTO YOU HIS BELLY COUNSELLED HIM
INFO:tensorflow:Inference results OUTPUT: AND THEN THEY WERE ALL THEY HAD BEEN TOLD THEM THEY WOULD HAVE BEEN TOLD THEM THEY WOULD HAVE BEEN TOLD THEM
INFO:tensorflow:Inference results TARGET: AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS
INFO:tensorflow:Inference results OUTPUT: AND THEN THEY WERE ALL THEY HAD NOT BEEN TOLD THEM
INFO:tensorflow:Inference results TARGET: HELLO BERTIE ANY GOOD IN YOUR MIND
INFO:tensorflow:Inference results OUTPUT: AND THEN THEY WERE ALL THEY HAD BEEN TOLD THEM THEY WOULD HAVE BEEN
INFO:tensorflow:Inference results TARGET: NUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD NIGHT HUSBAND
INFO:tensorflow:Inference results OUTPUT: AND THEN THEY WERE ALL THEY HAD BEEN TOLD THEM THEY WOULD HAVE BEEN THEREFORE THEY HAD BEEN TOLD THEM THEY WOULD HAVE BEEN
INFO:tensorflow:Inference results TARGET: THE MUSIC CAME NEARER AND HE RECALLED THE WORDS THE WORDS OF SHELLEY'S FRAGMENT UPON THE MOON WANDERING COMPANIONLESS PALE FOR WEARINESS
INFO:tensorflow:Inference results OUTPUT: AND THEN THEY WERE ALL THEY HAD BEEN TOLD THEM THEY WOULD HAVE BEEN THEREFORE THEY HAD BEEN TOLD THEM THEY WOULD HAVE BEEN
INFO:tensorflow:Inference results TARGET: THE DULL LIGHT FELL MORE FAINTLY UPON THE PAGE WHEREON ANOTHER EQUATION BEGAN TO UNFOLD ITSELF SLOWLY AND TO SPREAD ABROAD ITS WIDENING TAIL
INFO:tensorflow:Inference results OUTPUT: AND THEN THEY WERE ALL THEY HAD BEEN TOLD THEM THEY WOULD HAVE BEEN TOLD THEM
INFO:tensorflow:Inference results TARGET: A COLD LUCID INDIFFERENCE REIGNED IN HIS SOUL
INFO:tensorflow:Inference results OUTPUT: AND THEN THEY WERE ALL THEY WERE ALL THEY WERE ALL THEY HAD BEEN TOLD THEM AND THEY WERE ALL THEY HAD BEEN ABLE TO DO
INFO:tensorflow:Inference results TARGET: THE CHAOS IN WHICH HIS ARDOUR EXTINGUISHED ITSELF WAS A COLD INDIFFERENT KNOWLEDGE OF HIMSELF
INFO:tensorflow:Inference results OUTPUT: AND THEN THEY WERE ALL THEY HAD BEEN TOLD THEM THEY WOULD HAVE BEEN THEREFORE THEY HAD BEEN TOLD THEM THEY WOULD HAVE BEEN THEREFORE THEY ARE
INFO:tensorflow:Inference results TARGET: AT MOST BY AN ALMS GIVEN TO A BEGGAR WHOSE BLESSING HE FLED FROM HE MIGHT HOPE WEARILY TO WIN FOR HIMSELF SOME MEASURE OF ACTUAL GRACE
INFO:tensorflow:Inference results OUTPUT: AND THEN THEY WERE ALL THEY HAD BEEN TOLD THEM THEY WOULD HAVE BEEN ABLE TO DO
INFO:tensorflow:Inference results TARGET: WELL NOW ENNIS I DECLARE YOU HAVE A HEAD AND SO HAS MY STICK

Dataset

librispeech-dev-00000-of-00001          librispeech-train-00032-of-00100        librispeech-train-00066-of-00100        librispeech_clean-dev-00000-of-00001    librispeech_clean-train-00032-of-00100  librispeech_clean-train-00066-of-00100
librispeech-test-00000-of-00001         librispeech-train-00033-of-00100        librispeech-train-00067-of-00100        librispeech_clean-test-00000-of-00001   librispeech_clean-train-00033-of-00100  librispeech_clean-train-00067-of-00100
librispeech-train-00000-of-00100        librispeech-train-00034-of-00100        librispeech-train-00068-of-00100        librispeech_clean-train-00000-of-00100  librispeech_clean-train-00034-of-00100  librispeech_clean-train-00068-of-00100
librispeech-train-00001-of-00100        librispeech-train-00035-of-00100        librispeech-train-00069-of-00100        librispeech_clean-train-00001-of-00100  librispeech_clean-train-00035-of-00100  librispeech_clean-train-00069-of-00100
librispeech-train-00002-of-00100        librispeech-train-00036-of-00100        librispeech-train-00070-of-00100        librispeech_clean-train-00002-of-00100  librispeech_clean-train-00036-of-00100  librispeech_clean-train-00070-of-00100
librispeech-train-00003-of-00100        librispeech-train-00037-of-00100        librispeech-train-00071-of-00100        librispeech_clean-train-00003-of-00100  librispeech_clean-train-00037-of-00100  librispeech_clean-train-00071-of-00100
librispeech-train-00004-of-00100        librispeech-train-00038-of-00100        librispeech-train-00072-of-00100        librispeech_clean-train-00004-of-00100  librispeech_clean-train-00038-of-00100  librispeech_clean-train-00072-of-00100
librispeech-train-00005-of-00100        librispeech-train-00039-of-00100        librispeech-train-00073-of-00100        librispeech_clean-train-00005-of-00100  librispeech_clean-train-00039-of-00100  librispeech_clean-train-00073-of-00100
librispeech-train-00006-of-00100        librispeech-train-00040-of-00100        librispeech-train-00074-of-00100        librispeech_clean-train-00006-of-00100  librispeech_clean-train-00040-of-00100  librispeech_clean-train-00074-of-00100
librispeech-train-00007-of-00100        librispeech-train-00041-of-00100        librispeech-train-00075-of-00100        librispeech_clean-train-00007-of-00100  librispeech_clean-train-00041-of-00100  librispeech_clean-train-00075-of-00100
librispeech-train-00008-of-00100        librispeech-train-00042-of-00100        librispeech-train-00076-of-00100        librispeech_clean-train-00008-of-00100  librispeech_clean-train-00042-of-00100  librispeech_clean-train-00076-of-00100
librispeech-train-00009-of-00100        librispeech-train-00043-of-00100        librispeech-train-00077-of-00100        librispeech_clean-train-00009-of-00100  librispeech_clean-train-00043-of-00100  librispeech_clean-train-00077-of-00100
librispeech-train-00010-of-00100        librispeech-train-00044-of-00100        librispeech-train-00078-of-00100        librispeech_clean-train-00010-of-00100  librispeech_clean-train-00044-of-00100  librispeech_clean-train-00078-of-00100
librispeech-train-00011-of-00100        librispeech-train-00045-of-00100        librispeech-train-00079-of-00100        librispeech_clean-train-00011-of-00100  librispeech_clean-train-00045-of-00100  librispeech_clean-train-00079-of-00100
librispeech-train-00012-of-00100        librispeech-train-00046-of-00100        librispeech-train-00080-of-00100        librispeech_clean-train-00012-of-00100  librispeech_clean-train-00046-of-00100  librispeech_clean-train-00080-of-00100
librispeech-train-00013-of-00100        librispeech-train-00047-of-00100        librispeech-train-00081-of-00100        librispeech_clean-train-00013-of-00100  librispeech_clean-train-00047-of-00100  librispeech_clean-train-00081-of-00100
librispeech-train-00014-of-00100        librispeech-train-00048-of-00100        librispeech-train-00082-of-00100        librispeech_clean-train-00014-of-00100  librispeech_clean-train-00048-of-00100  librispeech_clean-train-00082-of-00100
librispeech-train-00015-of-00100        librispeech-train-00049-of-00100        librispeech-train-00083-of-00100        librispeech_clean-train-00015-of-00100  librispeech_clean-train-00049-of-00100  librispeech_clean-train-00083-of-00100
librispeech-train-00016-of-00100        librispeech-train-00050-of-00100        librispeech-train-00084-of-00100        librispeech_clean-train-00016-of-00100  librispeech_clean-train-00050-of-00100  librispeech_clean-train-00084-of-00100
librispeech-train-00017-of-00100        librispeech-train-00051-of-00100        librispeech-train-00085-of-00100        librispeech_clean-train-00017-of-00100  librispeech_clean-train-00051-of-00100  librispeech_clean-train-00085-of-00100
librispeech-train-00018-of-00100        librispeech-train-00052-of-00100        librispeech-train-00086-of-00100        librispeech_clean-train-00018-of-00100  librispeech_clean-train-00052-of-00100  librispeech_clean-train-00086-of-00100
librispeech-train-00019-of-00100        librispeech-train-00053-of-00100        librispeech-train-00087-of-00100        librispeech_clean-train-00019-of-00100  librispeech_clean-train-00053-of-00100  librispeech_clean-train-00087-of-00100
librispeech-train-00020-of-00100        librispeech-train-00054-of-00100        librispeech-train-00088-of-00100        librispeech_clean-train-00020-of-00100  librispeech_clean-train-00054-of-00100  librispeech_clean-train-00088-of-00100
librispeech-train-00021-of-00100        librispeech-train-00055-of-00100        librispeech-train-00089-of-00100        librispeech_clean-train-00021-of-00100  librispeech_clean-train-00055-of-00100  librispeech_clean-train-00089-of-00100
librispeech-train-00022-of-00100        librispeech-train-00056-of-00100        librispeech-train-00090-of-00100        librispeech_clean-train-00022-of-00100  librispeech_clean-train-00056-of-00100  librispeech_clean-train-00090-of-00100
librispeech-train-00023-of-00100        librispeech-train-00057-of-00100        librispeech-train-00091-of-00100        librispeech_clean-train-00023-of-00100  librispeech_clean-train-00057-of-00100  librispeech_clean-train-00091-of-00100
librispeech-train-00024-of-00100        librispeech-train-00058-of-00100        librispeech-train-00092-of-00100        librispeech_clean-train-00024-of-00100  librispeech_clean-train-00058-of-00100  librispeech_clean-train-00092-of-00100
librispeech-train-00025-of-00100        librispeech-train-00059-of-00100        librispeech-train-00093-of-00100        librispeech_clean-train-00025-of-00100  librispeech_clean-train-00059-of-00100  librispeech_clean-train-00093-of-00100
librispeech-train-00026-of-00100        librispeech-train-00060-of-00100        librispeech-train-00094-of-00100        librispeech_clean-train-00026-of-00100  librispeech_clean-train-00060-of-00100  librispeech_clean-train-00094-of-00100
librispeech-train-00027-of-00100        librispeech-train-00061-of-00100        librispeech-train-00095-of-00100        librispeech_clean-train-00027-of-00100  librispeech_clean-train-00061-of-00100  librispeech_clean-train-00095-of-00100
librispeech-train-00028-of-00100        librispeech-train-00062-of-00100        librispeech-train-00096-of-00100        librispeech_clean-train-00028-of-00100  librispeech_clean-train-00062-of-00100  librispeech_clean-train-00096-of-00100
librispeech-train-00029-of-00100        librispeech-train-00063-of-00100        librispeech-train-00097-of-00100        librispeech_clean-train-00029-of-00100  librispeech_clean-train-00063-of-00100  librispeech_clean-train-00097-of-00100
librispeech-train-00030-of-00100        librispeech-train-00064-of-00100        librispeech-train-00098-of-00100        librispeech_clean-train-00030-of-00100  librispeech_clean-train-00064-of-00100  librispeech_clean-train-00098-of-00100
librispeech-train-00031-of-00100        librispeech-train-00065-of-00100        librispeech-train-00099-of-00100        librispeech_clean-train-00031-of-00100  librispeech_clean-train-00065-of-00100  librispeech_clean-train-00099-of-00100
Qiaoxl commented 6 years ago

Your loss is too high (loss = 1.373911). In my experience, you won't get any acceptable predictions with a loss higher than 1. I have never tried the transformer_librispeech_tpu hparams set. If the loss is still dropping, you can try training for more steps, but after 120k steps with such a high loss I wouldn't expect a good result anyway. The key point may be this: I assume you are training on a single local GPU, so why are you using the TPU hparams? Try transformer_librispeech_v1; after only 20k-30k steps you can get a good result.

zh794390558 commented 6 years ago

@Qiaoxl Thanks for your help. How many GPUs are you using? Is that a key point?

Qiaoxl commented 6 years ago

@zh794390558 It doesn't matter very much how many GPUs you are using, but with more GPUs you can use a larger batch_size, which helps (see Training Tips for the Transformer Model).
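As a hedged sketch only (the hparams-set name and the scaling factor below are illustrative, not something from this thread), one way to raise batch_size is to register a small variant of the recommended hparams set in a --t2t_usr_dir module; the simpler alternative, shown later in this thread, is passing --hparams="batch_size=..." on the command line.

from tensor2tensor.models import transformer
from tensor2tensor.utils import registry


@registry.register_hparams
def transformer_librispeech_v1_bigbatch():
  """Hypothetical variant of transformer_librispeech_v1 with a larger batch."""
  hparams = transformer.transformer_librispeech_v1()
  # batch_size is counted in tokens/frames here, not sentences
  # (see the "Assuming batch_size means tokens" warning in the log above).
  hparams.batch_size *= 2
  return hparams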

zackkui commented 6 years ago

How can you get librispeech_train_full_test_clean data with t2t-datagen?

Qiaoxl commented 6 years ago

How can you get librispeech_train_full_test_clean data with t2t-datagen?

I don't quite understand your question. With --problem=librispeech_train_full_test_clean, t2t-datagen will first download the data if it is not found. If the program has trouble downloading, you can pre-download the data yourself; see the details in https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/librispeech.py#L26-L59
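If you do pre-download, the archive URLs are the ones listed at those lines. A small hedged sketch to print them, assuming (as the linked lines suggest) that each entry is a (url, subdirectory) pair:

from tensor2tensor.data_generators import librispeech

# Print every archive URL that t2t-datagen would try to fetch, so the
# tarballs can be downloaded manually into --tmp_dir beforehand.
for url, subdir in (librispeech._LIBRISPEECH_TRAIN_DATASETS +
                    librispeech._LIBRISPEECH_DEV_DATASETS +
                    librispeech._LIBRISPEECH_TEST_DATASETS):
  print(url, subdir)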

zackkui commented 6 years ago

@Qiaoxl Thanks for replying; I know how now. I rewrote librispeech.py with a new problem name, but I forgot to change the file name prefix (e.g. "librispeech") in this code:

shard_str = "-%05d" % shard if shard is not None else ""
if mode == problem.DatasetSplit.TRAIN:
  path = os.path.join(data_dir, "librispeech")
  suffix = "train"
elif mode in [problem.DatasetSplit.EVAL, tf.estimator.ModeKeys.PREDICT]:
  path = os.path.join(data_dir, "librispeech_clean")
  suffix = "dev"
else:
  assert mode == problem.DatasetSplit.TEST
  path = os.path.join(data_dir, "librispeech_clean")
  suffix = "test"

return "%s-%s%s*" % (path, suffix, shard_str)

Qiaoxl commented 6 years ago

@zackkui
If you need to add a new librispeech problem, subclass the existing Librispeech problem, register it, and point it at the datasets it should use.
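For illustration, a minimal sketch of such a problem class, assuming the same pattern the existing subclasses in librispeech.py follow (the class name and the dataset slices below are placeholders, not part of the original comment):

from tensor2tensor.data_generators import librispeech
from tensor2tensor.utils import registry


@registry.register_problem
class LibrispeechMyCustom(librispeech.Librispeech):
  """Hypothetical LibriSpeech problem with its own train/dev/test datasets."""

  # Pick the slices of the download lists this problem should use.
  TRAIN_DATASETS = librispeech._LIBRISPEECH_TRAIN_DATASETS[:1]
  DEV_DATASETS = librispeech._LIBRISPEECH_DEV_DATASETS[:1]
  TEST_DATASETS = librispeech._LIBRISPEECH_TEST_DATASETS[:1]

If the module lives outside the tensor2tensor package, it can be made visible to t2t-datagen and t2t-trainer with --t2t_usr_dir; the registered problem name would then be librispeech_my_custom.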

Now you have a new librispeech problem.

If you added new datasets to _LIBRISPEECH_TRAIN_DATASETS, _LIBRISPEECH_DEV_DATASETS, or _LIBRISPEECH_TEST_DATASETS, you also need to adjust the slices used by the original problems (LibrispeechTrainFullTestClean, LibrispeechCleanSmall, LibrispeechClean, LibrispeechNoisy) so that their datasets stay unchanged.

zackkui commented 6 years ago

It's a good way. Thanks a lot again!

mjhanphd commented 6 years ago

Have you resolved this issue? I implemented an ASR transformer myself based on the official transformer code and I'm facing the same issue as you. During training it produces quite reasonable logits, but at test time it spits out a wrong sentence, the same sentence for every different audio clip... I have no idea why this happens at all.

anravich102 commented 5 years ago

Any progress on this issue?

zh794390558 commented 5 years ago

No progress on this. I don't have the time or the machines to test it.

Qiaoxl commented 5 years ago

Any progress on this issue?

PROBLEM=librispeech_train_full_test_clean
MODEL=transformer
HPARAMS_SET=transformer_librispeech_v1

And the training loss:

[screenshot: training loss curve]

I didn't test the WER, because I only need the trained model for transfer learning, but the result should be good.

This issue ought to be closed.

chengmengli06 commented 5 years ago

@Qiaoxl I do not know how you got this problem to train; there are several weird settings:

  1. batch_size is set to 6000000, which is too large: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py#L2670
  2. the data for this problem cannot be generated, because the following function is not implemented: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/librispeech.py#L196
  3. the dataset file names are not correct: "librispeech" and "librispeech_clean" should be changed to self.name (see the sketch after this list): https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/librispeech.py#L217
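For concreteness, a hedged sketch of the change described in point 3, reusing the filepattern logic quoted earlier in the thread. The method is shown standalone; it is meant to override filepattern() on the custom problem class, and it is my adjustment, not the actual upstream code:

import os

import tensorflow as tf
from tensor2tensor.data_generators import problem


def filepattern(self, data_dir, mode, shard=None):
  """Look for TFRecord shards under the problem's own registered name."""
  shard_str = "-%05d" % shard if shard is not None else ""
  if mode == problem.DatasetSplit.TRAIN:
    suffix = "train"
  elif mode in [problem.DatasetSplit.EVAL, tf.estimator.ModeKeys.PREDICT]:
    suffix = "dev"
  else:
    assert mode == problem.DatasetSplit.TEST
    suffix = "test"
  path = os.path.join(data_dir, self.name)  # was "librispeech" / "librispeech_clean"
  return "%s-%s%s*" % (path, suffix, shard_str)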

After fixing these problems, I could generate the data:

PYTHONPATH=. ./tensor2tensor/bin/t2t-datagen --data_dir=data/librispeech/ --tmp_dir=data/librispeech/ --problem=librispeech_train_full_test_clean

And then I started training:

CUDA_VISIBLE_DEVICES=4,5,6,7 PYTHONPATH=. nohup ./tensor2tensor/bin/t2t-trainer --model=transformer --hparams_set=transformer_librispeech_v2 --problem=librispeech_train_full_test_clean --train_steps=120000 --local_eval_frequency=5000 --eval_steps=50 --data_dir data/librispeech/ --output_dir=./librispeech_output --worker_gpu=4

The training process converges very slowly.

[screenshot: training loss curve]

stylon commented 4 years ago

I have the exact same problem. No matter what training set I use (100h or 960h), what batch_size, or what optimizer setup, I never get those nice "two-step" loss curves for librispeech. The "knee" around 20-30k steps never occurs; all my plots show just a straight exponential decay.

Needless to say, inference is completely useless at the final checkpoint (loss barely below 1.0), outputting just the same character, or nothing at all, in my case.

=== My setup ===

elementary OS (Ubuntu 18.04)

NVIDIA driver 440.64.00 (GTX1080 8GB)
CUDA 10.0 (10.0.130-1)
CUDNN 7.6.5.32-1+cuda10.0

$ pip3 freeze|grep tensor
mesh-tensorflow==0.0.5
tensor2tensor==1.13.4
tensorboard==1.14.0
tensorflow-datasets==1.0.2
tensorflow-estimator==1.14.0
tensorflow-gpu==1.14.0
tensorflow-metadata==0.14.0
tensorflow-probability==0.7.0

=== My train.sh ===

export LD_LIBRARY_PATH=/usr/local/cuda/lib64/:$LD_LIBRARY_PATH
export TF_FORCE_GPU_ALLOW_GROWTH=true
~/.local/bin/t2t-trainer \
    --generate_data \
    --problem=librispeech_clean_small \
    --model=transformer \
    --hparams_set=transformer_librispeech \
    --hparams="batch_size=2100000" \
    --train_steps=500000 \
    --eval_steps=3 \
    --local_eval_frequency=100 \
    --worker_gpu=1 \
    --data_dir=./data \
    --output_dir=./output-train \
    --tmp_dir=./tmp

I used TF_FORCE_GPU_ALLOW_GROWTH to avoid sudden OOM issues, since I occasionally use the same machine for light-weight desktop tasks. I also tried different hparams_sets (transformer_librispeech_v1), but even that didn't change the trend.

@Qiaoxl: I think it would be useful to share the environment that you used to generate your plots.

Any indicative help would be useful. I also found ticket #1245 mentioning the same issue, but it only contains a convergence plot that led to the ticket being closed. I assumed the setup would be a no-brainer, but even getting CUDA configured correctly was a major hassle, and now the training scripts also don't go in the expected direction. I would like to understand what exactly I'm doing wrong here.

stylon commented 4 years ago

By "accident" I left a training running over a couple of days and I finally got some convergence, but the loss only started to drop after some 140k steps. That's by no means in the range of reported 20-30k steps here. And I even used a smaller batch_size than the default setting of 6M. And with larger batch sizes I would expect only slower, but less noisy convergence. The loss also never reached those low values we see in Qiaoxl's plots, even after some 300k steps.

I re-ran the same training script (from scratch) and this time it didn't even converge below 1.0 after 300k steps :-(. A very stochastic outcome, and quite annoying to waste that much energy on it. Is there a trick I'm missing? Is it my setup? I'm using a single GTX 1080 (non-Ti) with 8 GB; is it due to limited hardware?