TEDLIUM2 cannot converge in training

meixitu commented 6 years ago

Have I written custom code (as opposed to running examples on an unmodified clone of the repository): NO
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
TensorFlow installed from (our builds, or upstream TensorFlow): use requirements.txt
TensorFlow version (use command below): tensorflow-gpu 1.6.0
Python version: 3.6.6
Bazel version (if compiling from source):
GCC/Compiler version (if compiling from source):
CUDA/cuDNN version: CUDA 9.0 / cuDNN 7.1.2
GPU model and memory: GTX 1080 Ti, 9 GPUs
Exact command to reproduce: ./bin/run_ted.sh

I downloaded this code from master this week and tried both the Common Voice dataset and the TEDLIUM2 dataset. Neither of them converges.

I used run-ted.sh to run the TED training; this is the final command.

This is the log. The loss does not seem to decrease.

reuben commented 6 years ago

The bin/run-* scripts were tested at some point in the far past. Since then the architecture has gone through major changes and it's quite likely that the hyperparameters no longer make sense. I don't think we'll be training on individual datasets just for the sake of maintaining those files, so maybe we should remove them to avoid the misdirection, and document only the hyperparameters we actually test (e.g. for the release models). @kdavis-mozilla @lissyx what do you think?

meixitu commented 6 years ago

I think I made a mistake. Training on the TED dataset did converge after 10 epochs. Because WER, src, and res are not shown during training, I assumed it had not converged, since the loss was not decreasing significantly.
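
For anyone who hits the same confusion: the loss value alone can be hard to read, so comparing a reference transcript (src) with the decoded result (res) by word error rate may give a clearer signal. Below is a minimal sketch of a WER calculation based on word-level edit distance. The function name `word_error_rate` and its arguments are hypothetical and written only for illustration; this is not the code DeepSpeech itself uses to report WER.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Hypothetical helper for illustration; not part of DeepSpeech.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Calling it on a few (src, res) pairs taken from a test or validation pass should show whether the transcriptions are improving between checkpoints, even when the loss curve looks flat.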


mozilla / DeepSpeech

TEDLIUM2 cannot converge in training #1509