[Open] spakhomov opened this issue 6 years ago
Evaluation uses teacher-forcing, so you should expect quality to degrade when you do autoregressive prediction.
For determinism, ensure hparams.sampling_method="argmax". Also try turning off beam search (beam_size=1). It would be helpful to know if the issue is in the greedy codepath or the beam search one (or both).
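For reference, both settings can also be passed directly to t2t-decoder on the command line; this is a sketch where the flag names are t2t-decoder's standard flags but the problem, model, and path values are placeholders for this setup:

```shell
# beam_size=1 disables beam search (greedy codepath);
# sampling_method=argmax removes sampling randomness.
t2t-decoder \
  --problem=spontspeech \
  --model=transformer \
  --hparams="sampling_method=argmax" \
  --decode_hparams="beam_size=1,alpha=0.0" \
  --data_dir=/t2t/eval/ \
  --output_dir=/path/to/train_dir
```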
Thanks for the suggestions and for the clarification regarding teacher-forcing - that makes sense. I tried adding the sampling_method="argmax" parameter and beam_size=1, but neither seems to affect the non-determinism issue, whether I add them separately or together. Here is the output from sending the same short audio snippet several times in a row to tensorflow-serving, which is serving a t2t model. I modified query.py to add the hparams and print them out to confirm - see the console output after HPARAMS:. Maybe I am not setting these parameters correctly?
Setting the parameters in query.py:
def main(_):
  tf.logging.set_verbosity(tf.logging.INFO)
  usr_dir.import_usr_dir(FLAGS.t2t_usr_dir)
  problem = registry.problem(FLAGS.problem)
  hparams = tf.contrib.training.HParams(
      data_dir=os.path.expanduser(FLAGS.data_dir),
      sampling_method="argmax",
      beam_size=1)
  problem.get_hparams(hparams)
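As a quick sanity check, independent of T2T internals, one can wrap whatever prediction call query.py ends up making and compare repeated outputs directly; predict_fn below is a hypothetical stand-in for the serving request, not a real T2T function:

```python
def check_determinism(predict_fn, example, n=5):
    """Send the same example n times; report whether all outputs match.

    predict_fn is any callable mapping an input to a transcript string.
    Returns (is_deterministic, sorted list of distinct outputs seen).
    """
    outputs = [predict_fn(example) for _ in range(n)]
    distinct = sorted(set(outputs))
    return len(distinct) == 1, distinct
```

With argmax sampling and beam_size=1 on CPU, this should come back deterministic; if it still varies, the randomness is coming from somewhere other than the decode hparams.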
Console output:
root@gpunode:/notebooks# python /t2t/user-modules/query.py --server=localhost:9000 --servable_name=saved_model --problem=spontspeech --t2t_usr_dir=/t2t/user-modules/ --data_dir=/t2t/eval/ --timeout_secs=60 --hparams="max_input_seq_length=187000,max_target_seq_length=350,max_length=187000,batch_size=450000,num_hidden_layers=4"
/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
  from ._conv import register_converters as _register_converters
INFO:tensorflow:Importing user module user-modules from path /t2t
HPARAMS: [('audio_add_delta_deltas', True), ('audio_dither', 3.051850947599719e-05), ('audio_frame_length', 25.0), ('audio_frame_step', 10.0), ('audio_keep_example_waveforms', False), ('audio_lower_edge_hertz', 20.0), ('audio_num_mel_bins', 80), ('audio_preemphasis', 0.97), ('audio_preproc_in_bottom', False), ('audio_sample_rate', 16000), ('audio_upper_edge_hertz', 8000.0), ('beam_size', 1), ('data_dir', '/t2t/eval/'), ('num_zeropad_frames', 250), ('sampling_method', 'argmax')]
Hyp: OKAY I SEE WHEN
Ref: I SEE WHAT
Hyp: I SEE WHEN
Ref: I SEE WHAT
Hyp: OKAY I SEE WHEN
Ref: I SEE WHAT
Hyp: OKAY I SEE WHEN
Ref: I SEE WHAT
Hyp: HAVE TO SEE WHEN
Ref: I SEE WHAT
@spakhomov hi, I have met a very similar problem. I trained a Transformer model for a speech recognition problem using the librispeech problem as the template. Training data is librispeech-960h; the dev set is librispeech-dev-clean. After training 2,570,000 steps on one 1080 Ti GPU, accuracy on the dev set is 90.4%, i.e. CER on the dev set is 9.6%. However, when I use t2t-decoder to run inference on the SAME dev set, CER is 47.7%. I don't know why the gap between training and inference behavior is so big. @rsepassi said it is because evaluation uses teacher forcing while inference is auto-regressive, but I don't think that alone explains such a large gap (~38% absolute). Scheduled Sampling in the paper "State-of-the-Art Speech Recognition with Sequence-to-Sequence Models" reduces this gap and improved WER by 7.8% relative. So, is something wrong in my steps, or is there another reason for this gap?
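One thing worth checking when comparing these numbers: the accuracy metric reported during training is computed per token under teacher forcing, which is not the same computation as edit-distance CER on fully decoded strings. As a common reference point, a minimal edit-distance CER (this is a standard Levenshtein implementation, not T2T's metric code):

```python
def cer(ref, hyp):
    """Character error rate: Levenshtein distance / reference length."""
    r, h = list(ref), list(hyp)
    # Standard dynamic-programming edit distance, one row at a time.
    d = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(h) + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                       # deletion
                       d[j - 1] + 1,                   # insertion
                       prev + (r[i - 1] != h[j - 1]))  # substitution
            prev = cur
    return d[len(h)] / max(len(r), 1)
```

Running something like this over t2t-decoder's output files gives a CER that is directly comparable between runs, independent of how the training-time metric is defined.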
@spakhomov can I contact you? Do you have an email to write to you?
Sure - rxinform@umn.edu Serguei
I have trained a Transformer model for a speech recognition problem using the librispeech problem as the template. Training data includes librispeech as well as other datasets. The model appears to have trained OK. For intermittent evaluation, I am using a small development dataset (held out from training) of about 500 recordings of the same length as those used in training. The loss and accuracy metrics during training (t2t-trainer command) look fine - I stopped training after about 2M steps, when accuracy reached 92% on the DEV set and stopped improving further. At that point, I exported the model with export.py and then tried testing with t2t-decoder using the latest checkpoint. I also tried query.py (going against a TF server serving the exported model). I am experiencing two problems:
When I use the exact same DEV dataset that was used for evaluation during training, with either t2t-decoder on the dataset or a TF server query, I get output that looks fairly reasonable but only around 70% accuracy (~30% CER), rather than the 92% accuracy (8% CER) I see during training. Note that the same DEV set is used for periodic evaluations during training and for decoding, so I expected roughly the same performance, but there is a large difference.
When I decode the same audio example several times in a row with the exact same parameters, the output is slightly different each time (across the entire dataset, accuracy varies by 3-4% from run to run). My understanding is that some non-deterministic output may be expected on GPUs, but the differences I am observing between repeated runs seem too large for that.
Has anyone else experienced this?
I am new to T2T and I suspect that I must be doing something wrong or missing something obvious with the decoding but I can't figure out what might be the problem or even where to begin figuring this out - any tips for where to look would be very much appreciated!
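For anyone hitting the same eval-vs-decode gap: the difference comes from how the target sequence is fed to the model. A toy sketch of the two regimes (this is illustrative, not T2T's actual decode loop; step_fn is a hypothetical function mapping a token prefix to the next token):

```python
def teacher_forced(step_fn, ref_tokens):
    """Eval-style: every prediction is conditioned on the GOLD prefix,
    so one wrong token never contaminates later steps."""
    return [step_fn(ref_tokens[:i]) for i in range(len(ref_tokens))]

def autoregressive(step_fn, max_len, eos=None):
    """Decode-style: every prediction is conditioned on PREVIOUS
    predictions, so an early mistake propagates down the sequence."""
    out = []
    for _ in range(max_len):
        tok = step_fn(out)
        if tok == eos:
            break
        out.append(tok)
    return out
```

This is why teacher-forced accuracy during training is an optimistic bound on decode accuracy, though it does not by itself explain non-determinism between repeated decodes.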
TensorFlow and tensor2tensor versions:
TF: tensorflow-gpu==1.6.0rc1
T2T: 1.5.3
Hardware: 2 GPUs (1080 Ti)