[Open] spakhomov opened this issue 6 years ago
Evaluation uses teacher-forcing, so you should expect quality to degrade when you do autoregressive prediction.
For determinism, ensure hparams.sampling_method="argmax". Also try turning off beam search (beam_size=1). It would be helpful to know if the issue is in the greedy codepath or the beam search one (or both).
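For reference, both settings can also be passed directly to t2t-decoder on the command line; this is a sketch where the flag names are t2t-decoder's standard flags but the problem, model, and path values are placeholders for this setup:

```shell
# beam_size=1 disables beam search (greedy codepath);
# sampling_method=argmax removes sampling randomness.
t2t-decoder \
  --problem=spontspeech \
  --model=transformer \
  --hparams="sampling_method=argmax" \
  --decode_hparams="beam_size=1,alpha=0.0" \
  --data_dir=/t2t/eval/ \
  --output_dir=/path/to/train_dir
```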
Thanks for the suggestions and for the clarification regarding teacher-forcing - that makes sense. I tried adding the sampling_method="argmax" parameter and beam_size=1, but neither seems to affect the non-determinism issue, whether I add them separately or together. Here is the output from sending the same short audio snippet several times in a row to tensorflow-serving, which is serving a t2t model. I modified query.py to add the hparams and print them out to confirm - see the console output after HPARAMS:. Maybe I am not setting these parameters correctly?
Setting the parameters in query.py:
def main(_):
  tf.logging.set_verbosity(tf.logging.INFO)
  usr_dir.import_usr_dir(FLAGS.t2t_usr_dir)
  problem = registry.problem(FLAGS.problem)
  hparams = tf.contrib.training.HParams(
      data_dir=os.path.expanduser(FLAGS.data_dir),
      sampling_method="argmax",
      beam_size=1)
  problem.get_hparams(hparams)
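As a quick sanity check, independent of T2T internals, one can wrap whatever prediction call query.py ends up making and compare repeated outputs directly; predict_fn below is a hypothetical stand-in for the serving request, not a real T2T function:

```python
def check_determinism(predict_fn, example, n=5):
    """Send the same example n times; report whether all outputs match.

    predict_fn is any callable mapping an input to a transcript string.
    Returns (is_deterministic, sorted list of distinct outputs seen).
    """
    outputs = [predict_fn(example) for _ in range(n)]
    distinct = sorted(set(outputs))
    return len(distinct) == 1, distinct
```

With argmax sampling and beam_size=1 on CPU, this should come back deterministic; if it still varies, the randomness is coming from somewhere other than the decode hparams.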
Console output:
root@gpunode:/notebooks# python /t2t/user-modules/query.py --server=localhost:9000 --servable_name=saved_model --problem=spontspeech --t2t_usr_dir=/t2t/user-modules/ --data_dir=/t2t/eval/ --timeout_secs=60 --hparams="max_input_seq_length=187000,max_target_seq_length=350,max_length=187000,batch_size=450000,num_hidden_layers=4"
/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
  from ._conv import register_converters as _register_converters
INFO:tensorflow:Importing user module user-modules from path /t2t
HPARAMS: [('audio_add_delta_deltas', True), ('audio_dither', 3.051850947599719e-05), ('audio_frame_length', 25.0), ('audio_frame_step', 10.0), ('audio_keep_example_waveforms', False), ('audio_lower_edge_hertz', 20.0), ('audio_num_mel_bins', 80), ('audio_preemphasis', 0.97), ('audio_preproc_in_bottom', False), ('audio_sample_rate', 16000), ('audio_upper_edge_hertz', 8000.0), ('beam_size', 1), ('data_dir', '/t2t/eval/'), ('num_zeropad_frames', 250), ('sampling_method', 'argmax')]
Hyp: OKAY I SEE WHEN
Ref: I SEE WHAT
Hyp: I SEE WHEN
Ref: I SEE WHAT
Hyp: OKAY I SEE WHEN
Ref: I SEE WHAT
Hyp: OKAY I SEE WHEN
Ref: I SEE WHAT
Hyp: HAVE TO SEE WHEN
Ref: I SEE WHAT
@spakhomov hi, I have met a very similar problem. I trained a Transformer model for a speech recognition problem using the librispeech problem as the template. Training data is librispeech-960h; the dev set is librispeech-dev-clean. After training 2,570,000 steps on one 1080 Ti GPU, accuracy on the dev set is 90.4%, i.e. CER on the dev set is 9.6%. However, when I use t2t-decoder to run inference on the SAME dev set, CER is 47.7%. I don't know why the gap between training and inference behavior is so big. @rsepassi said it is because evaluation uses teacher forcing while inference is auto-regressive, but I don't think that alone explains such a large gap (~38% absolute). Scheduled Sampling in the paper "State-of-the-Art Speech Recognition with Sequence-to-Sequence Models" reduces this gap and improved WER by 7.8% relative. So, is something wrong in my steps, or is there another reason for this gap?
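One thing worth checking when comparing these numbers: the accuracy metric reported during training is computed per token under teacher forcing, which is not the same computation as edit-distance CER on fully decoded strings. As a common reference point, a minimal edit-distance CER (this is a standard Levenshtein implementation, not T2T's metric code):

```python
def cer(ref, hyp):
    """Character error rate: Levenshtein distance / reference length."""
    r, h = list(ref), list(hyp)
    # Standard dynamic-programming edit distance, one row at a time.
    d = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(h) + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                       # deletion
                       d[j - 1] + 1,                   # insertion
                       prev + (r[i - 1] != h[j - 1]))  # substitution
            prev = cur
    return d[len(h)] / max(len(r), 1)
```

Running something like this over t2t-decoder's output files gives a CER that is directly comparable between runs, independent of how the training-time metric is defined.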
@spakhomov can I contact you? Do you have an email to write to you?
Sure - rxinform@umn.edu Serguei
I have trained a Transformer model for a speech recognition problem using the librispeech problem as the template. Training data includes librispeech as well as other datasets. The model appears to have trained OK. For intermittent evaluation, I am using a small development dataset (held out from training) of about 500 recordings of the same length as those used in training. The loss and accuracy metrics during training (t2t-trainer command) look fine - I stopped training after about 2M steps, when accuracy reached 92% on the DEV set and stopped improving further. At that point, I exported the model with export.py and then tried testing with t2t-decoder using the latest checkpoint. I also tried query.py (going against a TF server serving the exported model). I am experiencing two problems:
When I use the exact same DEV dataset that was used for evaluation during training, with either t2t-decoder on the dataset or a TF server query, I get output that looks fairly reasonable but only around 70% accuracy (~30% CER), rather than the 92% accuracy (8% CER) I see during training. Note that the same DEV set is used for periodic evaluations during training and for decoding, so I expected roughly the same performance, but there is a large difference.
When I decode the same audio example several times in a row with the exact same parameters, the output is slightly different each time (across the entire dataset, accuracy varies by 3-4% from run to run). My understanding is that some non-deterministic output may be expected on GPUs, but the differences I am observing between repeated runs seem too large for that.
Has anyone else experienced this?
I am new to T2T and I suspect that I must be doing something wrong or missing something obvious with the decoding but I can't figure out what might be the problem or even where to begin figuring this out - any tips for where to look would be very much appreciated!
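For anyone hitting the same eval-vs-decode gap: the difference comes from how the target sequence is fed to the model. A toy sketch of the two regimes (this is illustrative, not T2T's actual decode loop; step_fn is a hypothetical function mapping a token prefix to the next token):

```python
def teacher_forced(step_fn, ref_tokens):
    """Eval-style: every prediction is conditioned on the GOLD prefix,
    so one wrong token never contaminates later steps."""
    return [step_fn(ref_tokens[:i]) for i in range(len(ref_tokens))]

def autoregressive(step_fn, max_len, eos=None):
    """Decode-style: every prediction is conditioned on PREVIOUS
    predictions, so an early mistake propagates down the sequence."""
    out = []
    for _ in range(max_len):
        tok = step_fn(out)
        if tok == eos:
            break
        out.append(tok)
    return out
```

This is why teacher-forced accuracy during training is an optimistic bound on decode accuracy, though it does not by itself explain non-determinism between repeated decodes.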
TensorFlow and tensor2tensor versions:
TF: tensorflow-gpu==1.6.0rc1
T2T: 1.5.3
Hardware: 2 GPUs (1080 Ti)