rwth-i6 / returnn

The RWTH extensible training framework for universal recurrent neural networks
http://returnn.readthedocs.io/

Non-deterministic training #1210

Open albertz opened 1 year ago

albertz commented 1 year ago

I'm not really sure whether this is a bug, or what we can really do about it. However, I'm opening this now because I noticed again quite a large effect:

output/exp_fs_base/conformer_pre10_d384_h6_blstmf2_specaug_attdrop01_posdrop01_aux48_bhv14/recog_results_per_epoch/040 
{"hub5e_00": 33.4, "hub5e_01": 30.8, "rt03s": 34.7}
output/exp_fs_base/conformer_pre10_d384_h6_blstmf2_specaug_attdrop01_posdrop01_aux48_bhv14_copy1/recog_results_per_epoch/040
{"hub5e_00": 33.3, "hub5e_01": 30.6, "rt03s": 34.8} 
output/exp_fs_base/conformer_pre10_d384_h6_blstmf2_specaug_attdrop01_posdrop01_aux48_bhv14_copy2/recog_results_per_epoch/040 
{"hub5e_00": 34.5, "hub5e_01": 31.1, "rt03s": 36.1}
output/exp_fs_base/conformer_pre10_d384_h6_blstmf2_specaug_attdrop01_posdrop01_aux48_bhv14_copy3/recog_results_per_epoch/040
{"hub5e_00": 32.1, "hub5e_01": 29.8, "rt03s": 33.5} 

These are all identical configs (same seeds) with the same RETURNN version.

I looked at our deterministic_train option again, which so far only affects the aggregation_method of compute_gradients. However, from looking at the code, I think both variants should be deterministic (at least with a recent TF version), and the difference is rather a performance (speed) vs. memory tradeoff.
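For reference, a minimal sketch of what such a mapping can look like with the TF1-style optimizer API; the helper function and its default here are illustrative assumptions, not necessarily RETURNN's actual code:

```python
import tensorflow as tf

# Sketch only (assumed mapping, not RETURNN's actual implementation):
# a deterministic_train flag selecting the gradient aggregation method
# for a TF1-style optimizer.
def compute_grads(optimizer, loss, deterministic_train=False):
    # ADD_N sums all gradient terms in a single add_n (more memory);
    # EXPERIMENTAL_TREE reduces them pairwise in a tree (less memory).
    # Both add the same terms, just in a different association order, so
    # float rounding can differ in the last bits, but neither should vary
    # across identical runs on its own.
    aggregation = (
        tf.AggregationMethod.ADD_N
        if deterministic_train
        else tf.AggregationMethod.EXPERIMENTAL_TREE)
    return optimizer.compute_gradients(loss, aggregation_method=aggregation)
```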

We never really investigated further where this non-determinism comes from and what we can do about it.

Also, from what I heard from Google, they don't seem to have such a problem; their training is very deterministic. I think I even read in one paper that it is deterministic down to the last bit, but I'm not sure.

albertz commented 1 year ago

Ok, this uses our native CTC, which we know has some non-determinism. Maybe that causes the large effect here?
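One way to narrow this down (just a suggestion, not something tested here) would be to force determinism for the built-in TF GPU ops and see whether the run-to-run spread shrinks; note that this would not cover a custom op like the native CTC:

```python
import os
import tensorflow as tf

# Suggested experiment, not part of the current setup: request deterministic
# kernels for the built-in TF ops, then compare two otherwise identical runs.
if hasattr(tf.config.experimental, "enable_op_determinism"):  # TF >= 2.9
    tf.config.experimental.enable_op_determinism()
else:  # older TF 2.x: must be set before any ops run
    os.environ["TF_DETERMINISTIC_OPS"] = "1"

# If the deviation between identical runs stays large, the remaining
# non-determinism likely comes from custom ops (e.g. the native CTC),
# the data pipeline, or step-counting effects rather than cuDNN/TF kernels.
```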

albertz commented 1 year ago

@mmz33 @JackTemaki @Marvin84 @christophmluscher @ZhouW321 have you recently looked into this, or just tested it?

albertz commented 1 year ago

More examples:

output/exp_fs_base/conformer_pre10_d384_h6_blstmf2_oldspecaug4a_oldtwarp_attdrop01_aux48/recog_results_per_epoch/150 
{"hub5e_00": 23.4, "hub5e_01": 15.2, "rt03s": 21.2}
output/exp_fs_base/conformer_pre10_d384_h6_blstmf2_oldspecaug4a_oldtwarp_attdrop01_aux4812/recog_results_per_epoch/150 
{"hub5e_00": 22.9, "hub5e_01": 16.0, "rt03s": 21.3}

And:

output/exp_fs_base/conformer_pre10_d384_h6_blstmf2_specaug_attdrop01_posdrop01_aux48/recog_results_per_epoch/150
{"hub5e_00": 19.5, "hub5e_01": 15.7, "rt03s": 19.2}
output/exp_fs_base/conformer_pre10_d384_h6_blstmf2_specaug_attdrop01_posdrop01_aux4812/recog_results_per_epoch/150
{"hub5e_00": 20.9, "hub5e_01": 15.6, "rt03s": 19.7}

I just noticed that I had a bug in aux4812 and it was in fact the same as aux48.

JackTemaki commented 1 year ago

> have you recently looked into this, or just tested it?

Attention ASR training has always been somewhat non-deterministic for me, but not as much as you are reporting here; probably not more than 0.3% deviation in the worst case. Hybrid deviates by at most ~0.1% WER, I would say. Autoregressive TTS was much worse (in my master's thesis, one metric would range from 23 to 27), and it does not even use CTC. So maybe CTC is not at fault.

JackTemaki commented 1 year ago

Correction: The TTS experiments were with new seeds set on purpose, so we can exclude that here.

albertz commented 1 year ago

> Correction: The TTS experiments were with new seeds set on purpose, so we can exclude that here.

So are you saying that with the same seed they are very deterministic, or did you never test that?

albertz commented 1 year ago

I'm also using a Conformer here, instead of a BLSTM as in similar earlier determinism experiments. Maybe the Conformer also leads to more non-determinism?

albertz commented 1 year ago

I recently observed some potential non-determinism in gradient accumulation and maybe other things which make use of the global train step (https://github.com/rwth-i6/returnn/issues/1205). The PR https://github.com/rwth-i6/returnn/pull/1206/ is supposed to fix that with the new behavior version 15, but I don't know the results yet.

Actually, this is how I got started on this: I wanted to see the difference between behavior version 14 and 15, and for that I wanted to know how much noise to expect due to non-determinism. The variation due to non-determinism turns out to be much higher, so I can't really tell whether PR #1206 with behavior version 15 makes any difference.
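To illustrate the kind of problem meant here (a hypothetical sketch in plain Python, not the code touched by #1205/#1206): if the accumulation boundary is derived from a shared global step that other code also advances or reads at slightly different points, the set of micro-batches that ends up in one update can shift between runs; keying the boundary on a counter owned by the accumulator itself avoids that particular source of divergence.

```python
import numpy as np

# Hypothetical sketch: gradient accumulation keyed on a dedicated local counter
# rather than a shared global train step (names are illustrative, not RETURNN's API).
class GradAccumulator:
    def __init__(self, accum_steps: int):
        self.accum_steps = accum_steps
        self._micro_step = 0  # counter owned by the accumulator itself
        self._buffer = None

    def add(self, grads):
        """Accumulate gradients from one micro-batch."""
        if self._buffer is None:
            self._buffer = [np.array(g, dtype=np.float64) for g in grads]
        else:
            for buf, g in zip(self._buffer, grads):
                buf += g
        self._micro_step += 1

    def maybe_get_update(self):
        """Return averaged gradients exactly every accum_steps micro-batches, else None."""
        if self._buffer is None or self._micro_step % self.accum_steps != 0:
            return None
        update = [buf / self.accum_steps for buf in self._buffer]
        self._buffer = None
        return update


# Tiny usage example: with accum_steps=2, every second micro-batch yields an update.
acc = GradAccumulator(accum_steps=2)
for step in range(4):
    acc.add([np.ones(3)])
    update = acc.maybe_get_update()
    if update is not None:
        print("apply update after micro-step", step + 1, update)
```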

JackTemaki commented 1 year ago

> Correction: The TTS experiments were with new seeds set on purpose, so we can exclude that here.

> So are you saying that with the same seed they are very deterministic, or did you never test that?

Never tested.

albertz commented 1 year ago

Maybe this is also related to returnn-common in some way. These are pure returnn-common setups, i.e. with a new SpecAugment implementation, etc.

albertz commented 1 year ago

A starting point would be to check and update our get_non_deterministic_ops_from_graph function. I think it is not up-to-date anymore.
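For context, a rough sketch of what such a graph check can look like; the op-type list below is only an illustrative placeholder (atomic-add based GPU kernels are the usual suspects), and keeping it in sync with the current TF version is exactly the update suggested above:

```python
import tensorflow as tf

# Illustrative placeholder list, not RETURNN's actual one: op types whose GPU
# kernels (or gradients) are commonly reported as non-deterministic because
# they rely on atomic adds. Must be checked against the TF version in use.
_SUSPECT_OP_TYPES = {
    "UnsortedSegmentSum",
    "ScatterNd",
    "GatherNd",  # its gradient scatters with atomic adds
}

def find_suspect_ops(graph=None):
    """Return all ops in the (TF1-style) graph whose type is on the suspect list."""
    graph = graph or tf.compat.v1.get_default_graph()
    return [op for op in graph.get_operations() if op.type in _SUSPECT_OP_TYPES]
```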