albertz opened this issue 2 years ago
Ok, this uses our native-CTC, which we know has some non-determinism. Maybe that causes the big effect here?
@mmz33 @JackTemaki @Marvin84 @christophmluscher @ZhouW321 have you recently looked into this, or just tested it?
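As a quick check on the TF side, one could force deterministic kernels and see whether the spread goes away. A sketch in TF2 style (note this only covers stock TF ops, not our native CTC op itself, and older TF versions only have the env var):

```python
import os

# For older TF (~1.14-2.7), determinism of GPU kernels is requested via an env var;
# it must be set before TF initializes.
os.environ["TF_DETERMINISTIC_OPS"] = "1"

import tensorflow as tf

# TF >= 2.8/2.9: ops either run deterministically or raise an error
# instead of silently using a non-deterministic kernel.
if hasattr(tf.config.experimental, "enable_op_determinism"):
    tf.config.experimental.enable_op_determinism()

# Op determinism only removes kernel-level noise (atomics, reduction order);
# seeds still have to be fixed explicitly.
tf.random.set_seed(42)
```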
More examples:
output/exp_fs_base/conformer_pre10_d384_h6_blstmf2_oldspecaug4a_oldtwarp_attdrop01_aux48/recog_results_per_epoch/150
{"hub5e_00": 23.4, "hub5e_01": 15.2, "rt03s": 21.2}
output/exp_fs_base/conformer_pre10_d384_h6_blstmf2_oldspecaug4a_oldtwarp_attdrop01_aux4812/recog_results_per_epoch/150
{"hub5e_00": 22.9, "hub5e_01": 16.0, "rt03s": 21.3}
And:
output/exp_fs_base/conformer_pre10_d384_h6_blstmf2_specaug_attdrop01_posdrop01_aux48/recog_results_per_epoch/150
{"hub5e_00": 19.5, "hub5e_01": 15.7, "rt03s": 19.2}
output/exp_fs_base/conformer_pre10_d384_h6_blstmf2_specaug_attdrop01_posdrop01_aux4812/recog_results_per_epoch/150
{"hub5e_00": 20.9, "hub5e_01": 15.6, "rt03s": 19.7}
I just noticed that I had a bug in aux4812 and it was in fact the same as aux48.
> have you recently looked into this, or just tested it?
Attention ASR training has always been somewhat non-deterministic for me, but not as much as you are reporting here, probably not more than 0.3% deviation in the worst case. Hybrid deviates by at most ~0.1% WER, I would say. Autoregressive TTS was much worse (in my master thesis, some metric would range from 23 to 27), and it does not even use CTC. So maybe CTC is not at fault.
Correction: The TTS experiments were with new seeds set on purpose, so we can exclude that here.
> Correction: The TTS experiments were with new seeds set on purpose, so we can exclude that here.
So you say with same seed, they are very deterministic, or you never tested that?
I also use Conformer here, instead of BLSTM as I did in similar earlier determinism experiments. Maybe Conformer also leads to more non-determinism?
I recently observed some potential non-determinism in gradient accumulation and maybe other things which make use of the global train step (https://github.com/rwth-i6/returnn/issues/1205). The PR https://github.com/rwth-i6/returnn/pull/1206/ is supposed to fix that with new behavior version 15 but I don't know the results yet.
Actually, this is how I started on this: I wanted to see the difference between behavior version 14 and 15, and then check how much noise to expect due to non-determinism. But the difference due to non-determinism is much higher, so I can't really tell whether the PR #1206 with behavior version 15 makes any difference.
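To illustrate the kind of issue meant in #1205 with a generic sketch (not the actual RETURNN implementation): if the point at which accumulated gradients are applied is derived from a global train step counter, then any divergence of that counter between two runs shifts which mini-batches get averaged together, and from then on the runs diverge.

```python
import tensorflow as tf

ACCUM_NUM = 4  # apply gradients once every 4 mini-batches

model = tf.keras.Sequential([tf.keras.layers.Dense(8, activation="relu"),
                             tf.keras.layers.Dense(1)])
model.build(input_shape=(None, 16))
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)
global_step = tf.Variable(0, dtype=tf.int64)
accum_grads = [tf.Variable(tf.zeros_like(v)) for v in model.trainable_variables]

def train_step(x, y):
    """Gradient accumulation keyed off a global step counter (eager, generic sketch)."""
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(model(x, training=True) - y))
    grads = tape.gradient(loss, model.trainable_variables)
    for acc, g in zip(accum_grads, grads):
        acc.assign_add(g)
    # If the global step ever gets out of sync between two runs (the suspicion
    # in #1205), this boundary shifts, different mini-batches get averaged
    # together, and from that point on the two runs diverge.
    if int(global_step + 1) % ACCUM_NUM == 0:
        optimizer.apply_gradients(
            [(acc / ACCUM_NUM, v) for acc, v in zip(accum_grads, model.trainable_variables)])
        for acc in accum_grads:
            acc.assign(tf.zeros_like(acc))
    global_step.assign_add(1)
    return float(loss)
```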
> Correction: The TTS experiments were with new seeds set on purpose, so we can exclude that here.
>
> So you say with same seed, they are very deterministic, or you never tested that?
Never tested.
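A cheap way to actually test it: train the identical config twice for a few epochs with a fixed seed (I assume the random_seed config option here) and diff the checkpoints; with bit-wise determinism every max abs diff should be exactly 0. A rough sketch (the checkpoint paths are just placeholders):

```python
import numpy as np
import tensorflow as tf

def compare_checkpoints(ckpt_a, ckpt_b, atol=0.0):
    """Compare all numeric variables of two TF checkpoints and report differences."""
    reader_a = tf.train.load_checkpoint(ckpt_a)
    reader_b = tf.train.load_checkpoint(ckpt_b)
    for name in sorted(reader_a.get_variable_to_shape_map()):
        a = np.asarray(reader_a.get_tensor(name))
        b = np.asarray(reader_b.get_tensor(name))
        if a.dtype.kind not in "fiu":  # skip non-numeric variables
            continue
        diff = np.max(np.abs(a.astype(np.float64) - b.astype(np.float64)))
        if diff > atol:
            print(f"{name}: max abs diff {diff:g}")

# Placeholder checkpoint prefixes of two runs of the identical config and seed:
compare_checkpoints("run1/output/models/epoch.005", "run2/output/models/epoch.005")
```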
Maybe this is also related to returnn-common in some way. These are pure returnn-common setups, i.e. this is also a new SpecAugment implementation, etc.
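For reference, SpecAugment itself is just random time/frequency masking, so any difference from the new implementation would have to come from how the random numbers are drawn (which seed, which RNG, per-step vs. per-sequence). A minimal NumPy sketch (not the actual returnn-common code):

```python
import numpy as np

def specaugment(spectrogram, rng, max_time_masks=2, max_time_width=20,
                max_freq_masks=2, max_freq_width=10):
    """Random time/frequency masking on a (time, freq) feature matrix.

    All randomness goes through the passed-in `rng`; with a fixed seed this is
    fully reproducible, so any run-to-run difference would have to come from
    how the RNG is seeded / advanced in the actual training pipeline.
    """
    x = spectrogram.copy()
    num_time, num_freq = x.shape
    for _ in range(rng.integers(0, max_time_masks + 1)):
        width = rng.integers(1, max_time_width + 1)
        start = rng.integers(0, max(1, num_time - width))
        x[start:start + width, :] = 0.0
    for _ in range(rng.integers(0, max_freq_masks + 1)):
        width = rng.integers(1, max_freq_width + 1)
        start = rng.integers(0, max(1, num_freq - width))
        x[:, start:start + width] = 0.0
    return x

rng = np.random.default_rng(42)
features = rng.normal(size=(100, 40)).astype("float32")
masked = specaugment(features, rng)
```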
A starting point would be to check and update our `get_non_deterministic_ops_from_graph` function. I think this is not up-to-date anymore.
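The idea would be roughly the following (a sketch, not the actual implementation): walk the graph and flag op types known to have non-deterministic (GPU) kernels. The op type list below is only illustrative, not authoritative, and is exactly the part that needs maintenance.

```python
import tensorflow as tf

# Example op types with GPU kernels that are known or suspected to be
# non-deterministic (atomics, unordered reductions); illustrative list only.
_KNOWN_NON_DETERMINISTIC_OP_TYPES = {
    "UnsortedSegmentSum",
    "ScatterAdd",
    "GatherNd",  # its gradient uses scatter-add
    "CTCLoss",
}

def find_non_deterministic_ops(graph):
    """Return all ops in the graph whose type is in the known-bad list."""
    return [op for op in graph.get_operations()
            if op.type in _KNOWN_NON_DETERMINISTIC_OP_TYPES]
```

This would be called on e.g. `tf.compat.v1.get_default_graph()` after the network has been constructed.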
I'm not really sure whether this is a bug, or what we can really do about it. However, I am opening this now because I noticed again a quite huge effect:
These are all identical configs (same seeds), same RETURNN versions.
I looked at our `deterministic_train` option again, which so far only has an effect on the `aggregation_method` of `compute_gradients`. However, from looking at the code, I think that both variants should be deterministic (at least with a recent TF version), and the difference is rather a performance (speed) vs. memory tradeoff.

We never really investigated further where the non-determinism comes from and what we can do about it.
Also, from what I heard from Google, they don't seem to have such a problem; for them it is very deterministic. I think in one paper I even read that it is deterministic down to every single bit, but I'm not sure.