First and second predictions yield slightly different results when both jit and dropout are enabled

Curiously, the outputs of the first two iterations of LeanTransformer on CPU may differ by a small amount (~1e-5), even with torch.use_deterministic_algorithms(True).

Known facts:

- it works consistently on torch==1.10.2, but inconsistently on torch==1.11.0
- disabling JIT everywhere also fixes the issue, but it is unclear which jit-compiled function causes the inconsistency
  - to reproduce: LEAN_USE_JIT=0 pytest ./tests/test_modifications.py
- there are configurations where everything works fine: dropout=0, or disabling custom autograd in BOTH ffn and attn
- it works consistently with position_embedding_type='absolute', but inconsistently with 'rotary'
- setting the rotary cache beforehand does not seem to solve the issue
- example of a failed job without range(2): https://github.com/learning-at-home/lean_transformer/runs/5584141044?check_suite_focus=true

Hypothesis: the issue may be due to jit running non-optimized code on the first pass. That first pass may have different RNG behavior and/or use different dtypes than the later, optimized passes.
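
To make the hypothesis concrete, here is a minimal sketch of the kind of check that exposes this behavior: reseed before every call and compare each later output against the first. It uses only the standard library. `diffs_vs_first_call` and `WarmupStandIn` are hypothetical names invented for illustration and are not part of lean_transformer. The stand-in just simulates a first pass that consumes one extra random number, which is one way the suspected mechanism could show up.

```python
import random


def diffs_vs_first_call(forward, reseed, n_calls=3):
    """Run `forward` n_calls times, calling `reseed` before each run.

    Returns the max absolute difference between the first output and each
    later output. If the calls are truly deterministic, every entry is 0.0.
    """
    outputs = []
    for _ in range(n_calls):
        reseed()  # put the RNG into the same state before every call
        outputs.append([float(v) for v in forward()])
    first = outputs[0]
    return [max(abs(a - b) for a, b in zip(first, out)) for out in outputs[1:]]


class WarmupStandIn:
    """Hypothetical stand-in for a jitted module (assumption, not real code):
    the first call takes a different path that draws one extra random number."""

    def __init__(self):
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls == 1:
            random.random()  # extra RNG draw on the simulated non-optimized pass
        return [random.random() for _ in range(4)]


def reseed():
    random.seed(0)


# A function with no warm-up path is consistent across calls.
consistent = diffs_vs_first_call(lambda: [random.random() for _ in range(4)], reseed)

# The stand-in differs between call 1 and call 2, but calls 2 and 3 agree.
inconsistent = diffs_vs_first_call(WarmupStandIn(), reseed)
```

With torch, the same helper could presumably be used by passing `reseed=lambda: torch.manual_seed(0)` and a `forward` that returns something like `model(x).flatten().tolist()`, where `model` and `x` are placeholders for the module and input under test.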
