learning-at-home / lean_transformer

Memory-efficient transformer. Work in progress.
MIT License

First and second predictions yield slightly different results when both jit and dropout are enabled #9

Open justheuristic opened 2 years ago

justheuristic commented 2 years ago

Curiously, the outputs of the first two forward passes of LeanTransformer on CPU may differ by a small amount (~1e-5), even with torch.use_deterministic_algorithms(True).

To reproduce, go to this test and remove the "for i in range(2)" loop: https://github.com/learning-at-home/lean_transformer/blob/e737a8ff9274e0ff1492dc76a62ecf36506f3e67/tests/test_modifications.py#L63-L68

Known facts:

Hypothesis: the JIT may run non-optimized code on the first pass, and that code path may have different RNG behavior and/or use different dtypes than the optimized code used on later passes.
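
A minimal sketch of how one might check whether only the first call is the odd one out: reseed before every call and compare consecutive outputs. Everything here is an assumption for illustration. `consecutive_call_diffs`, `max_abs_diff` and `ToyModel` are hypothetical names, and `ToyModel` is a pure-Python stand-in whose first call consumes one extra random number. It mimics the suspected "different RNG behavior on the first pass", not the actual TorchScript internals.

```python
import random
from typing import Callable, List, Sequence


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two flat outputs."""
    if len(a) != len(b):
        raise ValueError("outputs must have the same length")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def consecutive_call_diffs(
    run_once: Callable[[], Sequence[float]],
    reseed: Callable[[], None],
    n_calls: int = 3,
) -> List[float]:
    """Reseed, call, repeat; return the max abs diff of call i vs call i+1."""
    outputs = []
    for _ in range(n_calls):
        reseed()  # identical RNG state before every call
        outputs.append(list(run_once()))
    return [max_abs_diff(outputs[i], outputs[i + 1]) for i in range(n_calls - 1)]


class ToyModel:
    """Hypothetical stand-in for a jit-compiled module with dropout.

    Assumption for illustration only: the first call takes a "warm-up"
    path that draws one extra random number, which shifts the dropout mask.
    """

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, xs: Sequence[float]) -> List[float]:
        self.calls += 1
        if self.calls == 1:
            random.random()  # extra draw on the first path only
        # Dropout with p=0.5, kept values rescaled by 1 / (1 - p) = 2
        return [x * 2.0 if random.random() >= 0.5 else 0.0 for x in xs]


model = ToyModel()
inputs = [1.0, 2.0, 3.0, 4.0]
diffs = consecutive_call_diffs(
    run_once=lambda: model(inputs),
    reseed=lambda: random.seed(0),
    n_calls=3,
)
# diffs[0] compares call 1 vs call 2, diffs[1] compares call 2 vs call 3
print(diffs)
```

If the hypothesis holds, only the first entry should be nonzero, and every later pair of calls should match exactly. With PyTorch, `reseed` could be `lambda: torch.manual_seed(0)` and `run_once` could return something like `model(x).flatten().tolist()` for the scripted model under test.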