Closed — desertfire closed this issue 1 year ago
Ok, bisecting points to https://github.com/pytorch/pytorch/pull/87492. https://github.com/pytorch/pytorch/pull/90746 reverts it.
To reproduce:

```sh
for i in {1..20}; do python benchmarks/dynamo/huggingface.py --training --accuracy --device cuda --amp --only AlbertForQuestionAnswering --ci --backend aot_inductor_debug; done
```
Note that the problem exists with the aot_inductor_debug backend but not with aot_eager, so it is likely a decomposition issue.
I've narrowed the issue to the following decomps: layernorm, tanh_backwards, tanh, and softmax.
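A minimal pure-Python sketch (not the actual Inductor decomps) of why decompositions like these can cause accuracy flakiness: rewriting an op into mathematically equivalent primitives can change the floating-point result slightly, which a tight accuracy check may then flag.

```python
import math

def tanh_grad_direct(x):
    # "Fused"-style formulation: d/dx tanh(x) = 1 - tanh(x)^2
    t = math.tanh(x)
    return 1.0 - t * t

def tanh_grad_decomposed(x):
    # Mathematically equivalent decomposition via sech^2(x) = 4 / (e^x + e^-x)^2
    denom = math.exp(x) + math.exp(-x)
    return 4.0 / (denom * denom)

# The two formulations agree to within rounding error, but they need not
# be bitwise identical -- in lower precision (e.g. amp) such drift can be
# enough to flip a strict accuracy comparison.
for x in [0.1, 1.0, 3.0]:
    a, b = tanh_grad_direct(x), tanh_grad_decomposed(x)
    assert abs(a - b) < 1e-12, (x, a, b)
```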
Resetting the RNG in the HuggingFace models removes the flakiness. I'm looking into what to do (if anything) about the decomps, since softmax and layernorm are crucial for performance.
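A hedged sketch of why reseeding helps, using Python's `random` as a stand-in for the torch RNG (the `noisy_step` function is hypothetical, not from the benchmark suite): if consecutive runs consume RNG state (e.g. via dropout), their outputs diverge; pinning the seed before each run makes them identical.

```python
import random

def noisy_step():
    # Stand-in for a model step that consumes RNG state (e.g. dropout masks).
    return [random.random() for _ in range(4)]

# Without reseeding, consecutive runs advance the RNG state
# and therefore produce different values.
random.seed(0)
run_a = noisy_step()
run_b = noisy_step()
assert run_a != run_b

# Reseeding before each run pins the RNG state, so every run is identical --
# the same idea as resetting the RNG between benchmark iterations.
random.seed(0)
run_c = noisy_step()
random.seed(0)
run_d = noisy_step()
assert run_c == run_d == run_a
```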
Flakiness fixed by https://github.com/pytorch/pytorch/pull/90936
https://hud.pytorch.org/hud/pytorch/pytorch/master/1?per_page=50&name_filter=inductor