cc @gau-nernst @felipemello1
@yf225 have you seen this error before? It doesn't happen with torch nightlies, or in CI where we use `backend="aot_eager"`.
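For context, a minimal sketch (not the actual CI configuration) of compiling with the `aot_eager` backend mentioned above; the toy module is a stand-in:

```python
# aot_eager traces forward/backward with AOTAutograd but executes the graph
# with eager kernels, skipping Inductor codegen -- which is why it can mask
# failures specific to the default Inductor backend.
import torch

model = torch.compile(torch.nn.Linear(4, 4), backend="aot_eager")
out = model(torch.randn(2, 4))
```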
We are compiling it per layer here: https://github.com/pytorch/torchtune/blob/82c232d0679ddef3fc419cdc18af758b98b4da05/recipes/full_finetune_single_device.py#L364
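For illustration, a minimal sketch of per-layer compilation in the spirit of the linked recipe; the toy model below is a hypothetical stand-in, not the actual torchtune code:

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    """Hypothetical stand-in for the transformer being fine-tuned."""

    def __init__(self) -> None:
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True)
            for _ in range(2)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)
        return x

model = TinyModel()
# Compile each layer individually rather than the whole model; since the
# layers are structurally identical, this can reduce compile time.
for layer in model.layers:
    layer.compile()

out = model(torch.randn(1, 8, 32))
```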
I wonder, is it on a specific PyTorch version?
I tried to repro on the latest torchtune main and PyTorch main, but couldn't reproduce the error 🤔
This is PyTorch 2.4. Sorry I didn't make that clear, @yf225.
Hmm, I wonder if it would be okay to require running on PyTorch nightly? I might need to look into this, but my worry is that even if there is a fix, we won't be able to retroactively add it to the PyTorch 2.4 release 😞
Perhaps we can investigate whether the old ways of doing compile work on PyTorch 2.4: compiling the whole model (`model.compile()`) and compiling the loss step (`torch.compile(_loss_step)`). For the compiled loss step, I think last time I also only tested it with torch nightly...
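A minimal sketch of those two older approaches; `_loss_step` and its arguments are hypothetical stand-ins, not the actual recipe code, and in practice one would pick one approach or the other:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4))
loss_fn = nn.CrossEntropyLoss()

# Option 1: compile the whole model in place (nn.Module.compile).
model.compile()

# Option 2: compile the full loss step as one function, capturing the
# forward pass and loss computation in a single graph.
def _loss_step(batch: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    logits = model(batch)
    return loss_fn(logits, labels)

compiled_loss_step = torch.compile(_loss_step)
loss = compiled_loss_step(torch.randn(8, 16), torch.randint(0, 4, (8,)))
```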
Not supporting the latest stable PyTorch seems like a big deal. In my experience, apart from use cases that require stability, stable versions are also needed for reproducible experiments, since specific nightly builds eventually disappear.
Yeah, we can definitely just version-gate if the new ways we're compiling break things on 2.4. It's a bit of a UX hit, but I agree that we always want to at least support the latest stable version. Also, we missed this because the compile backend for our tests is `aot_eager` (ref). Big thanks to @gau-nernst for catching both of these issues; I am (slowly) chipping away at debugging the CI coverage one in #1508.
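A hedged sketch of what such a version gate could look like; the helper name, the 2.5 cutoff, and the `model.layers` attribute are assumptions for illustration, not the actual torchtune implementation:

```python
import torch
from packaging import version

# True on PyTorch 2.5+ (including 2.5 nightly/dev builds).
_TORCH_GE_2_5 = version.parse(torch.__version__).release >= (2, 5)

def compile_model(model: torch.nn.Module) -> None:
    if _TORCH_GE_2_5:
        # Newer PyTorch: compile per layer (assumes a `layers` container).
        for layer in model.layers:
            layer.compile()
    else:
        # PyTorch 2.4 and earlier: fall back to whole-model compilation
        # (assuming it is verified to work on the stable release).
        model.compile()
```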
Repro:
This results in an error.
Full stack trace