Please test it with pytorch-nightly if you can, as the last stable release has some incompatibility when compiling your model.
Compiling the entire model would be messy due to the dynamic memory. I tried compiling individual components earlier but did not observe any real speed improvements -- I would be happy to revisit if the latest torch compile is better.
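For reference, a minimal sketch of what compiling a single component in isolation could look like (the module below is a hypothetical stand-in, not one of the repo's classes):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for one of the model's components.
class SmallEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 64, 3, padding=1)

    def forward(self, x):
        return torch.relu(self.conv(x))

encoder = SmallEncoder().cuda()
# Compile just this component; the rest of the model stays eager,
# so dynamic memory handling elsewhere is unaffected.
encoder = torch.compile(encoder)

x = torch.randn(2, 3, 224, 224, device='cuda')
out = encoder(x)  # first call triggers compilation, later calls reuse it
```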
I'm not sure what the problem with AMP is. It should train without errors with AMP.
With nightly you have better coverage also on the dynamic part, but the main problem is in `segment`, and it seems more related to mixing float and half somewhere.
Another slightly related question: did you have good GPU occupancy without compilation?
Is the loss computation under AMP? https://github.com/hkchengrex/Cutie/blob/6d93ec14ae54a4b27f91e7bde7cadf2ff62a8ee7/cutie/model/losses.py#L62
Another slightly related question: did you have good GPU occupancy without compilation?
I think so.
Is the loss computation under AMP?
No.
Doesn't the loss need to be under autocast and the backward outside? https://pytorch.org/docs/stable/amp.html
Backward is outside of AMP. While the exclusion is unintentional, I don't see a problem omitting the loss computation from AMP -- besides a slightly higher running time. The docs also say "should wrap only", not "must wrap".
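For context, the recipe from the linked docs looks roughly like this (a minimal, self-contained sketch; the tiny model, data, and optimizer are placeholders, not the repo's training code):

```python
import torch
import torch.nn as nn

device = 'cuda'
model = nn.Linear(16, 4).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler()

data = torch.randn(8, 16, device=device)
target = torch.randint(0, 4, (8,), device=device)

optimizer.zero_grad(set_to_none=True)
# Forward pass and the loss computation under autocast...
with torch.cuda.amp.autocast():
    output = model(data)              # matmuls run in fp16 where safe
    loss = criterion(output, target)
# ...while backward and the optimizer step stay outside the autocast region.
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```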
Yes, but I am only investigating because, when compiling, a half/float mixing operation emerges in the loss backward and it fails.
It is caused by the subsections where you have disabled AMP. See https://github.com/pytorch/pytorch/issues/123560
Do you have some specific ablation/experimental reason to force these code sections to not use AMP? https://github.com/search?q=repo%3Ahkchengrex%2FCutie+amp.autocast%28enabled%3DFalse%29&type=code
At first, we were training without any casting back to fp32, but training would occasionally go NaN. We then added these conversions in regions that I think would require higher precision, and NaN became very rare. We did not individually ablate each region due to the randomness of NaN occurrence.
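For reference, the cast-back pattern being discussed looks roughly like this (a minimal sketch; the ops are only illustrative, not the repo's actual code):

```python
import torch

w = torch.randn(8, 8, device='cuda')

def sensitive_block(x):
    # Locally opt out of autocast for a numerically sensitive region,
    # casting the input back to fp32.
    with torch.cuda.amp.autocast(enabled=False):
        return torch.softmax(x.float(), dim=-1)

with torch.cuda.amp.autocast():
    h = torch.randn(4, 8, device='cuda')
    h = h @ w                  # autocast runs this matmul in fp16
    out = sensitive_block(h)   # this part is forced back to fp32
print(out.dtype)  # torch.float32
```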
I am also experiencing random NaN in grad_norm, even with the original higher-precision sections. But I have not investigated where it is coming from.
Some NaN in grad_norm is fine. The AMP scaler will skip past those.
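To illustrate the skipping behaviour (a minimal sketch with a toy model; the NaN is injected by hand just to trigger the skip):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(8, 4, device='cuda')
with torch.cuda.amp.autocast():
    loss = model(x).sum()
scaler.scale(loss).backward()

# Poison one gradient to simulate a NaN grad_norm on this iteration.
next(model.parameters()).grad[0, 0] = float('nan')

before = [p.detach().clone() for p in model.parameters()]
scaler.step(optimizer)   # step is skipped because non-finite grads were found
scaler.update()          # the loss scale is reduced for the next iteration
unchanged = all(torch.equal(b, p) for b, p in zip(before, model.parameters()))
print(unchanged)  # True: the optimizer step was skipped
```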
Yes. In any case, if you are interested, by aligning the AMP sections, e.g. changing `with torch.cuda.amp.autocast(enabled=False):` to `True`, your full model is going to be compilable with PyTorch nightly (e.g. by decorating your whole model's forward in the wrapper): https://github.com/hkchengrex/Cutie/blob/2ac7ac21d048e7ff8b2b033a084e0e4ea7b1216c/cutie/model/train_wrapper.py#L13-L25
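For illustration, decorating the wrapper's forward could look roughly like this (a sketch with a hypothetical wrapper class, not the repo's actual train wrapper):

```python
import torch
import torch.nn as nn

class Wrapper(nn.Module):
    # Hypothetical stand-in for the repo's training wrapper.
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(16, 16)

    @torch.compile  # compile the whole forward so everything is traced together
    def forward(self, x):
        return self.backbone(x)

wrapper = Wrapper().cuda()
out = wrapper(torch.randn(2, 16, device='cuda'))  # first call compiles
```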
But then, with the grad_norm NaN frequency, I think it will be a problem without isolating the source.
Decorating this function with `@torch.compile` https://github.com/hkchengrex/Cutie/blob/6d93ec14ae54a4b27f91e7bde7cadf2ff62a8ee7/cutie/model/cutie.py#L164 is, on training, going to create a mismatch between `half` and `float`. Is AMP correctly applied?
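One quick way to see where half and float mix is to print the dtypes at the boundaries of the suspected region while running a single iteration (a debugging sketch; `report_dtypes` is just a hypothetical helper):

```python
import torch

def report_dtypes(name, *tensors):
    # Hypothetical helper: print the dtypes flowing through a region.
    print(name, [t.dtype for t in tensors if isinstance(t, torch.Tensor)])

with torch.cuda.amp.autocast():
    a = torch.randn(4, 8, device='cuda')
    w = torch.randn(8, 8, device='cuda')
    b = a @ w                                 # fp16 under autocast
    report_dtypes('after matmul', a, w, b)    # float32, float32, float16
```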