Please test it with pytorch-nightly if you can, as the last stable release has some incompatibility when compiling your model.
Compiling the entire model would be messy due to the dynamic memory. I tried compiling individual components earlier but did not observe any real speed improvements -- I would be happy to revisit if the latest torch compile is better.
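For reference, a minimal sketch of what compiling a single component in isolation could look like (the module below is a hypothetical stand-in, not one of the repo's classes):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for one of the model's components.
class SmallEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 64, 3, padding=1)

    def forward(self, x):
        return torch.relu(self.conv(x))

encoder = SmallEncoder().cuda()
# Compile just this component; the rest of the model stays eager,
# so dynamic memory handling elsewhere is unaffected.
encoder = torch.compile(encoder)

x = torch.randn(2, 3, 224, 224, device='cuda')
out = encoder(x)  # first call triggers compilation, later calls reuse it
```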
I'm not sure what the problem with AMP is. It should train without errors with AMP.
With nightly you have better coverage also on the dynamic part, but the main problem is in `segment`, and it seems more related to mixing float and half somewhere.
Another slightly related question: did you have good GPU occupancy without compilation?
Is the loss computation under AMP? https://github.com/hkchengrex/Cutie/blob/6d93ec14ae54a4b27f91e7bde7cadf2ff62a8ee7/cutie/model/losses.py#L62
Another slightly related question: did you have good GPU occupancy without compilation?
I think so.
Is the loss computation under AMP?
No.
Doesn't the loss need to be under autocast and the backward outside? https://pytorch.org/docs/stable/amp.html
Backward is outside of AMP. While the exclusion is unintentional, I don't see a problem omitting the loss computation from AMP -- besides a slightly higher running time. The docs also say "should wrap only", not "must wrap".
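For context, the recipe from the linked docs looks roughly like this (a minimal, self-contained sketch; the tiny model, data, and optimizer are placeholders, not the repo's training code):

```python
import torch
import torch.nn as nn

device = 'cuda'
model = nn.Linear(16, 4).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler()

data = torch.randn(8, 16, device=device)
target = torch.randint(0, 4, (8,), device=device)

optimizer.zero_grad(set_to_none=True)
# Forward pass and the loss computation under autocast...
with torch.cuda.amp.autocast():
    output = model(data)              # matmuls run in fp16 where safe
    loss = criterion(output, target)
# ...while backward and the optimizer step stay outside the autocast region.
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```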
Yes, but I am only investigating because, when compiling, a half/float mixing operation emerges in the loss backward and it fails.
It is caused by the subsections where you have disabled AMP. See https://github.com/pytorch/pytorch/issues/123560
Do you have some specific ablation/experimental reason to force these code sections to not use AMP? https://github.com/search?q=repo%3Ahkchengrex%2FCutie+amp.autocast%28enabled%3DFalse%29&type=code
At first, we were training without any casting back to fp32, but training would occasionally go NaN. We then added these conversions in regions that I think would require higher precision, and NaN became very rare. We did not individually ablate each region due to the randomness of NaN occurrence.
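For reference, the cast-back pattern being discussed looks roughly like this (a minimal sketch; the ops are only illustrative, not the repo's actual code):

```python
import torch

w = torch.randn(8, 8, device='cuda')

def sensitive_block(x):
    # Locally opt out of autocast for a numerically sensitive region,
    # casting the input back to fp32.
    with torch.cuda.amp.autocast(enabled=False):
        return torch.softmax(x.float(), dim=-1)

with torch.cuda.amp.autocast():
    h = torch.randn(4, 8, device='cuda')
    h = h @ w                  # autocast runs this matmul in fp16
    out = sensitive_block(h)   # this part is forced back to fp32
print(out.dtype)  # torch.float32
```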
I am also experiencing random NaN in grad_norm, even with the original higher-precision sections. But I have not investigated where it is coming from.
Some NaN in grad_norm is fine. The AMP scaler will skip past those.
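To illustrate the skipping behaviour (a minimal sketch with a toy model; the NaN is injected by hand just to trigger the skip):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(8, 4, device='cuda')
with torch.cuda.amp.autocast():
    loss = model(x).sum()
scaler.scale(loss).backward()

# Poison one gradient to simulate a NaN grad_norm on this iteration.
next(model.parameters()).grad[0, 0] = float('nan')

before = [p.detach().clone() for p in model.parameters()]
scaler.step(optimizer)   # step is skipped because non-finite grads were found
scaler.update()          # the loss scale is reduced for the next iteration
unchanged = all(torch.equal(b, p) for b, p in zip(before, model.parameters()))
print(unchanged)  # True: the optimizer step was skipped
```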
Yes. In any case, if you are interested, by aligning the AMP sections, e.g. changing `with torch.cuda.amp.autocast(enabled=False):` to `True`, your full model is going to be compilable with PyTorch nightly (e.g. by decorating your whole model's forward in the wrapper): https://github.com/hkchengrex/Cutie/blob/2ac7ac21d048e7ff8b2b033a084e0e4ea7b1216c/cutie/model/train_wrapper.py#L13-L25
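For illustration, decorating the wrapper's forward could look roughly like this (a sketch with a hypothetical wrapper class, not the repo's actual train wrapper):

```python
import torch
import torch.nn as nn

class Wrapper(nn.Module):
    # Hypothetical stand-in for the repo's training wrapper.
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(16, 16)

    @torch.compile  # compile the whole forward so everything is traced together
    def forward(self, x):
        return self.backbone(x)

wrapper = Wrapper().cuda()
out = wrapper(torch.randn(2, 16, device='cuda'))  # first call compiles
```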
But then, with the grad_norm NaN frequency, I think it will be a problem without isolating the source.
Decorating this function with `@torch.compile` https://github.com/hkchengrex/Cutie/blob/6d93ec14ae54a4b27f91e7bde7cadf2ff62a8ee7/cutie/model/cutie.py#L164 is, on training, going to create a mismatch between `half` and `float`. Is AMP correctly applied?
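One quick way to see where half and float mix is to print the dtypes at the boundaries of the suspected region while running a single iteration (a debugging sketch; `report_dtypes` is just a hypothetical helper):

```python
import torch

def report_dtypes(name, *tensors):
    # Hypothetical helper: print the dtypes flowing through a region.
    print(name, [t.dtype for t in tensors if isinstance(t, torch.Tensor)])

with torch.cuda.amp.autocast():
    a = torch.randn(4, 8, device='cuda')
    w = torch.randn(8, 8, device='cuda')
    b = a @ w                                 # fp16 under autocast
    report_dtypes('after matmul', a, w, b)    # float32, float32, float16
```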