cinjon opened this issue 1 week ago
Yes, that is correct. You should let accelerate/the FSDP plugin handle everything unless you want "pure" bf16 training (which is not what you want here).
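As a rough illustration of that (not the poster's actual plugin; the tiny nn.Sequential just stands in for the real model, and the script is assumed to be launched with accelerate launch), a minimal sketch where bf16 is declared only through the FSDP mixed precision policy and the Accelerator / --mixed_precision flag, with no manual model.to(torch.bfloat16):

```python
import torch
import torch.nn as nn
from accelerate import Accelerator, FullyShardedDataParallelPlugin
from torch.distributed.fsdp import MixedPrecision

# Declare bf16 once, via the FSDP mixed precision policy; master weights stay in fp32.
fsdp_plugin = FullyShardedDataParallelPlugin(
    mixed_precision_policy=MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    ),
)

# mixed_precision="bf16" here mirrors passing --mixed_precision bf16 to accelerate launch.
accelerator = Accelerator(mixed_precision="bf16", fsdp_plugin=fsdp_plugin)

# Stand-in model: left in fp32, no model.to(torch.bfloat16) anywhere.
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# accelerate/FSDP take care of casting in forward/backward; the weights remain fp32 underneath.
model, optimizer = accelerator.prepare(model, optimizer)
```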
Thanks! How should I think about explicit casts in the huggingface repo then? For example, these in modeling_gemma:
https://github.com/huggingface/transformers/blob/1bd604d11c405dfb8b78bda4062d88fc75c17de0/src/transformers/models/gemma/modeling_gemma.py#L62
https://github.com/huggingface/transformers/blob/main/src/transformers/models/gemma/modeling_gemma.py#L1087
Upcasts are generally fine, as they are essentially no-ops. Since everything also runs under an autocast manager (with how it all works), new tensors will be created in half (or whatever) precision; it's just the original model weights that won't be. Notice how those casts are all done in the forward(), which happens under autocast.
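As a rough illustration (a toy RMSNorm-style layer, not the actual Gemma code): the explicit upcast inside forward() forces the variance math into fp32, while autocast still produces bf16 activations and the stored weights stay in fp32.

```python
import torch
import torch.nn as nn

# Toy RMSNorm-style layer (illustration only): the explicit .to(torch.float32) inside
# forward() keeps the reduction in fp32 even under an autocast(bf16) context.
class ToyRMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        in_dtype = x.dtype                        # typically bf16 under autocast
        x = x.to(torch.float32)                   # explicit upcast for numerical stability
        x = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return (self.weight * x).to(in_dtype)     # new tensor goes back to the autocast dtype

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(16, 16), ToyRMSNorm(16)).to(device)
x = torch.randn(2, 16, device=device)

with torch.autocast(device_type=device, dtype=torch.bfloat16):
    out = model(x)

print(out.dtype)               # torch.bfloat16 -- new activations come out in half precision
print(model[0].weight.dtype)   # torch.float32  -- the original weights were never cast
```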
Hi, I'm wondering how I should be thinking of the mixed precision policies of these three packages together. My plugin is below. It works, but I don't think we're doing things right with the mixed_precision_policy.
In particular, we're setting bf16 in the FSDP plugin, we're also setting
--mixed_precision bf16
in the accelerate command, and we're setting
self.model = model.to(torch.bfloat16)
in our train.py. I suspect that the last one is incorrect because it means we'll lose out on the fp32 precision. Is that right? Thanks!