adm995 opened this issue 1 year ago
Thanks for reporting, will look into it.
Half-precision (FP16) is not supported for most PyG layers; you need to set `precision=32` instead when initializing `pl.Trainer`.
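A minimal sketch of that suggestion (the `accelerator`/`devices` values here are illustrative, not taken from the original report):

```python
import torch
import pytorch_lightning as pl

# Suggested workaround: run in full (32-bit) precision, avoiding the fp16
# kernels that many PyG message-passing layers do not support.
trainer = pl.Trainer(
    accelerator="cuda" if torch.cuda.is_available() else "cpu",
    devices=1,
    precision=32,  # instead of precision=16
)
```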
Thank you so much @rusty1s .
Thank you for your reply @EdisonLeeeee .
I don't think that's the case. In fact, if you try the code I posted above, only removing the DeepSpeed part from the `pl.Trainer`, everything works fine.
from:

```python
trainer = pl.Trainer(
    accelerator="cuda" if torch.cuda.is_available() else "cpu",
    devices=1,
    precision=16,
    max_epochs=max_epochs,
    gradient_clip_val=1,  # if disable_gradient_clipping is False else 0
    strategy=DeepSpeedStrategy(
        stage=3,
        offload_optimizer=True,
        offload_parameters=True,
    ),
)
```
to:

```python
trainer = pl.Trainer(
    accelerator="cuda" if torch.cuda.is_available() else "cpu",
    devices=1,
    precision=16,
    max_epochs=max_epochs,
    gradient_clip_val=1,  # if disable_gradient_clipping is False else 0
)
```
Yep, removing the DeepSpeed strategy is also a solution. But you can see that it actually uses fp32 during training rather than fp16, even when setting `precision=16`.
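One way to check which dtype is actually in play (a generic PyTorch sketch, not from this thread): under native AMP autocast, which `precision=16` relies on, module parameters stay in fp32 while activations inside the autocast region are cast down, which may be what was meant by "actually uses fp32".

```python
import torch

lin = torch.nn.Linear(4, 4)  # parameters are created in fp32
x = torch.randn(2, 4)

# Using CPU autocast with bfloat16 here only so the sketch runs without a GPU;
# on CUDA with precision=16 the low-precision dtype would be float16.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = lin(x)

print(lin.weight.dtype)  # torch.float32 -- weights are never cast by autocast
print(y.dtype)           # torch.bfloat16 -- activations inside the region are
```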
@EdisonLeeeee what do you mean by "actually uses fp32 during training"?
Anyway, why does using fp16 without DeepSpeed raise no error, while just adding DeepSpeed throws `RuntimeError: expected scalar type`?
Sorry, I thought fp16 was not enabled by default if `strategy=None`. TBH, I am not so familiar with PyTorch Lightning, but I think the problem is attributable to PyTorch Lightning, as the DeepSpeed strategy is still in beta and might have some issues.
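For what it's worth, one plausible mechanism (an assumption, not confirmed in this thread): native AMP keeps module weights in fp32, whereas DeepSpeed's fp16 mode casts the parameters themselves to half, so any tensor left in fp32 then collides with fp16 weights. A minimal reproduction of that kind of dtype mismatch in plain PyTorch:

```python
import torch

# Cast the weights to half (as DeepSpeed's fp16 mode does to parameters),
# but feed an input that is still fp32:
lin = torch.nn.Linear(4, 4).half()
x = torch.randn(2, 4)  # fp32

try:
    lin(x)
except RuntimeError as e:
    # dtype mismatch between half weights and float input
    print("RuntimeError:", e)
```

The exact error message varies across PyTorch versions, but the failure mode matches the "expected scalar type" error reported above.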
Any updates on this?
🐛 Describe the bug
Hi, I have a problem integrating DeepSpeed and PyG. In particular, setting 32-bit precision on the Lightning Trainer on a single Quadro RTX 6000 GPU, everything works fine, something similar to the issue in #2866, I guess. But switching to 16-bit precision I get the following traceback (even when calling `torch.Tensor.half()` on the model, on the input, or both). Code to reproduce the error:
Environment
System info: