You're running pytorch-lightning master, right? Because the values supported by the latest stable release are ('64', '32', '16', 'bf16'). I'd change them to the current values and pin PTL to >=1.9.0,<2.0.
As I understand it, the *-mixed values are split before being passed to the Trainer, which indeed only accepts 64, 32, bf16 and 16. Mixed is used to configure AMP in the plugin, as per the latest stable release.
Maybe a sanity check on the device should be done though.
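Roughly, the split could look like this (just a sketch; the helper name and exact handling are assumptions, not kraken's actual code):

```python
def split_precision(value: str):
    """Split a CLI precision value like '16-mixed' into the plain
    precision accepted by the PL < 2.0 Trainer and a flag saying
    whether AMP should be configured on top of it."""
    if value.endswith('-mixed'):
        return value[:-len('-mixed')], True   # '16-mixed' -> ('16', True)
    return value, False                       # 'bf16'     -> ('bf16', False)

precision, use_amp = split_precision('16-mixed')  # ('16', True)
```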
The precision argument in the trainer only sets the numerical precision: precision=16 is just half precision. To use AMP you need to provide a plugin to the trainer object. That's why I added 16-mixed.
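Roughly the two configurations being contrasted, untested and assuming PL 1.9's NativeMixedPrecisionPlugin (the exact import path and signature may differ between versions):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins import NativeMixedPrecisionPlugin

# numerical precision only
trainer = Trainer(accelerator='gpu', devices=1, precision=16)

# explicitly configuring AMP through a precision plugin
trainer = Trainer(
    accelerator='gpu',
    devices=1,
    plugins=[NativeMixedPrecisionPlugin(precision=16, device='cuda')],
)
```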
> Maybe a sanity check on the device should be done though.
You are right. Should I limit mixed precision to Ada and later GPUs? Maybe PL already has a fallback mechanism.
No, I actually meant it would be great to check that the device used is CUDA (in case someone does something weird such as mixed on CPU).
Adding onto what I just said: actually mixed should be the default only if you use CUDA, no?
> Adding onto what I just said: actually mixed should be the default only if you use CUDA, no?
You are right. I will add it.
> The precision argument in the trainer only sets the numerical precision: precision=16 is just half precision. To use AMP you need to provide a plugin to the trainer object. That's why I added 16-mixed.
Not in stable. precision=16 is enough to enable AMP (pure half precision training isn't supported). Master/2.0 changes/will change the behavior to what you describe. See https://github.com/Lightning-AI/lightning/issues/9956#issuecomment-1207246337.
EDIT: Pure half precision training on master is still not possible. The semantics are explained here. There's no 16-true value.
By the way, mixed precision also works on CPU, so it can be left enabled without CUDA as well. The question is whether other accelerators like MPS support it, so it might be best to filter it out for any device that isn't cuda/cpu.
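Something along these lines would do it (a sketch only; the fallback behaviour is my assumption):

```python
def resolve_precision(requested: str, accelerator: str) -> str:
    """Drop the -mixed suffix on accelerators where we don't know
    mixed precision is supported (anything other than cuda/cpu)."""
    if requested.endswith('-mixed') and accelerator not in ('cuda', 'cpu'):
        return requested[:-len('-mixed')]   # e.g. '16-mixed' -> '16'
    return requested

resolve_precision('16-mixed', 'mps')    # -> '16'
resolve_precision('16-mixed', 'cuda')   # -> '16-mixed'
```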
I know, but the semantics you are referring to are only implemented in master: https://github.com/Lightning-AI/lightning/pull/16783#issue-1587848352. With PL<=1.9, if you set precision=16, CUDA will issue the following warning:
Using 16bit None Automatic Mixed Precision (AMP)
It sounds like true half-precision to me.
As soon as PL 2.0 gets released, we can get rid of: https://github.com/mittagessen/kraken/blob/8bf17e147d49be463afb26d041c53fe712f77849/kraken/lib/train.py#L80-L83
As I said, there's no true 16bit precision training in PTL, neither stable nor master. The plugin is completely unnecessary.
> As I said, there's no true 16bit precision training in PTL, neither stable nor master. The plugin is completely unnecessary.
So why does CUDA say "Using 16bit None Automatic Mixed Precision (AMP)"?
The blame for this specific print could be https://github.com/Lightning-AI/lightning/blame/5fafe10a2598bb455aa387f0f123b328b9be7177/src/pytorch_lightning/trainer/connectors/accelerator_connector.py#L745
We used to have to provide an AMP mode in Lightning, I think: https://pytorch-lightning.readthedocs.io/en/1.8.1/common/trainer.html#amp-backend
Try setting Trainer(amp_backend="native") just to see if this is the issue :)
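i.e. something like this (assuming PL <= 1.9, where amp_backend is still a Trainer argument; it was removed in 2.0):

```python
from pytorch_lightning import Trainer

# explicitly select the native AMP backend alongside 16-bit precision
trainer = Trainer(accelerator='gpu', devices=1, precision=16, amp_backend='native')
```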
The format string is f"Using 16bit {self._amp_type_flag} Automatic Mixed Precision (AMP)". The None refers to the AMP implementation flag that can optionally be given to the trainer (apex or native). It defaults to native if none is given. It isn't a warning, just an info message.
My bad, I thought it was a CUDA warning.
Any suggestions?
If you could add it to the pretraining command as well I'd merge it today.
Thanks!
Add a --precision option to ketos train and ketos segtrain to choose the numerical precision to use during training, as discussed in #451. It can be set to: '32', 'bf16', '16', '16-mixed', 'bf16-mixed'. The default is 16-mixed.
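For illustration, a rough sketch of what the option declaration could look like with click (the actual wiring in kraken's ketos CLI will differ):

```python
import click

@click.command('train')
@click.option('--precision',
              default='16-mixed',
              type=click.Choice(['32', 'bf16', '16', '16-mixed', 'bf16-mixed']),
              show_default=True,
              help='Numerical precision to use during training.')
def train(precision):
    # the chosen value is forwarded to the pytorch-lightning Trainer setup
    click.echo(f'training with precision={precision}')
```

Invocation would then look like ketos train --precision bf16-mixed ... and the same for ketos segtrain.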