mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

add --precision option to ketos train and ketos segtrain #453

Closed. colibrisson closed this 1 year ago.

colibrisson commented 1 year ago

Add a --precision option to ketos train and ketos segtrain to choose the numerical precision used during training, as discussed in #451. It can be set to '32', 'bf16', '16', '16-mixed', or 'bf16-mixed'. The default is 16-mixed.
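
For illustration, a minimal sketch of how such an option could be declared with Click (hypothetical, not the actual kraken CLI code):

```python
import click

# Illustrative sketch only, not the actual ketos command: a --precision
# option restricted to the values listed above, defaulting to '16-mixed'.
@click.command()
@click.option('--precision',
              default='16-mixed',
              type=click.Choice(['32', 'bf16', '16', '16-mixed', 'bf16-mixed']),
              help='Numerical precision to use for training.')
def train(precision):
    click.echo(f'training with precision={precision}')

if __name__ == '__main__':
    train()
```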

mittagessen commented 1 year ago

You're running pytorch-lightning master, right? Because the values supported by the latest stable release are ('64', '32', '16', 'bf16'). I'd change them to the current values and pin PTL to >=1.9.0,<2.0.

PonteIneptique commented 1 year ago

As I understand it, the *-mixed values are split before being passed to the Trainer, which indeed only accepts 64, 32, bf16 and 16.

Mixed is used to configure AMP in the plugin:

https://github.com/mittagessen/kraken/blob/8bf17e147d49be463afb26d041c53fe712f77849/kraken/lib/train.py#L80-L83

as per the last stable release.
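
Roughly, that splitting amounts to something like this (a hypothetical sketch, not the code behind the link):

```python
# Hypothetical sketch of the splitting described above, not the linked kraken
# code: the user-facing value is split into the numeric precision handed to
# the Trainer and a flag telling us whether to configure an AMP plugin.
def split_precision(value: str):
    base, _, suffix = value.partition('-')   # '16-mixed' -> ('16', '-', 'mixed')
    return base, suffix == 'mixed'

precision, use_amp = split_precision('16-mixed')   # ('16', True)
```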

Maybe a sanity check on the device should be done though.

colibrisson commented 1 year ago

The precision argument in Trainer only sets the numerical precision: precision=16 is just half precision. To use AMP you need to provide a plugin to the Trainer object. That's why I added 16-mixed.

colibrisson commented 1 year ago

> Maybe a sanity check on the device should be done though.

You are right. Should I limit mixed precision to Ada and later GPUs? Maybe PL already has a fallback mechanism.

PonteIneptique commented 1 year ago

No, I actually meant it would be great to check that the device used is CUDA (in case someone does something weird, such as mixed on CPU).

PonteIneptique commented 1 year ago

Adding to what I just said: actually, mixed should be the default only if you use CUDA, no?

colibrisson commented 1 year ago

> Adding to what I just said: actually, mixed should be the default only if you use CUDA, no?

You are right. I will add it.

mittagessen commented 1 year ago

> The precision argument in Trainer only sets the numerical precision: precision=16 is just half precision. To use AMP you need to provide a plugin to the Trainer object. That's why I added 16-mixed.

Not in stable. precision=16 is enough to enable AMP (pure half precision training isn't supported). Master/2.0 changes, or will change, the behavior to what you describe. See https://github.com/Lightning-AI/lightning/issues/9956#issuecomment-1207246337.

EDIT: Pure half precision training on master is still not possible. The semantics are explained here. There's no 16-true value.
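
To illustrate the difference (a sketch; each branch assumes the corresponding PTL version on a CUDA machine):

```python
import pytorch_lightning as pl

# Sketch of the version difference described above: with PTL 1.x (stable at
# the time) precision=16 already enables native AMP, while PTL 2.0 spells the
# same behaviour '16-mixed'. Pure half-precision training isn't offered either way.
if pl.__version__.startswith('1.'):
    trainer = pl.Trainer(accelerator='gpu', devices=1, precision=16)
else:
    trainer = pl.Trainer(accelerator='gpu', devices=1, precision='16-mixed')
```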

mittagessen commented 1 year ago

By the way, mixed precision also works on CPU, so it can be left enabled without CUDA as well. The question is whether other accelerators like MPS support it, so it might be best to filter it out for any device that isn't cuda/cpu.
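
That filter could look roughly like this (a hypothetical sketch, not code from this PR):

```python
# Hypothetical sketch of the fallback described above (not code from this PR):
# keep a *-mixed value only on devices known to handle AMP (cuda and cpu),
# otherwise fall back to full precision.
def resolve_precision(requested: str, device: str) -> str:
    if requested.endswith('-mixed') and device.split(':')[0] not in ('cuda', 'cpu'):
        return '32'
    return requested

assert resolve_precision('16-mixed', 'cuda:0') == '16-mixed'
assert resolve_precision('16-mixed', 'mps') == '32'
```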

colibrisson commented 1 year ago

I know, but the semantics you are referring to are only implemented in master: https://github.com/Lightning-AI/lightning/pull/16783#issue-1587848352. With PL <= 1.9, if you set precision=16, CUDA will issue the following warning:

Using 16bit None Automatic Mixed Precision (AMP)

It sounds like true half-precision to me.

colibrisson commented 1 year ago

As soon as PL 2.0 gets released, we can get rid of: https://github.com/mittagessen/kraken/blob/8bf17e147d49be463afb26d041c53fe712f77849/kraken/lib/train.py#L80-L83

mittagessen commented 1 year ago

As I said, there's no true 16bit precision training in PTL, neither stable nor master. The plugin is completely unnecessary.

colibrisson commented 1 year ago

> As I said, there's no true 16bit precision training in PTL, neither stable nor master. The plugin is completely unnecessary.

So why does CUDA say "Using 16bit None Automatic Mixed Precision (AMP)"?

PonteIneptique commented 1 year ago

The blame for this specific print could be https://github.com/Lightning-AI/lightning/blame/5fafe10a2598bb455aa387f0f123b328b9be7177/src/pytorch_lightning/trainer/connectors/accelerator_connector.py#L745

We used to have to provide an AMP backend in Lightning, I think: https://pytorch-lightning.readthedocs.io/en/1.8.1/common/trainer.html#amp-backend

Try setting Trainer(amp_backend="native") just to see if this is the issue :)

mittagessen commented 1 year ago

The format string is f"Using 16bit {self._amp_type_flag} Automatic Mixed Precision (AMP)". The None refers to the AMP implementation flag that can optionally be given to the trainer (apex or native). It defaults to native if none is given. It isn't a warning, just an info message.

colibrisson commented 1 year ago

My bad, I thought it was a CUDA warning.

colibrisson commented 1 year ago

Any suggestions?

mittagessen commented 1 year ago

If you could add it to the pretraining command as well, I'd merge it today.

mittagessen commented 1 year ago

Thanks!