Closed: PonteIneptique closed this issue 1 year ago.
Seems that pytorch-lightning has changed some logging behavior again.... I'll look into it.
@mittagessen I am really worried something happened in the dependencies.
I have been hitting my head against a wall for the past two days because our model dropped by ~8% on the dev set, plateauing at 82%. I rolled back the dependencies using a freeze from a previous install and got back to around 88% (still training).
It might have been only --reduceonplateau, but this feels wrong no matter what; I just don't know what's going on.
Was the previous install 4.2.0 or another dev branch? reduceonplateau shouldn't have that much of an impact (I personally don't use it because I never saw consistent improvements with it), but a broken pytorch/cuda/cudnn/pytorch-lightning installation can conceivably produce degradation like that.
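As a quick sanity check (a generic sketch, not kraken-specific), the installed torch/cuda/cudnn stack can be inspected before suspecting the training code itself:

```python
# Report the versions of the pytorch/cuda/cudnn stack to rule out a
# broken installation as the source of a training regression.
import torch

cuda_ok = torch.cuda.is_available()
report = {
    "torch": torch.__version__,
    "cuda_available": cuda_ok,
    "cuda_version": torch.version.cuda,
    # cudnn version is only meaningful when CUDA is usable
    "cudnn_version": torch.backends.cudnn.version() if cuda_ok else None,
}
print(report)
```

Comparing this output between the two environments would quickly confirm or exclude a binary-level mismatch.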
Requirements 1 (82%)
aiohttp==3.8.3
aiosignal==1.3.1
albumentations==1.3.0
async-timeout==4.0.2
attrs==22.2.0
certifi==2022.12.7
charset-normalizer==2.1.1
chocomufin==0.1.10
click==8.1.3
commonmark==0.9.1
coremltools==6.1
frozenlist==1.3.3
fsspec==2022.11.0
idna==3.4
imageio==2.24.0
importlib-resources==5.10.2
Jinja2==3.1.2
joblib==1.2.0
jsonschema==4.17.3
kraken==4.2.0
lightning-utilities==0.5.0
lxml==4.6.3
MarkupSafe==2.1.1
mpmath==1.2.1
mufidecode==0.1.0
multidict==6.0.4
networkx==3.0
numpy==1.23.1
opencv-python-headless==4.7.0.68
packaging==23.0
Pillow==9.4.0
pkgutil_resolve_name==1.3.10
protobuf==3.20.1
pyarrow==10.0.1
Pygments==2.14.0
pyrsistent==0.19.3
python-bidi==0.4.2
pytorch-lightning==1.8.6
PyWavelets==1.4.1
PyYAML==6.0
qudida==0.0.4
regex==2022.4.24
requests==2.28.1
rich==13.0.1
scikit-image==0.19.2
scikit-learn==1.2.1
scipy==1.10.0
shapely==2.0.0
six==1.16.0
sympy==1.11.1
tabulate==0.8.9
tensorboardX==2.5.1
threadpoolctl==3.1.0
tifffile==2022.10.10
torch==1.11.0+cu113
torchmetrics==0.11.0
torchvision==0.12.0+cu113
tqdm==4.61.1
typing_extensions==4.4.0
Unidecode==1.2.0
urllib3==1.26.13
yarl==1.8.2
zipp==3.11.0
Requirements 2 (88% and still going)
absl-py==1.2.0
aiohttp==3.8.1
aiosignal==1.2.0
albumentations==1.3.0
asttokens==2.0.8
async-timeout==4.0.2
attrs==22.1.0
backcall==0.2.0
bleach==5.0.1
cachetools==5.2.0
certifi==2022.9.24
cffi==1.15.1
charset-normalizer==2.1.1
click==8.1.3
commonmark==0.9.1
coremltools==5.2.0
cryptography==38.0.1
cycler==0.11.0
decorator==5.1.1
docutils==0.19
executing==1.1.1
fast-deskew==1.0
fonttools==4.37.1
frozenlist==1.3.1
fsspec==2022.8.2
google-auth==2.11.0
google-auth-oauthlib==0.4.6
grpcio==1.48.1
idna==3.4
imageio==2.21.2
importlib-metadata==4.12.0
importlib-resources==5.9.0
ipython==8.5.0
jaraco.classes==3.2.2
jedi==0.18.1
jeepney==0.8.0
Jinja2==3.1.2
joblib==1.2.0
jsonschema==4.15.0
keyring==23.9.1
kiwisolver==1.4.4
kraken==4.1.2
lxml==4.9.1
Markdown==3.4.1
MarkupSafe==2.1.1
matplotlib==3.5.3
matplotlib-inline==0.1.6
mean-average-precision==2021.4.26.0
more-itertools==8.14.0
mpmath==1.2.1
multidict==6.0.2
networkx==2.8.6
numpy==1.23.4
oauthlib==3.2.0
opencv-python==4.6.0.66
opencv-python-headless==4.6.0.66
packaging==21.3
pandas==1.4.4
parso==0.8.3
pexpect==4.8.0
pickleshare==0.7.5
Pillow==9.2.0
pkginfo==1.8.3
pkgutil_resolve_name==1.3.10
prompt-toolkit==3.0.31
protobuf==3.19.4
psutil==5.9.2
ptyprocess==0.7.0
pure-eval==0.2.2
pyarrow==9.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser==2.21
pyDeprecate==0.3.2
Pygments==2.13.0
pyparsing==3.0.9
pyrsistent==0.18.1
python-bidi==0.4.2
python-dateutil==2.8.2
pytorch-lightning==1.7.5
pytz==2022.2.1
PyWavelets==1.3.0
PyYAML==6.0
qudida==0.0.4
readme-renderer==37.1
regex==2022.8.17
requests==2.28.1
requests-oauthlib==1.3.1
requests-toolbelt==0.9.1
rfc3986==2.0.0
rich==12.5.1
rsa==4.9
scikit-image==0.19.2
scikit-learn==1.2.0
scipy==1.9.1
seaborn==0.12.0
SecretStorage==3.3.3
Shapely==1.8.4
six==1.16.0
stack-data==0.5.1
sympy==1.11.1
tabulate==0.8.10
tensorboard==2.10.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
thop==0.1.1.post2209072238
threadpoolctl==3.1.0
tifffile==2022.8.12
torch==1.11.0+cu113
torchmetrics==0.9.3
torchvision==0.12.0+cu113
tqdm==4.64.1
traitlets==5.4.0
twine==4.0.1
typing_extensions==4.4.0
urllib3==1.26.12
wcwidth==0.2.5
webencodings==0.5.1
Werkzeug==2.2.2
yarl==1.8.1
zipp==3.8.1
Since version 1.8, the lightning module's configure_optimizers method is supposed to return a dict:
https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#configure-optimizers
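For reference, a minimal sketch of that dict form (the model, metric name "val_accuracy", and hyperparameters here are placeholders, not kraken's actual ones):

```python
# Sketch of configure_optimizers returning the dict form, where the
# lr_scheduler sub-dict controls when and on what metric the scheduler
# is stepped.
import torch


def configure_optimizers(model: torch.nn.Module) -> dict:
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max")
    return {
        "optimizer": optimizer,
        "lr_scheduler": {
            "scheduler": scheduler,
            "monitor": "val_accuracy",  # assumed name of the logged metric
            "interval": "epoch",        # step once per epoch, not per batch
            "strict": False,            # don't abort if the metric is absent
        },
    }


cfg = configure_optimizers(torch.nn.Linear(4, 2))
```

The "interval" and "strict" keys are the ones relevant to the discussion below.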
On Fri, Jan 27, 2023 at 2:32 PM Thibault Clérice @.***> wrote:
Reopened #420.
@colibrisson Are you sure? The docs still say any of the six options is OK. And lightning usually outputs copious warnings for everything and anything, so I doubt that's the culprit.
@PonteIneptique The main difference is the pytorch-lightning installation. Can you try pinning it to the version that works well (it isn't exactly clear which of the listings is the 'good' one)? Any version above 1.6 (?) should be fine with any kraken commit, if I remember correctly.
I had exactly the same issue with another training script and, despite what the documentation says, I had to update the configure_optimizers method.
So the reason LRonPlateau fails (and some other schedulers might as well) is that the scheduler interval was moved from the default to step:
https://github.com/mittagessen/kraken/compare/4.1.2...master#diff-974b9dde6ef2001d713a0eaaeb9bb2bd5f0f9c67e10eb1fb6d3aa1ccebee9701L429-L430 (?)
Just changing step -> epoch fixes this.
Just changing this should probably not be done lightly though (or should it?), as it interacts with when the metric is logged:
either we move the logging of accuracy to the last step, and it would then (probably?) work, or we move the configuration of those LRSchedulers to epoch.
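As a toy illustration (plain Python, no lightning) of the failure mode: with interval "step" the monitored value is looked up on every batch, before validation has logged it, while with "epoch" the lookup happens after.

```python
# Simulate one training epoch: the validation metric only exists in the
# logged-metrics dict after the epoch ends, so a per-step scheduler
# lookup finds nothing.
def run_epoch(interval: str, n_batches: int = 3) -> list:
    metrics = {}  # metrics logged so far
    lookups = []  # values the scheduler would receive
    for _ in range(n_batches):
        if interval == "step":
            # per-batch stepping: metric not logged yet -> None
            lookups.append(metrics.get("val_accuracy"))
    metrics["val_accuracy"] = 0.88  # logged at validation epoch end
    if interval == "epoch":
        lookups.append(metrics.get("val_accuracy"))
    return lookups


print(run_epoch("step"))   # every lookup misses the metric
print(run_epoch("epoch"))  # single lookup, metric present
```

In real lightning a missing monitor value raises an error (in strict mode), which matches the crash described below.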
Did you wait until the end of the epoch when the scheduler is actually called?
I had to comment out lines 601 and 602 for the thing to work; not moving to epoch would simply crash training as soon as it reached a step end.
On another note, with this fix, training still doesn't seem to converge as fast as with 4.1.2 (but it is still training).
I just fixed it in master. Because of LR warmup we need to call lr_scheduler_step after each batch and then filter inside that function to determine whether the scheduler should actually be stepped. That caused a test for the existence of the validation metric in the metric dictionary to fail, which caused the crash. Fortunately, the monitoring code has a non-strict mode which skips the test.
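A toy sketch of that filtering logic (hypothetical names, not kraken's actual code): warmup advances on every batch, while the plateau-style scheduler only steps once the epoch is over and the validation metric exists.

```python
# Minimal model of "call lr_scheduler_step per batch, filter inside":
# warmup steps run unconditionally, plateau steps only at epoch end.
class WarmupThenPlateau:
    def __init__(self, warmup_steps: int = 2):
        self.warmup_steps = warmup_steps
        self.step_count = 0
        self.plateau_steps = 0

    def lr_scheduler_step(self, end_of_epoch: bool, metric=None) -> str:
        self.step_count += 1
        if self.step_count <= self.warmup_steps:
            return "warmup"  # per-batch LR warmup
        if end_of_epoch and metric is not None:
            self.plateau_steps += 1  # ReduceLROnPlateau-style step
            return "plateau"
        return "skip"  # mid-epoch call filtered out


sched = WarmupThenPlateau(warmup_steps=2)
results = [
    sched.lr_scheduler_step(end_of_epoch=False),       # warmup
    sched.lr_scheduler_step(end_of_epoch=False),       # warmup
    sched.lr_scheduler_step(end_of_epoch=False),       # skip
    sched.lr_scheduler_step(end_of_epoch=True, metric=0.88),  # plateau
]
```

The non-strict monitor mode mentioned above corresponds to the `metric is None` branch here quietly skipping rather than raising.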
@mittagessen In this context, maybe I am wrong, but it feels like the val_metric is computed at on_val_epoch_end, which would mean the validation value taken into account by the scheduler is the one from the previous epoch, since, if I remember well, on_val_step_end is called before on_val_epoch_end.
But I could be wrong :)
When using --lr reduceonplateau, it throws: