Closed: PonteIneptique closed this issue 1 year ago.
Seems that pytorch-lightning has changed some logging behavior again.... I'll look into it.
@mittagessen I am really worried something happened in the dependencies.
I have been hitting my head against a wall for the past two days because our model dropped by ~8% on the dev set, plateauing at 82%. I rolled back the dependencies using a freeze from a previous install and got back to around 88% (still training).
It might have been only --reduceonplateau, but this feels wrong no matter what; I just don't know what's going on.
Was the previous install 4.2.0 or another dev branch? reduceonplateau shouldn't have that much of an impact (I personally don't use it because I never saw consistent improvements with it), but a broken pytorch/cuda/cudnn/pytorch-lightning installation can conceivably produce degradation like that.
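As a quick sanity check (a generic sketch, not kraken-specific), the installed torch/cuda/cudnn stack can be inspected before suspecting the training code itself:

```python
# Report the versions of the pytorch/cuda/cudnn stack to rule out a
# broken installation as the source of a training regression.
import torch

cuda_ok = torch.cuda.is_available()
report = {
    "torch": torch.__version__,
    "cuda_available": cuda_ok,
    "cuda_version": torch.version.cuda,
    # cudnn version is only meaningful when CUDA is usable
    "cudnn_version": torch.backends.cudnn.version() if cuda_ok else None,
}
print(report)
```

Comparing this output between the two environments would quickly confirm or exclude a binary-level mismatch.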
Requirements 1 (82%)
aiohttp==3.8.3
aiosignal==1.3.1
albumentations==1.3.0
async-timeout==4.0.2
attrs==22.2.0
certifi==2022.12.7
charset-normalizer==2.1.1
chocomufin==0.1.10
click==8.1.3
commonmark==0.9.1
coremltools==6.1
frozenlist==1.3.3
fsspec==2022.11.0
idna==3.4
imageio==2.24.0
importlib-resources==5.10.2
Jinja2==3.1.2
joblib==1.2.0
jsonschema==4.17.3
kraken==4.2.0
lightning-utilities==0.5.0
lxml==4.6.3
MarkupSafe==2.1.1
mpmath==1.2.1
mufidecode==0.1.0
multidict==6.0.4
networkx==3.0
numpy==1.23.1
opencv-python-headless==4.7.0.68
packaging==23.0
Pillow==9.4.0
pkgutil_resolve_name==1.3.10
protobuf==3.20.1
pyarrow==10.0.1
Pygments==2.14.0
pyrsistent==0.19.3
python-bidi==0.4.2
pytorch-lightning==1.8.6
PyWavelets==1.4.1
PyYAML==6.0
qudida==0.0.4
regex==2022.4.24
requests==2.28.1
rich==13.0.1
scikit-image==0.19.2
scikit-learn==1.2.1
scipy==1.10.0
shapely==2.0.0
six==1.16.0
sympy==1.11.1
tabulate==0.8.9
tensorboardX==2.5.1
threadpoolctl==3.1.0
tifffile==2022.10.10
torch==1.11.0+cu113
torchmetrics==0.11.0
torchvision==0.12.0+cu113
tqdm==4.61.1
typing_extensions==4.4.0
Unidecode==1.2.0
urllib3==1.26.13
yarl==1.8.2
zipp==3.11.0
Requirements 2 (88% and still going)
absl-py==1.2.0
aiohttp==3.8.1
aiosignal==1.2.0
albumentations==1.3.0
asttokens==2.0.8
async-timeout==4.0.2
attrs==22.1.0
backcall==0.2.0
bleach==5.0.1
cachetools==5.2.0
certifi==2022.9.24
cffi==1.15.1
charset-normalizer==2.1.1
click==8.1.3
commonmark==0.9.1
coremltools==5.2.0
cryptography==38.0.1
cycler==0.11.0
decorator==5.1.1
docutils==0.19
executing==1.1.1
fast-deskew==1.0
fonttools==4.37.1
frozenlist==1.3.1
fsspec==2022.8.2
google-auth==2.11.0
google-auth-oauthlib==0.4.6
grpcio==1.48.1
idna==3.4
imageio==2.21.2
importlib-metadata==4.12.0
importlib-resources==5.9.0
ipython==8.5.0
jaraco.classes==3.2.2
jedi==0.18.1
jeepney==0.8.0
Jinja2==3.1.2
joblib==1.2.0
jsonschema==4.15.0
keyring==23.9.1
kiwisolver==1.4.4
kraken==4.1.2
lxml==4.9.1
Markdown==3.4.1
MarkupSafe==2.1.1
matplotlib==3.5.3
matplotlib-inline==0.1.6
mean-average-precision==2021.4.26.0
more-itertools==8.14.0
mpmath==1.2.1
multidict==6.0.2
networkx==2.8.6
numpy==1.23.4
oauthlib==3.2.0
opencv-python==4.6.0.66
opencv-python-headless==4.6.0.66
packaging==21.3
pandas==1.4.4
parso==0.8.3
pexpect==4.8.0
pickleshare==0.7.5
Pillow==9.2.0
pkginfo==1.8.3
pkgutil_resolve_name==1.3.10
prompt-toolkit==3.0.31
protobuf==3.19.4
psutil==5.9.2
ptyprocess==0.7.0
pure-eval==0.2.2
pyarrow==9.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser==2.21
pyDeprecate==0.3.2
Pygments==2.13.0
pyparsing==3.0.9
pyrsistent==0.18.1
python-bidi==0.4.2
python-dateutil==2.8.2
pytorch-lightning==1.7.5
pytz==2022.2.1
PyWavelets==1.3.0
PyYAML==6.0
qudida==0.0.4
readme-renderer==37.1
regex==2022.8.17
requests==2.28.1
requests-oauthlib==1.3.1
requests-toolbelt==0.9.1
rfc3986==2.0.0
rich==12.5.1
rsa==4.9
scikit-image==0.19.2
scikit-learn==1.2.0
scipy==1.9.1
seaborn==0.12.0
SecretStorage==3.3.3
Shapely==1.8.4
six==1.16.0
stack-data==0.5.1
sympy==1.11.1
tabulate==0.8.10
tensorboard==2.10.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
thop==0.1.1.post2209072238
threadpoolctl==3.1.0
tifffile==2022.8.12
torch==1.11.0+cu113
torchmetrics==0.9.3
torchvision==0.12.0+cu113
tqdm==4.64.1
traitlets==5.4.0
twine==4.0.1
typing_extensions==4.4.0
urllib3==1.26.12
wcwidth==0.2.5
webencodings==0.5.1
Werkzeug==2.2.2
yarl==1.8.1
zipp==3.8.1
Since version 1.8, the lightning module's configure_optimizers method is supposed to return a dict:
https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#configure-optimizers
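For reference, a minimal sketch of that dict form (the model, metric name "val_accuracy", and hyperparameters here are placeholders, not kraken's actual ones):

```python
# Sketch of configure_optimizers returning the dict form, where the
# lr_scheduler sub-dict controls when and on what metric the scheduler
# is stepped.
import torch


def configure_optimizers(model: torch.nn.Module) -> dict:
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max")
    return {
        "optimizer": optimizer,
        "lr_scheduler": {
            "scheduler": scheduler,
            "monitor": "val_accuracy",  # assumed name of the logged metric
            "interval": "epoch",        # step once per epoch, not per batch
            "strict": False,            # don't abort if the metric is absent
        },
    }


cfg = configure_optimizers(torch.nn.Linear(4, 2))
```

The "interval" and "strict" keys are the ones relevant to the discussion below.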
On Fri, Jan 27, 2023 at 2:32 PM Thibault Clérice @.***> wrote:
Reopened #420.
@colibrisson Are you sure? The docs still say any of the six options is OK. And lightning usually outputs copious warnings for everything and anything, so I doubt that's the culprit.
@PonteIneptique The main difference is the pytorch-lightning installation. Can you try pinning it to the version that works well (it isn't exactly clear which of the listings is the 'good' one)? Any version above 1.6 (?) should be fine with any kraken commit, if I remember correctly.
I had exactly the same issue with another training script and, despite what the documentation says, I had to update the configure_optimizers method.
So the reason LRonPlateau fails (and some other schedulers might as well) is that the scheduler interval was moved from the default to step:
https://github.com/mittagessen/kraken/compare/4.1.2...master#diff-974b9dde6ef2001d713a0eaaeb9bb2bd5f0f9c67e10eb1fb6d3aa1ccebee9701L429-L430 (?)
Just changing step -> epoch fixes this.
Just changing this should probably not be done lightly though (or should it?), as it interacts with when the metric is logged:
either we move the logging of accuracy to the last step, and it would then (probably?) work, or we move the configuration of those LRSchedulers to epoch.
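As a toy illustration (plain Python, no lightning) of the failure mode: with interval "step" the monitored value is looked up on every batch, before validation has logged it, while with "epoch" the lookup happens after.

```python
# Simulate one training epoch: the validation metric only exists in the
# logged-metrics dict after the epoch ends, so a per-step scheduler
# lookup finds nothing.
def run_epoch(interval: str, n_batches: int = 3) -> list:
    metrics = {}  # metrics logged so far
    lookups = []  # values the scheduler would receive
    for _ in range(n_batches):
        if interval == "step":
            # per-batch stepping: metric not logged yet -> None
            lookups.append(metrics.get("val_accuracy"))
    metrics["val_accuracy"] = 0.88  # logged at validation epoch end
    if interval == "epoch":
        lookups.append(metrics.get("val_accuracy"))
    return lookups


print(run_epoch("step"))   # every lookup misses the metric
print(run_epoch("epoch"))  # single lookup, metric present
```

In real lightning a missing monitor value raises an error (in strict mode), which matches the crash described below.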
Did you wait until the end of the epoch when the scheduler is actually called?
I had to comment out lines 601 and 602 for the thing to work; not moving to epoch would simply crash training as soon as it reached a step end.
On another note, with this fix, training still doesn't seem to converge as fast as with 4.1.2 (but it is still training).
I just fixed it in master. Because of LR warmup we need to call lr_scheduler_step after each batch and then filter inside that function to determine whether the scheduler should actually be stepped. That caused a test for the existence of the validation metric in the metric dictionary to fail, which caused the crash. Fortunately, the monitoring code has a non-strict mode which skips the test.
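A toy sketch of that filtering logic (hypothetical names, not kraken's actual code): warmup advances on every batch, while the plateau-style scheduler only steps once the epoch is over and the validation metric exists.

```python
# Minimal model of "call lr_scheduler_step per batch, filter inside":
# warmup steps run unconditionally, plateau steps only at epoch end.
class WarmupThenPlateau:
    def __init__(self, warmup_steps: int = 2):
        self.warmup_steps = warmup_steps
        self.step_count = 0
        self.plateau_steps = 0

    def lr_scheduler_step(self, end_of_epoch: bool, metric=None) -> str:
        self.step_count += 1
        if self.step_count <= self.warmup_steps:
            return "warmup"  # per-batch LR warmup
        if end_of_epoch and metric is not None:
            self.plateau_steps += 1  # ReduceLROnPlateau-style step
            return "plateau"
        return "skip"  # mid-epoch call filtered out


sched = WarmupThenPlateau(warmup_steps=2)
results = [
    sched.lr_scheduler_step(end_of_epoch=False),       # warmup
    sched.lr_scheduler_step(end_of_epoch=False),       # warmup
    sched.lr_scheduler_step(end_of_epoch=False),       # skip
    sched.lr_scheduler_step(end_of_epoch=True, metric=0.88),  # plateau
]
```

The non-strict monitor mode mentioned above corresponds to the `metric is None` branch here quietly skipping rather than raising.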
@mittagessen In this context, maybe I am wrong, but it feels like the val_metric is computed at on_val_epoch_end, which would mean the validation value taken into account by the scheduler is the one from the previous epoch, since, if I remember well, on_val_step_end is called before on_val_epoch_end.
But I could be wrong :)
When using --lr reduceonplateau, it throws: