Hi,
Prince Canuma here, a Data Scientist at Neptune.ai.
I think the exception makes sense: we have to let you know that the file you tracked wasn't uploaded.
Please help me with the following:
- A minimal reproducible example.
- Are you using a temporary file or directory to store your checkpoints?
I think the situation can be described as below: `save_top_k` makes the trainer keep only the top N checkpoints. Maybe it's a design choice whether the logger tracks only those top N checkpoints or all checkpoints generated during the whole training run.
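As a rough sketch of that behaviour (the metric name and values are only examples, not taken from the original configuration):

from pytorch_lightning.callbacks import ModelCheckpoint

# Keep only the 2 best checkpoints by validation loss; older checkpoint files are
# deleted by the trainer, which is what can race with the logger's pending uploads.
checkpoint_callback = ModelCheckpoint(monitor="val_loss", mode="min", save_top_k=2)

# save_top_k=0 disables metric-based checkpoint saving altogether.
no_checkpoint_callback = ModelCheckpoint(save_top_k=0)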
Same issue here.
# --------- pytorch --------- #
torch==1.11.0
torchvision==0.12.0
pytorch-lightning==1.6.0
torchmetrics==0.7.3
# --------- hydra --------- #
hydra-core==1.1.2
hydra-colorlog==1.1.0
hydra-optuna-sweeper==1.1.2
# --------- loggers --------- #
neptune-client==0.16.0
After changing `save_top_k: 1 # save k best models (determined by above metric)` to `save_top_k: 0` in the `pytorch_lightning.callbacks.ModelCheckpoint` callback, the error no longer occurs.
@Blaizzy forwarding this issue to you.
Hi @fanqiehc and @bartoszptak,
Sorry for the late response!
Thank you very much for the extra details about the issue!
I'm currently reproducing the error so I can have a better understanding of what is really causing the issue. So far, I confirmed what @bartoszptak reported here:
After changing `save_top_k: 1 # save k best models (determined by above metric)` to `save_top_k: 0` in the `pytorch_lightning.callbacks.ModelCheckpoint` callback, the error no longer occurs.
I ran the code in a notebook, so there was no error, but I noticed that no checkpoints got logged.
Let me dig deeper and come back with my findings!
I managed to replicate the issue. Environment:
pytorch-lightning==1.6.0 neptune-client==0.16.0
Basically, if you have connection issues while logging metadata to Neptune, the neptune-client should either try to reconnect or switch to offline mode. Therefore, in your case, where model checkpoint files are being updated/deleted while Neptune is still trying to upload the previous version, it should only give you a warning, continue training, and upload future files (or the last one).
My traceback:
(neptune_test_env) prince_canuma@Princes-MacBook-Air Downloads % python "./neptune_pytorch_lightning (1).py"
https://app.neptune.ai/common/pytorch-lightning-integration/e/PTL-279
Epoch 6:  61%| 1144/1875 [01:19<00:51, 14.33it/s, loss=0.758, v_num=-279]
Experiencing connection interruptions. Will try to reestablish communication with Neptune. Internal exception was: RequestsFutureAdapterConnectionError
Epoch 6:  82%| 1533/1875 [01:22<00:18, 18.65it/s, loss=0.708, v_num=-279]
Communication with Neptune restored!
Epoch 6:  90%| 1680/1875 [01:23<00:09, 20.22it/s, loss=0.75, v_num=-279]
Experiencing connection interruptions. Will try to reestablish communication with Neptune. Internal exception was: RequestsFutureAdapterConnectionError
Epoch 7:  93%| 1742/1875 [01:36<00:07, 18.12it/s, loss=0.672, v_num=-279]
Error occurred during asynchronous operation processing: Cannot upload file /Users/prince_canuma/Downloads/my_model/checkpoints/epoch=04.ckpt: Path not found or is a not a file.
Error occurred during asynchronous operation processing: Cannot delete training/model/checkpoints/epoch=03: Attribute does not exist
Epoch 8:  15%| 279/1875 [01:38<09:25, 2.82it/s, loss=0.84, v_num=-279]
Error occurred during asynchronous operation processing: Cannot upload file /Users/prince_canuma/Downloads/my_model/checkpoints/epoch=05.ckpt: Path not found or is a not a file.
Error occurred during asynchronous operation processing: Cannot delete training/model/checkpoints/epoch=04: Attribute does not exist
Epoch 8:  47%| 880/1875 [01:42<01:56, 8.55it/s, loss=0.723, v_num=-279]
Error occurred during asynchronous operation processing: Cannot delete training/model/checkpoints/epoch=05: Attribute does not exist
Epoch 19: 100%| 1875/1875 [03:56<00:00, 7.91it/s, loss=0.645, v_num=-279]
Shutting down background jobs, please wait a moment... Done!
Waiting for the remaining 62 operations to synchronize with Neptune. Do not kill this process.
All 62 operations synced, thanks for waiting!
Solution
If you are having connection issues, I would avoid using the `save_top_k` feature, because if a file gets deleted before neptune-client has a chance to upload it, that will cause an error. Although the error will not stop your run, it will most definitely result in lost model checkpoints.
In the event of connection issues, you can disable `save_top_k` by skipping the argument, and then try one of the following:
1. Save a checkpoint manually and upload it yourself:
trainer.save_checkpoint("example.ckpt")
run["checkpoints/best"].upload("example.ckpt")
2. Save the best model:
`checkpoint_callback = ModelCheckpoint(monitor='val_loss', dirpath='my/path/')`
3. `every_n_epochs`
`checkpoint_callback = ModelCheckpoint(monitor="val_loss", every_n_epochs=1)`
4. `save_last` if you want the best
`checkpoint_callback = ModelCheckpoint(monitor="val_loss", save_last=True)`
5. `save_on_train_epoch_end`
`checkpoint_callback = ModelCheckpoint(monitor="val_loss", save_on_train_epoch_end=True)`
Docs: https://pytorch-lightning.readthedocs.io/en/stable/common/checkpointing.html
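For reference, here is a minimal sketch that puts the options above together (the project name, paths, and monitored metric are placeholders, and it assumes pytorch-lightning 1.6.x with neptune-client 0.16.x):

import neptune.new as neptune
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.loggers import NeptuneLogger

run = neptune.init(project="workspace/project")  # placeholder project name
neptune_logger = NeptuneLogger(run=run)

# Options 2-4: keep the best (and last) checkpoint without a large save_top_k
checkpoint_callback = ModelCheckpoint(
    monitor="val_loss",  # must be a metric logged by your LightningModule
    dirpath="my/path/",
    every_n_epochs=1,
    save_last=True,
)

trainer = Trainer(logger=neptune_logger, callbacks=[checkpoint_callback], max_epochs=3)
# trainer.fit(model, datamodule=datamodule)  # model/datamodule come from your own code

# Option 1: save a checkpoint manually and upload it yourself
trainer.save_checkpoint("example.ckpt")
run["checkpoints/best"].upload("example.ckpt")

Option 5 (`save_on_train_epoch_end=True`) can be added to the same callback if you prefer the save to happen at the end of the training epoch.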
@fanqiehc would you mind telling me in which environment you are running your script, and any other details you think could help me pinpoint the issue?
According to my tests, neptune-client 0.16.0 should not stop the way it did for you.
Hey @fanqiehc!
Just checking in to see if my answer helped you, or if you still need help with this?
Hey there,
I'll be closing this issue as it's stale.
Feel free to reach out with any other questions, issues, or feature requests; we are happy to help!
I have this issue with top_k turned off when I have bad internet, and with top_k=2 when I have good internet. I think the fact that I'm training a very small model causes neptune to struggle to catch up with the fast epochs.
To solve this problem, I ended up disabling Neptune's checkpointing:
run = neptune.init(tags=[])
logger = NeptuneLogger(run=run, log_model_checkpoints=False)
Hi @alvitawa
I explain the issue here: https://github.com/neptune-ai/neptune-client/issues/884#issuecomment-1121536603
The solution is here: https://github.com/neptune-ai/neptune-client/issues/884#issuecomment-1121607944
Disabling checkpointing is an option, but there are better alternatives that don't require you to lose your checkpoints.
Your solutions don't work for me, because I want to keep the best model and Neptune errors out even when top_k=1. Note that I didn't disable checkpointing (as done by PyTorch Lightning) but Neptune's checkpointing, which prevents Neptune from crashing. A more complete example of what I do:
run = neptune.init(tags=[])
logger = NeptuneLogger(run=run, log_model_checkpoints=False)
checkpoint_callback = pl.callbacks.ModelCheckpoint(dirpath='data/models/checkpoints', monitor='val/loss',
                                                   mode='min', save_top_k=1)
trainer = pl.Trainer(logger=logger, max_epochs=cfg.dl.epochs, gpus=1, accelerator=device,
                     log_every_n_steps=1, check_val_every_n_epoch=5,
                     callbacks=[checkpoint_callback])
As an aside: Why doesn't neptune just cancel an upload if the checkpoint file is removed, instead of crashing completely? Like I said, this doesn't only happen when the connection is bad but also when training happens too fast.
@alvitawa the problem happens because of using `save_top_k`.
You can save the best model using any of the options in my solution.
I have tested and validated them; please read the solution again: https://github.com/neptune-ai/neptune-client/issues/884#issuecomment-1121607944
You are right, @alvitawa, it should give a warning and continue instead of crashing. That happens because, when training goes too fast, the checkpoint file gets overwritten and the version that the neptune-client wants to upload stops existing.
This is a known issue and we have it in our backlog.
As I mentioned here:
https://github.com/neptune-ai/neptune-client/issues/884#issuecomment-1121536603
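To illustrate the warn-and-continue behaviour being described here (only a sketch of the idea, not the actual neptune-client internals; upload_if_exists is a hypothetical helper):

import os
import warnings

def upload_if_exists(run, attribute_path, file_path):
    # Skip with a warning instead of erroring when the trainer has already
    # overwritten or deleted the checkpoint file.
    if not os.path.isfile(file_path):
        warnings.warn(f"Skipping upload of {file_path}: file no longer exists")
        return
    # The upload itself is asynchronous, so the race can still occur after this
    # check; that is why the stream-based workaround further down copies the file first.
    run[attribute_path].upload(file_path)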
Re-opening
Upon closer inspection, I noticed that this bug arises when running scripts in GPU environments.
We've dropped the limit on in-memory uploaded file size in neptune 0.16.12, so here's a nice workaround.
The sample code below copies the original file to a location inside the .neptune directory, and the client will remove the copied entry once the upload is finished.
You can even manually delete the checkpoint right after the upload is triggered asynchronously.
Since streams are used, this method works even for files larger than the available operating memory.
# requires: import os; from neptune.new.types import File  (neptune-client 0.16.x import path)
checkpoint_path = "str file path of the checkpoint"
with open(checkpoint_path, "rb") as f:
    # a copy of the original file is stored on disk and will be removed after the asynchronous upload is finished
    self._run[path] = File.from_stream(f)
os.remove(checkpoint_path)
The code should be used in our internal Pytorch-Lightning repository.
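If you want to apply the same pattern in your own training script rather than waiting for the integration change, a minimal sketch could look like the following (assuming neptune-client>=0.16.12; the project name, attribute path, and checkpoint path are placeholders):

import os
import neptune.new as neptune
from neptune.new.types import File

run = neptune.init(project="workspace/project")  # placeholder project name
checkpoint_path = "my_model/checkpoints/epoch=04.ckpt"  # placeholder path
with open(checkpoint_path, "rb") as f:
    # from_stream copies the content to a temporary entry under .neptune,
    # so the original checkpoint can be deleted right after this call
    run["training/model/checkpoints/epoch=04"] = File.from_stream(f)
os.remove(checkpoint_path)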
Hello @fanqiehc, @bartoszptak
Sorry for the late update here.
This should have been fixed in the latest release of Lightning (v2.0.4).
Additionally, Lightning 2.0.4 also includes support for Neptune 1.0+.
I am including our neptune 1.0 upgrade guide to help you upgrade from neptune-client<1.0 to neptune>=1.0.
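For reference, a minimal sketch of the logger setup after the upgrade (assuming neptune>=1.0 and lightning>=2.0.4; the project name is a placeholder):

import neptune
from lightning.pytorch import Trainer
from lightning.pytorch.callbacks import ModelCheckpoint
from lightning.pytorch.loggers import NeptuneLogger

run = neptune.init_run(project="workspace/project")  # neptune>=1.0 entry point
logger = NeptuneLogger(run=run)
checkpoint_callback = ModelCheckpoint(monitor="val_loss", mode="min", save_top_k=1)
trainer = Trainer(logger=logger, callbacks=[checkpoint_callback], max_epochs=3)
# trainer.fit(model, datamodule=datamodule)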
Please let me know if this solves your issue.
Describe the bug
The training process was killed because the Neptune client background thread threw an exception.
Reproduction
Call `trainer.fit` and wait.
Expected behavior
Should not raise an exception.
Traceback
Environment
Ubuntu 20.04, Python 3.8.12
Additional context