neptune-ai / neptune-client

πŸ“˜ The experiment tracker for foundation model training
https://neptune.ai
Apache License 2.0

PyTorch Lightning: error when logging model checkpoints with slow connection: FileNotFoundError when uploading model checkpoint #884

Closed fanqiehc closed 1 year ago

fanqiehc commented 2 years ago

Describe the bug

The training process was killed because the Neptune client background thread threw an exception.

Reproduction

  1. Set save_top_k=10 in ModelCheckpoint (a minimal sketch of this setup follows the steps).
  2. Call trainer.fit and wait.
  3. A checkpoint was deleted before its upload finished (due to the poor network connection).
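
A minimal sketch of such a setup, assuming a toy LightningModule and a placeholder project name (not the reporter's actual script):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.loggers import NeptuneLogger


class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", torch.nn.functional.mse_loss(self.layer(x), y))

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


def loader():
    x, y = torch.randn(256, 8), torch.randn(256, 1)
    return DataLoader(TensorDataset(x, y), batch_size=32)


# NeptuneLogger reads NEPTUNE_API_TOKEN from the environment; "workspace/project" is a placeholder.
logger = NeptuneLogger(project="workspace/project")

# save_top_k=10 keeps only the 10 best checkpoints; older files are deleted from disk,
# which can race with the background upload on a slow connection.
checkpoint_callback = ModelCheckpoint(monitor="val_loss", save_top_k=10)

trainer = pl.Trainer(logger=logger, callbacks=[checkpoint_callback], max_epochs=20)
trainer.fit(TinyModel(), train_dataloaders=loader(), val_dataloaders=loader())
```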

Expected behavior

The client should not raise an exception that kills the training process.

Traceback

Unexpected error occurred in Neptune background thread: Killing Neptune asynchronous thread. All data is safe on disk and can be later synced manually using `neptune sync` command.
Exception in thread NeptuneAsyncOpProcessor:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.8/site-packages/neptune/new/internal/threading/daemon.py", line 53, in run
    self.work()
  File "/opt/conda/lib/python3.8/site-packages/neptune/new/internal/operation_processors/async_operation_processor.py", line 214, in work
    self.process_batch(batch, version)
  File "/opt/conda/lib/python3.8/site-packages/neptune/new/internal/threading/daemon.py", line 77, in wrapper
    result = func(self_, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/neptune/new/internal/operation_processors/async_operation_processor.py", line 227, in process_batch
    processed_count, errors = self._processor._backend.execute_operations(
  File "/opt/conda/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_neptune_backend.py", line 481, in execute_operations
    self._execute_upload_operations_with_400_retry(
  File "/opt/conda/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_neptune_backend.py", line 580, in _execute_upload_operations_with_400_retry
    return self._execute_upload_operations(
  File "/opt/conda/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_neptune_backend.py", line 535, in _execute_upload_operations
    upload_errors = upload_file_attribute(
  File "/opt/conda/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_file_operations.py", line 108, in upload_file_attribute
    _multichunk_upload_with_retry(
  File "/opt/conda/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_file_operations.py", line 319, in _multichunk_upload_with_retry
    return _multichunk_upload(
  File "/opt/conda/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_file_operations.py", line 369, in _multichunk_upload
    for idx, chunk in enumerate(chunker.generate()):
  File "/opt/conda/lib/python3.8/site-packages/neptune/internal/storage/datastream.py", line 70, in generate
    if last_change < os.stat(self._filename).st_mtime:
FileNotFoundError: [Errno 2] No such file or directory: '/work/initseg/models/unetr-epoch01-val_loss2.44.ckpt'

Environment

pytorch-lightning             1.6.1
torch                         1.10.0a0+0aef44c
neptune-client                0.16.0

Ubuntu 20.04, Python 3.8.12

Additional context

Blaizzy commented 2 years ago

Hi

Prince Canuma here, a Data scientist at Neptune.ai

I think the exception makes sense: we have to let you know that the file you tracked wasn't uploaded.

Please help me with the following:

fanqiehc commented 2 years ago

I think the situation can be described as below:

Maybe it's a design choice: should the logger track only the top N checkpoints, or all checkpoints generated during the whole training process?

bartoszptak commented 2 years ago

Same issue here.

# --------- pytorch --------- #
torch==1.11.0
torchvision==0.12.0
pytorch-lightning==1.6.0
torchmetrics==0.7.3

# --------- hydra --------- #
hydra-core==1.1.2
hydra-colorlog==1.1.0
hydra-optuna-sweeper==1.1.2

# --------- loggers --------- #
neptune-client==0.16.0
bartoszptak commented 2 years ago

After changing `save_top_k: 1 # save k best models (determined by above metric)` to `save_top_k: 0` in the `pytorch_lightning.callbacks.ModelCheckpoint` config, the error no longer occurs.
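
For reference, the equivalent change expressed directly in Python (assuming the callback is built in code rather than through a Hydra config) is simply:

```python
from pytorch_lightning.callbacks import ModelCheckpoint

# save_top_k=0 disables saving the k best checkpoints entirely,
# so there is no checkpoint file for the Neptune logger to upload (or lose).
checkpoint_callback = ModelCheckpoint(monitor="val_loss", save_top_k=0)
```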

kamil-kaczmarek commented 2 years ago

@Blaizzy forwarding this issue to you.

Blaizzy commented 2 years ago

Hi @fanqiehc and @bartoszptak,

Sorry for the late response!

Thank you very much for the extra details about the issue!

I'm currently reproducing the error so I can have a better understanding of what is really causing the issue. So far, I confirmed what @bartoszptak reported here:

> After changing `save_top_k: 1 # save k best models (determined by above metric)` to `save_top_k: 0` in the `pytorch_lightning.callbacks.ModelCheckpoint` config, the error no longer occurs.

I ran the code in a notebook, so there was no error, but I noticed that no checkpoints got logged.

Let me dig deeper and come back with my findings!

Blaizzy commented 2 years ago

I managed to replicate the issue. Environment:

pytorch-lightning==1.6.0
neptune-client==0.16.0

Basically, if you have connection issues while logging metadata to Neptune, the neptune-client should either try to reconnect or switch to offline mode. Therefore, in your case, where model checkpoint files are being updated/deleted while Neptune is still trying to upload the previous version, it should only give you a warning, continue training, and upload the future files (or the last one).
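
As a stopgap when the connection is flaky, one option (a sketch, assuming neptune-client 0.16.x) is to create the run in offline mode and push the locally stored data later with the `neptune sync` CLI mentioned in the traceback above:

```python
import neptune.new as neptune
from pytorch_lightning.loggers import NeptuneLogger

# mode="offline" writes all metadata to the local .neptune folder instead of the network.
run = neptune.init(mode="offline")
logger = NeptuneLogger(run=run)

# ... pass `logger` to the Trainer and train as usual, then sync once the connection is stable:
#     neptune sync
```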

My traceback:

(neptune_test_env) prince_canuma@Princes-MacBook-Air Downloads % python "./neptune_pytorch_lightning (1).py"
https://app.neptune.ai/common/pytorch-lightning-integration/e/PTL-279
Epoch 6: 61%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 1144/1875 [01:19<00:51, 14.33it/s, loss=0.758, v_num=-279]
Experiencing connection interruptions. Will try to reestablish communication with Neptune. Internal exception was: RequestsFutureAdapterConnectionError
Epoch 6: 82%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 1533/1875 [01:22<00:18, 18.65it/s, loss=0.708, v_num=-279]
Communication with Neptune restored!
Epoch 6: 90%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 1680/1875 [01:23<00:09, 20.22it/s, loss=0.75, v_num=-279]
Experiencing connection interruptions. Will try to reestablish communication with Neptune. Internal exception was: RequestsFutureAdapterConnectionError
Epoch 7: 93%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 1742/1875 [01:36<00:07, 18.12it/s, loss=0.672, v_num=-279]
Error occurred during asynchronous operation processing: Cannot upload file /Users/prince_canuma/Downloads/my_model/checkpoints/epoch=04.ckpt: Path not found or is a not a file.
Error occurred during asynchronous operation processing: Cannot delete training/model/checkpoints/epoch=03: Attribute does not exist
Epoch 8: 15%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 279/1875 [01:38<09:25, 2.82it/s, loss=0.84, v_num=-279]
Error occurred during asynchronous operation processing: Cannot upload file /Users/prince_canuma/Downloads/my_model/checkpoints/epoch=05.ckpt: Path not found or is a not a file.
Error occurred during asynchronous operation processing: Cannot delete training/model/checkpoints/epoch=04: Attribute does not exist
Epoch 8: 47%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 880/1875 [01:42<01:56, 8.55it/s, loss=0.723, v_num=-279]
Error occurred during asynchronous operation processing: Cannot delete training/model/checkpoints/epoch=05: Attribute does not exist
Epoch 19: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1875/1875 [03:56<00:00, 7.91it/s, loss=0.645, v_num=-279]
Shutting down background jobs, please wait a moment... Done!
Waiting for the remaining 62 operations to synchronize with Neptune. Do not kill this process.
All 62 operations synced, thanks for waiting!

Blaizzy commented 2 years ago

Solution βœ…

If you are having connection issues, I would avoid using the save_top_k feature, because if a checkpoint file gets deleted before neptune-client has a chance to upload it, that causes an error. Although the error will not stop your run, it will most definitely result in lost model checkpoints.

In the event of connection issues, you can disable save_top_k by skipping the argument, and then try one of the following (a combined sketch of option 1 follows the list):

  1. Manual saving after the training is done:

     `trainer.save_checkpoint("example.ckpt")`
     `run["checkpoints/best"].upload("example.ckpt")`

  2. Save the best model:

     `checkpoint_callback = ModelCheckpoint(monitor='val_loss', dirpath='my/path/')`

  3. `every_n_epochs`:

     `checkpoint_callback = ModelCheckpoint(monitor="val_loss", every_n_epochs=1)`

  4. `save_last` if you also want to keep the last checkpoint:

     `checkpoint_callback = ModelCheckpoint(monitor="val_loss", save_last=True)`

  5. `save_on_train_epoch_end`:

     `checkpoint_callback = ModelCheckpoint(monitor="val_loss", save_on_train_epoch_end=True)`

Docs: https://pytorch-lightning.readthedocs.io/en/stable/common/checkpointing.html
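
Putting option 1 together with the logger wiring used elsewhere in this thread, here is a minimal combined sketch (the project name and `model` are placeholders, not from the original report):

```python
import neptune.new as neptune
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import NeptuneLogger

run = neptune.init(project="workspace/project")  # placeholder project; token read from NEPTUNE_API_TOKEN
logger = NeptuneLogger(run=run, log_model_checkpoints=False)  # skip uploading intermediate checkpoints

trainer = Trainer(logger=logger, max_epochs=20)
trainer.fit(model)  # `model` is a placeholder LightningModule

# Save and upload a single checkpoint only after training has finished,
# so the file cannot be deleted while an upload is still in flight.
trainer.save_checkpoint("example.ckpt")
run["checkpoints/best"].upload("example.ckpt")
```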
Blaizzy commented 2 years ago

@fanqiehc would you mind telling me which environment you are running your script in, and any other details you think could help me pinpoint the issue?

According to my tests, neptune-client 0.16.0 should not stop the way it did for you πŸ€“

Blaizzy commented 2 years ago

Hey @fanqiehc! πŸ˜ƒ

Just checking in to see if my answer helped you or if you still need help with this.

Blaizzy commented 2 years ago

Hey there,

I'll be closing this issue as it's stale.

Feel free to reach out with any other questions, issues or feature requests, we are happy to help!

alvitawa commented 2 years ago

I have this issue with top_k turned off when I have a bad internet connection, and with top_k=2 when I have a good connection. I think the fact that I'm training a very small model causes Neptune to struggle to keep up with the fast epochs.

For me, the solution was to disable Neptune's checkpoint logging:

run = neptune.init(tags=[])
logger = NeptuneLogger(run=run, log_model_checkpoints=False)
Blaizzy commented 2 years ago

Hi @alvitawa

I explain the issue here: https://github.com/neptune-ai/neptune-client/issues/884#issuecomment-1121536603

The solution βœ… here: https://github.com/neptune-ai/neptune-client/issues/884#issuecomment-1121607944

Disabling checkpointing is an option, but there are better alternatives that don't require you to lose your checkpoints. πŸ‘†

alvitawa commented 2 years ago

Your solutions don't work for me, because I want to keep the best model, and Neptune errors out even with top_k=1. Note that I didn't disable checkpointing (that is still done by PyTorch Lightning), only Neptune's checkpoint logging, which prevents Neptune from crashing. A more complete example of what I do:

run = neptune.init(tags=[])
logger = NeptuneLogger(run=run, log_model_checkpoints=False)
checkpoint_callback = pl.callbacks.ModelCheckpoint(dirpath='data/models/checkpoints', monitor='val/loss',
                                                   mode='min', save_top_k=1)
trainer = pl.Trainer(logger=logger, max_epochs=cfg.dl.epochs, gpus=1, accelerator=device,
                     log_every_n_steps=1, check_val_every_n_epoch=5,
                     callbacks=[checkpoint_callback])

As an aside: Why doesn't neptune just cancel an upload if the checkpoint file is removed, instead of crashing completely? Like I said, this doesn't only happen when the connection is bad but also when training happens too fast.

Blaizzy commented 2 years ago

@alvitawa the problem happens because of using save_top_k.

You can save the best model using any of the options in my solution.

I have tested and validated them, please read it again: https://github.com/neptune-ai/neptune-client/issues/884#issuecomment-1121607944

Blaizzy commented 2 years ago

You are right @alvitawa, it should give a warning and continue instead of crashing. That's because when training happens too fast, the checkpoint file gets overwritten and the version the neptune-client wants to upload no longer exists.

This is a known issue and we have it in our backlog.

As I mentioned here:

https://github.com/neptune-ai/neptune-client/issues/884#issuecomment-1121536603

Blaizzy commented 2 years ago

Re-opening

Upon closer inspection, I noticed that this bug arises when running scripts in GPU environments.

shnela commented 1 year ago

We've dropped the size limit for in-memory uploaded files in neptune-client 0.16.12, so here's a nice workaround.

The sample code below copies the original file to a location in the .neptune directory, and the client will remove the copied entry once the upload is finished. You can even manually delete the checkpoint right after the asynchronous upload is scheduled. Since streams are used, this method works even for files larger than the available memory.

import os
from neptune.new.types import File

checkpoint_path = "str file path of the checkpoint"

with open(checkpoint_path, "rb") as f:
    # a copy of the original file is stored on disk and removed after the asynchronous upload finishes
    self._run[path] = File.from_stream(f)
os.remove(checkpoint_path)

The code should be used in our internal PyTorch Lightning integration repository.

SiddhantSadangi commented 1 year ago

Hello @fanqiehc , @bartoszptak πŸ‘‹

Sorry for the late update here.

This should have been fixed in the latest release of Lightning (v2.0.4) βœ…

Additionally, Lightning 2.0.4 includes support for Neptune 1.0+ πŸŽ‰ I am including our neptune 1.0 upgrade guide to help you move from neptune-client<1.0 to neptune>=1.0.
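
For reference, the core of that upgrade is the import and initialization change (a rough sketch, assuming the standard neptune>=1.0 API; the project name is a placeholder):

```python
# Before (neptune-client < 1.0):
#   import neptune.new as neptune
#   run = neptune.init(project="workspace/project")

# After (neptune >= 1.0):
import neptune

run = neptune.init_run(project="workspace/project")  # API token read from NEPTUNE_API_TOKEN
run["sys/tags"].add("upgraded-client")  # the metadata logging API itself is unchanged
run.stop()
```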

Please let me know if this solves your issue πŸ™