Closed JonasFrey96 closed 3 years ago
Hi Jonas,
Resuming experiments is not supported in the current version of Neptune. A new major version of Neptune, which will support resuming experiments, is coming out soon - please keep an eye on new Neptune releases.
In the meantime, the error you're seeing appears to be caused by an attempt to create a channel in an experiment where a channel of that name already exists. This is probably due to the way you modified Neptune.
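To make the failure mode concrete, here is an illustrative sketch (not Neptune's actual implementation, and the class and method names are invented for this example): a channel registry that raises when a channel name is created twice, which is the error described above, plus a get-or-create variant that would make resuming safe.

```python
# Illustrative sketch only - not Neptune's real code.
class ChannelRegistry:
    """Tracks named metric channels for one experiment."""

    def __init__(self):
        self._channels = {}

    def create_channel(self, name):
        # Strict creation: a resumed run that re-creates "loss" fails here,
        # which is the behaviour reported in this issue.
        if name in self._channels:
            raise ValueError(f"Channel '{name}' already exists")
        self._channels[name] = []
        return self._channels[name]

    def get_or_create_channel(self, name):
        # Reusing the existing channel instead of re-creating it lets a
        # resumed run keep appending metrics to the same series.
        return self._channels.setdefault(name, [])
```

The resume-safe variant simply returns the existing channel on a name collision instead of raising.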
BTW - was there any missing functionality in Neptune which made you tweak your own version?
Hello, yes: in the PyTorch Lightning integration you are not able to set a proxy. I checked, and the issue is not caused by the proxy but by the channel handling. Fixing this would be great.
Best Jonas
Hi Jonas,
Sorry to hear you needed to tweak Neptune-PTL integration to use a proxy - we've added fixing this to our backlog.
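Until proxy support lands in the integration itself, one possible workaround (a sketch, assuming Neptune's HTTP stack honors the standard proxy environment variables, as most Python HTTP clients do; the proxy URL is a placeholder) is to set those variables before the logger is created, rather than patching NeptuneLogger:

```python
import os

def configure_proxy(http_proxy: str, https_proxy: str) -> None:
    """Route HTTP(S) traffic through a proxy via the standard env vars.

    Must run before any HTTP client (including Neptune's) is initialized.
    """
    os.environ["HTTP_PROXY"] = http_proxy
    os.environ["HTTPS_PROXY"] = https_proxy

# Placeholder proxy address - replace with your corporate proxy.
configure_proxy("http://proxy.example.com:3128",
                "http://proxy.example.com:3128")
```

Whether this works depends on the HTTP client Neptune uses respecting `HTTP_PROXY`/`HTTPS_PROXY`, so treat it as an assumption to verify in your environment.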
Having the ability to resume experiments is very much needed :) It is often the case that a machine dies in the middle of a run, and a clean API to restart from the failing point would be helpful.
Thanks!
Hi Jonas, just to clarify: resuming experiments is possible in the current version of Neptune - you can log metrics to an old experiment, etc. What the upcoming version will add on top of that is also updating the state of the experiment when it is resumed.
@PiotrJander What's the best practice to do so using PyTorch Lightning?
I.e., given that I have a crashed experiment and wish to resume it (e.g. using the resume_from_checkpoint flag), how should I load the old experiment?
With Comet you can revive an experiment by passing the old ID to the constructor.
Thanks!
Hi @tdvginz,
It is possible to resume PyTorch Lightning experiments by passing the experiment_id parameter when constructing a NeptuneLogger instance:
from pytorch_lightning.loggers.neptune import NeptuneLogger

neptune_logger = NeptuneLogger(
    api_key="...",
    project_name="shared/pytorch-lightning-integration",
    experiment_id="PYTOR-163701",
)
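A common way to wire this into a crash-tolerant training script is to persist the experiment ID alongside the checkpoint and only pass experiment_id when a previous run exists. The sketch below illustrates that decision logic in plain Python (the ID-file name and the returned kwargs dict are assumptions for illustration; the kwargs mirror the NeptuneLogger arguments shown above):

```python
from pathlib import Path

# Hypothetical file where a run stores its Neptune experiment ID.
ID_FILENAME = "neptune_experiment_id.txt"

def logger_kwargs(run_dir: str, project_name: str) -> dict:
    """Build NeptuneLogger kwargs: resume if a previous run left an ID."""
    id_file = Path(run_dir) / ID_FILENAME
    kwargs = {"project_name": project_name}
    if id_file.exists():
        # A previous run crashed mid-training: reuse its experiment so
        # metrics keep appending to the same channels.
        kwargs["experiment_id"] = id_file.read_text().strip()
    return kwargs
```

You would then splat the result into the constructor, e.g. `NeptuneLogger(api_key="...", **logger_kwargs(run_dir, "shared/pytorch-lightning-integration"))`, alongside the Trainer's resume_from_checkpoint flag.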
Thanks a lot for your kind response.
Best Jonas
@PiotrJander hi, I'm currently looking into a similar issue - using Neptune with PyTorch Lightning to log experiments run on spot (preemptible) instances. As far as I understand the docs, the new API does not support the PL integration yet, does it? And if I were to use the old one, a resumed experiment would not have its status updated, nor its stdout and stderr logs or hardware consumption. In a reply above you've written that the new API fixes updating the status of the resumed experiment. How about hardware consumption metrics, though? Would these resume tracking in a resumed experiment in the new API?
Best, Przemek
Hi @PrzemekPobrotyn
Support for the pytorch-lightning integration is currently waiting to be merged into the pytorch-lightning repo.
Unfortunately, due to the major version update in pytorch-lightning, feature updates were frozen.
I hope that neptune support will be published in the next release.
You can monitor status of this feature here: https://github.com/PyTorchLightning/pytorch-lightning/pull/6867 .
Best Jakub
Hi,
thanks for the quick reply. How about GPU usage monitoring in a resumed experiment? Is this supported in the new API? How about logging stdout & stderr?
Best, Przemek
Yes, it is. Resuming is fully supported in the new API. :)
awesome, can't wait! thanks!
@PrzemekPobrotyn we've released the PyTorch Lightning integration as a separate package for now: https://docs.neptune.ai/integrations-and-supported-tools/model-training/pytorch-lightning
Note that this is an update of the current integration with the new API, not a full re-write. It does take advantage of some of the new stuff like hierarchical structure, but we will be iterating on the "UX" of the integration so please note that there may be breaking changes once we release next (1.0+) version of this integration.
Hello,
When reloading an experiment and continuing to train the network, Neptune fails when logging to existing channels. I should also mention that I am behind a proxy and have modified the NeptuneLogger of PyTorch Lightning:
Best Jonas
Here is the Traceback: