neptune-ai / neptune-client

📘 The experiment tracker for foundation model training
https://neptune.ai
Apache License 2.0

PyTorch Lightning Integration #413

Closed JonasFrey96 closed 3 years ago

JonasFrey96 commented 3 years ago

Hello,

When I reload an experiment and continue training the network, Neptune fails when logging to existing channels. I should also mention that I am behind a proxy and have modified the NeptuneLogger of PyTorch Lightning:

Best Jonas

import neptune
from pytorch_lightning.loggers.neptune import NeptuneLogger


def _create_or_get_experiment2(self):
    # Route Neptune traffic through the cluster proxy.
    proxies = {
        'http': 'http://magic.xyz:3128',
        'https': 'http://magic.xyz:3128',
    }
    if self.offline_mode:
        project = neptune.Session(backend=neptune.OfflineBackend()).get_project('dry-run/project')
    else:
        # project_qualified_name='jonasfrey96/ASL', api_token=os.environ["NEPTUNE_API_TOKEN"], proxies=proxies
        session = neptune.init(project_qualified_name='jonasfrey96/ASL', api_token=self.api_key, proxies=proxies)  # add your credentials
        print(type(session))
        session = neptune.Session(api_token=self.api_key, proxies=proxies)
        project = session.get_project(self.project_name)

    if self.experiment_id is None:
        # Fresh run: create a new experiment and remember its ID.
        e = project.create_experiment(name=self.experiment_name, **self._kwargs)
        self.experiment_id = e.id
    else:
        # Resumed run: fetch the existing experiment and mirror its metadata.
        e = project.get_experiments(id=self.experiment_id)[0]
        self.experiment_name = e.get_system_properties()['name']
        self.params = e.get_parameters()
        self.properties = e.get_properties()
        self.tags = e.get_tags()
    return e


NeptuneLogger._create_or_get_experiment = _create_or_get_experiment2  # Super bad !!!

Here is the Traceback:

Traceback (most recent call last):
  File "/cluster/home/jonfrey/miniconda3/envs/track4/lib/python3.7/site-packages/bravado/http_future.py", line 337, in unmarshal_response
    op=operation,
  File "/cluster/home/jonfrey/miniconda3/envs/track4/lib/python3.7/site-packages/bravado/http_future.py", line 370, in unmarshal_response_inner
    response_spec = get_response_spec(status_code=response.status_code, op=op)
  File "/cluster/home/jonfrey/miniconda3/envs/track4/lib/python3.7/site-packages/bravado_core/response.py", line 160, in get_response_spec
    "status_code or use a `default` response.".format(status_code, op),
bravado_core.exception.MatchingResponseNotFound: Response specification matching http status_code 409 not found for operation Operation(createChannel). Either add a response specification for the status_code or use a `default` response.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/cluster/home/jonfrey/miniconda3/envs/track4/lib/python3.7/site-packages/neptune/internal/backends/hosted_neptune_backend.py", line 483, in create_channel
    channelToCreate=params
  File "/cluster/home/jonfrey/miniconda3/envs/track4/lib/python3.7/site-packages/bravado/http_future.py", line 200, in response
    swagger_result = self._get_swagger_result(incoming_response)
  File "/cluster/home/jonfrey/miniconda3/envs/track4/lib/python3.7/site-packages/bravado/http_future.py", line 124, in wrapper
    return func(self, *args, **kwargs)
  File "/cluster/home/jonfrey/miniconda3/envs/track4/lib/python3.7/site-packages/bravado/http_future.py", line 303, in _get_swagger_result
    self.request_config.response_callbacks,
  File "/cluster/home/jonfrey/miniconda3/envs/track4/lib/python3.7/site-packages/bravado/http_future.py", line 347, in unmarshal_response
    sys.exc_info()[2])
  File "/cluster/home/jonfrey/miniconda3/envs/track4/lib/python3.7/site-packages/six.py", line 702, in reraise
    raise value.with_traceback(tb)
  File "/cluster/home/jonfrey/miniconda3/envs/track4/lib/python3.7/site-packages/bravado/http_future.py", line 337, in unmarshal_response
    op=operation,
  File "/cluster/home/jonfrey/miniconda3/envs/track4/lib/python3.7/site-packages/bravado/http_future.py", line 370, in unmarshal_response_inner
    response_spec = get_response_spec(status_code=response.status_code, op=op)
  File "/cluster/home/jonfrey/miniconda3/envs/track4/lib/python3.7/site-packages/bravado_core/response.py", line 160, in get_response_spec
    "status_code or use a `default` response.".format(status_code, op),
bravado.exception.HTTPConflict: 409 : Response specification matching http status_code 409 not found for operation Operation(createChannel). Either add a response specification for the status_code or use a `default` response.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 482, in <module>
    val_dataloaders= dataloader_list_test)
  File "/cluster/home/jonfrey/miniconda3/envs/track4/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 513, in fit
    self.dispatch()
  File "/cluster/home/jonfrey/miniconda3/envs/track4/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 553, in dispatch
    self.accelerator.start_training(self)
  File "/cluster/home/jonfrey/miniconda3/envs/track4/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 74, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/cluster/home/jonfrey/miniconda3/envs/track4/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 111, in start_training
    self._results = trainer.run_train()
  File "/cluster/home/jonfrey/miniconda3/envs/track4/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 644, in run_train
    self.train_loop.run_training_epoch()
  File "/cluster/home/jonfrey/miniconda3/envs/track4/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 559, in run_training_epoch
    epoch_output, self.checkpoint_accumulator, self.early_stopping_accumulator, self.num_optimizers
  File "/cluster/home/jonfrey/miniconda3/envs/track4/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py", line 469, in log_train_epoch_end_metrics
    self.log_metrics(epoch_log_metrics, {})
  File "/cluster/home/jonfrey/miniconda3/envs/track4/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py", line 244, in log_metrics
    self.trainer.logger.save()
  File "/cluster/home/jonfrey/miniconda3/envs/track4/lib/python3.7/site-packages/pytorch_lightning/loggers/base.py", line 302, in save
    self._finalize_agg_metrics()
  File "/cluster/home/jonfrey/miniconda3/envs/track4/lib/python3.7/site-packages/pytorch_lightning/loggers/base.py", line 145, in _finalize_agg_metrics
    self.log_metrics(metrics=metrics_to_log, step=agg_step)
  File "/cluster/home/jonfrey/miniconda3/envs/track4/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py", line 40, in wrapped_fn
    return fn(*args, **kwargs)
  File "/cluster/home/jonfrey/miniconda3/envs/track4/lib/python3.7/site-packages/pytorch_lightning/loggers/neptune.py", line 259, in log_metrics
    self.log_metric(key, val)
  File "/cluster/home/jonfrey/miniconda3/envs/track4/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py", line 40, in wrapped_fn
    return fn(*args, **kwargs)
  File "/cluster/home/jonfrey/miniconda3/envs/track4/lib/python3.7/site-packages/pytorch_lightning/loggers/neptune.py", line 302, in log_metric
    self.experiment.log_metric(metric_name, metric_value)
  File "/cluster/home/jonfrey/miniconda3/envs/track4/lib/python3.7/site-packages/neptune/experiments.py", line 375, in log_metric
    self._channels_values_sender.send(log_name, ChannelType.NUMERIC.value, value)
  File "/cluster/home/jonfrey/miniconda3/envs/track4/lib/python3.7/site-packages/neptune/internal/channels/channels_values_sender.py", line 66, in send
    response = self._experiment._create_channel(channel_name, channel_type, channel_namespace)
  File "/cluster/home/jonfrey/miniconda3/envs/track4/lib/python3.7/site-packages/neptune/experiments.py", line 1195, in _create_channel
    return self._backend.create_channel(self, channel_name, channel_type)
  File "/cluster/home/jonfrey/miniconda3/envs/track4/lib/python3.7/site-packages/neptune/utils.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/cluster/home/jonfrey/miniconda3/envs/track4/lib/python3.7/site-packages/neptune/internal/backends/hosted_neptune_backend.py", line 492, in create_channel
    raise ChannelAlreadyExists(channel_name=name, experiment_short_id=experiment.id)

PiotrJander commented 3 years ago

Hi Jonas,

Resuming an experiment is not supported in the current version of Neptune. A new major version of Neptune is coming out soon that will support resuming experiments, so please keep an eye on new Neptune versions.

In the meantime, the error you're seeing appears to be caused by an attempt to create a channel in an experiment where a channel of that name already exists. This is probably due to the way you modified Neptune.

BTW - was there any missing functionality in Neptune which made you tweak your own version?
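
The failure mode described above can be sketched without Neptune itself. The toy backend below is a hypothetical in-memory stand-in (`ToyBackend`, `get_or_create_channel`, and this local `ChannelAlreadyExists` are illustrative names, not the client's API). Creating a channel twice raises, mirroring the HTTP 409 in the traceback, while a get-or-create wrapper reuses the existing channel, which is roughly what a resumed run needs.

```python
class ChannelAlreadyExists(Exception):
    """Local stand-in for the error at the bottom of the traceback."""


class ToyBackend:
    """Hypothetical in-memory backend; NOT the real Neptune backend."""

    def __init__(self):
        self._channels = {}  # channel name -> list of logged values

    def create_channel(self, name):
        # A real server would answer a duplicate create with HTTP 409.
        if name in self._channels:
            raise ChannelAlreadyExists(name)
        self._channels[name] = []
        return self._channels[name]

    def get_channel(self, name):
        return self._channels[name]


def get_or_create_channel(backend, name):
    """Reuse an existing channel instead of failing on a duplicate create."""
    try:
        return backend.create_channel(name)
    except ChannelAlreadyExists:
        return backend.get_channel(name)


backend = ToyBackend()
get_or_create_channel(backend, "train_loss").append(0.9)  # first run
get_or_create_channel(backend, "train_loss").append(0.7)  # resumed run
```

Both values end up in the same channel, whereas a second plain `create_channel("train_loss")` call would raise.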

JonasFrey96 commented 3 years ago

Hello, yes: in the PyTorch Lightning integration you are not able to set a proxy. I checked, and the issue is not due to the proxy but due to the channel problem. Fixing this would be great.

Best Jonas

PiotrJander commented 3 years ago

Hi Jonas,

Sorry to hear you needed to tweak Neptune-PTL integration to use a proxy - we've added fixing this to our backlog.
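
Until the integration accepts a proxy setting, one way to avoid hardcoding the proxy URL in a patched logger is to read it from the environment. This is a minimal sketch, assuming the conventional `http_proxy` / `https_proxy` variables are set on the cluster. `proxies_from_env` is a hypothetical helper, and the resulting dict has the same shape as the `proxies` argument passed to `neptune.Session` in the snippet earlier in this thread.

```python
import os


def proxies_from_env(env=None):
    """Build a {'http': ..., 'https': ...} dict from proxy environment variables.

    Hypothetical helper: lowercase names take priority over uppercase ones,
    and schemes without a configured proxy are simply left out.
    """
    env = os.environ if env is None else env
    proxies = {}
    for scheme in ("http", "https"):
        value = env.get(f"{scheme}_proxy") or env.get(f"{scheme.upper()}_PROXY")
        if value:
            proxies[scheme] = value
    return proxies


# Example with an explicit mapping instead of the real environment.
example = proxies_from_env({"HTTPS_PROXY": "http://magic.xyz:3128"})
```

Passing an explicit mapping keeps the helper easy to check in isolation; calling it with no argument reads `os.environ`.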

tdvginz commented 3 years ago

Having the ability to resume experiments is very much needed :) It is often the case that a machine dies in the middle of a run, and a clean API to restart from the point of failure is helpful.

Thanks!

PiotrJander commented 3 years ago

Hi Jonas, just to clarify: resuming experiments is possible in the current version of Neptune - you can log metrics to an old experiment, etc. What the upcoming version will add on top of that is updating the state of the experiment when it is resumed.

tdvginz commented 3 years ago

@PiotrJander What's the best practice for doing so with PyTorch Lightning? I.e., given that I have a crashed experiment and I wish to resume it (e.g., using the resume_from_checkpoint flag), how should I load the old experiment? With Comet you can revive an experiment by passing the old ID to the constructor.

Thanks!

PiotrJander commented 3 years ago

Hi @tdvginz

It is possible to resume PyTorch Lightning experiments by passing the experiment_id parameter when constructing a NeptuneLogger instance:

from pytorch_lightning.loggers.neptune import NeptuneLogger

neptune_logger = NeptuneLogger(
    api_key="...",
    project_name="shared/pytorch-lightning-integration",
    experiment_id='PYTOR-163701')
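
For runs that may be interrupted, the experiment ID has to survive the crash so that it can be passed back as `experiment_id`. This is a minimal sketch, assuming the ID is kept in a small text file next to the checkpoints. The helper names and the file name are hypothetical, and the `NeptuneLogger` call is shown only as a comment because it needs the real libraries.

```python
from pathlib import Path
from typing import Optional


def load_experiment_id(path: Path) -> Optional[str]:
    """Return the stored experiment ID, or None on a fresh run."""
    if path.exists():
        stored = path.read_text().strip()
        return stored or None
    return None


def save_experiment_id(path: Path, experiment_id: str) -> None:
    """Persist the ID so a restarted job can find it."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(experiment_id)


# Usage sketch (assumes pytorch-lightning and neptune-client are installed):
# id_file = Path("checkpoints/neptune_experiment_id.txt")  # hypothetical location
# logger = NeptuneLogger(api_key="...", project_name="...",
#                        experiment_id=load_experiment_id(id_file))
# save_experiment_id(id_file, logger.experiment.id)
```

On the first run the file is absent, so `experiment_id` is `None` and a new experiment is created; on a restart the stored ID is reused.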

JonasFrey96 commented 3 years ago

Thanks a lot for your kind response.

Best Jonas

PrzemekPobrotyn commented 3 years ago

@PiotrJander hi, I'm currently looking into a similar issue: using Neptune with PyTorch Lightning to log experiments run on spot (preemptible) instances. As far as I understand from the docs, the new API does not support the PL integration yet, does it? And if I were to use the old one, a resumed experiment would not have its status updated, nor its stdout and stderr logs or hardware consumption. In a reply above you wrote that the new API fixes updating the status of a resumed experiment. How about hardware consumption metrics, though? Would these resume tracking in a resumed experiment in the new API?

Best, Przemek

shnela commented 3 years ago

Hi @PrzemekPobrotyn

Support for the pytorch-lightning integration is currently waiting to be merged into the pytorch-lightning repo. Unfortunately, due to a major version update in pytorch-lightning, feature updates were frozen. I hope that Neptune support will be published in the next release.

You can monitor the status of this feature here: https://github.com/PyTorchLightning/pytorch-lightning/pull/6867

Best Jakub

PrzemekPobrotyn commented 3 years ago

Hi,

thanks for the quick reply. How about GPU usage monitoring in a resumed experiment? Is this supported in the new API? How about logging stdout & stderr?

Best, Przemek

aniezurawski commented 3 years ago

Yes, it is. Resuming is fully supported in the new API. :)

PrzemekPobrotyn commented 3 years ago

awesome, can't wait! thanks!

Herudaio commented 3 years ago

@PrzemekPobrotyn we've released the PyTorch Lightning integration as a separate package for now: https://docs.neptune.ai/integrations-and-supported-tools/model-training/pytorch-lightning

Note that this is an update of the current integration to the new API, not a full rewrite. It does take advantage of some of the new features, such as the hierarchical structure, but we will be iterating on the "UX" of the integration, so please note that there may be breaking changes once we release the next (1.0+) version of this integration.
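
As a rough illustration of what a hierarchical structure means for logged values, the sketch below stores values under slash-separated paths in nested dictionaries. This is a toy model with hypothetical names (`set_path`, `run`), not the integration's implementation, and the path separator is an assumption.

```python
def set_path(tree: dict, path: str, value) -> None:
    """Store value under a slash-separated path, creating nested dicts as needed."""
    *parents, leaf = path.split("/")
    node = tree
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


run = {}  # stands in for a run's metadata tree
set_path(run, "metrics/train/loss", 0.42)
set_path(run, "metrics/val/loss", 0.57)
set_path(run, "params/lr", 1e-3)
```

Related values end up grouped under a common prefix (`metrics`, `params`) instead of living in one flat list of channel names.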