Closed mehrdadfazli closed 2 years ago
Hi @mehrdadfazli, you can write a custom PyTorch Lightning callback for that:
from pytorch_lightning.callbacks import Callback

class LossLogger(Callback):
    def __init__(self):
        self.train_loss = []
        self.val_loss = []

    def on_train_epoch_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
        self.train_loss.append(float(trainer.callback_metrics["train_loss"]))

    def on_validation_epoch_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
        self.val_loss.append(float(trainer.callback_metrics["val_loss"]))

loss_logger = LossLogger()
model = SomeTorchForecastingModel(
    ...,
    nr_epochs_val_period=1,  # perform validation after every epoch
    pl_trainer_kwargs={"callbacks": [loss_logger]}
)
# fit must include validation set for "val_loss"
model.fit(...)
Note that this will give you one more element in the loss_logger.val_loss
as the models perform a validation sanity check before training begins.
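If the extra element is unwanted, one option is to skip the sanity-check pass inside the callback. This is a minimal sketch, assuming your installed PyTorch Lightning exposes `trainer.sanity_checking` (recent versions do). The class name `LossLoggerNoSanity` is made up for illustration, and the `ImportError` fallback exists only so the snippet can be tried without Lightning installed.

```python
try:
    from pytorch_lightning.callbacks import Callback
except ImportError:  # fallback only so the sketch runs without Lightning
    Callback = object


class LossLoggerNoSanity(Callback):
    """Hypothetical variant of LossLogger that ignores the sanity check."""

    def __init__(self):
        super().__init__()
        self.train_loss = []
        self.val_loss = []

    def on_train_epoch_end(self, trainer, pl_module) -> None:
        self.train_loss.append(float(trainer.callback_metrics["train_loss"]))

    def on_validation_epoch_end(self, trainer, pl_module) -> None:
        # Skip the validation sanity check that runs before training starts.
        if getattr(trainer, "sanity_checking", False):
            return
        self.val_loss.append(float(trainer.callback_metrics["val_loss"]))
```

With this variant, `val_loss` should contain one entry per validated epoch, so it lines up with `train_loss` when validation runs every epoch.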
Thank you for your prompt response @dennisbader. The callback seems an elegant way of getting the loss. However, I get the error below when I run my code for TCNModel or RNNModel.
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/tmp/ipykernel_13862/2799376969.py in <module>
13 likelihood=GaussianLikelihood(),
14 nr_epochs_val_period=1, # perform validation after every epoch
---> 15 pl_trainer_kwargs={"callbacks": [loss_logger]}
16 # model_name='DeepTCN-with-covars-test',
17 # force_reset=True,
~/.conda/envs/darts_env/lib/python3.7/site-packages/darts/models/forecasting/forecasting_model.py in __call__(cls, *args, **kwargs)
39 def __call__(cls, *args, **kwargs):
40 cls.model_call = (args, kwargs)
---> 41 return super(ModelMeta, cls).__call__(*args, **kwargs)
42
43
~/.conda/envs/darts_env/lib/python3.7/site-packages/darts/utils/torch.py in decorator(self, *args, **kwargs)
68 with fork_rng():
69 manual_seed(self._random_instance.randint(0, high=MAX_TORCH_SEED_VALUE))
---> 70 return decorated(self, *args, **kwargs)
71
72 return decorator
~/.conda/envs/darts_env/lib/python3.7/site-packages/darts/models/forecasting/tcn_model.py in __init__(self, input_chunk_length, output_chunk_length, kernel_size, num_filters, num_layers, dilation_base, weight_norm, dropout, likelihood, random_state, **kwargs)
382 kwargs["output_chunk_length"] = output_chunk_length
383
--> 384 super().__init__(likelihood=likelihood, **kwargs)
385
386 self.input_chunk_length = input_chunk_length
~/.conda/envs/darts_env/lib/python3.7/site-packages/darts/models/forecasting/torch_forecasting_model.py in __init__(self, likelihood, **kwargs)
1310 The likelihood model to be used for probabilistic forecasts.
1311 """
-> 1312 super().__init__(**kwargs)
1313 self.likelihood = likelihood
1314
TypeError: __init__() got an unexpected keyword argument 'pl_trainer_kwargs'
You need to upgrade your darts version to 0.17.1
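To check whether an environment is affected before constructing the model, a small version comparison can help. This is only a sketch: the threshold `(0, 17, 1)` is taken from the reply above, the helper names are made up, and it assumes `darts.__version__` is available in your install.

```python
import re

MIN_DARTS = (0, 17, 1)  # version named in the reply above


def version_tuple(version: str) -> tuple:
    """Turn '0.17.1' into (0, 17, 1), keeping only the leading digits of each part."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def is_recent_enough(version: str, minimum: tuple = MIN_DARTS) -> bool:
    """Return True if `version` is at least `minimum`."""
    return version_tuple(version) >= minimum


if __name__ == "__main__":
    try:
        import darts  # assumption: darts exposes __version__
        print(darts.__version__, is_recent_enough(darts.__version__))
    except ImportError:
        print("darts is not installed")
```

If the check reports an older version, upgrading with `pip install -U darts` in the same environment should make `pl_trainer_kwargs` available.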
Thank you so much, Dennis. That helped a lot.
@dennisbader it looks like this is a frequent question, and I had it too (and spent some hours figuring it out before I found this issue). My suggestion is to include this in the docs:
Suggestion 1: Add an example (similar to yours) after https://unit8co.github.io/darts/userguide/torch_forecasting_models.html#early-stop
Suggestion 2: Implement a class LearningRateMonitor() (and include it in the docs as well, of course).
I'm happy to send you a PR for whichever you prefer. If Suggestion 2, please tell me where to put it (e.g. darts.torch.LearningRateMonitor). I prefer Suggestion 2 as it fits "Focus on simplicity and clarity for end users."
Hi @turbotimon, if you want a learning rate monitor, that is already implemented in PyTorch-Lightning here
PyTorch Lightning is installed with Darts, so you can simply import it and add it to your callbacks in the same way as shown above for LossLogger.
from pytorch_lightning.callbacks import LearningRateMonitor
It's a good idea to make users more aware of how they can use callbacks with Darts. Personally, I would prefer a small dedicated user guide on Callbacks for our TorchForecastingModel rather than adding it to the model docs (it's only getting bigger and bigger). And then we could reference this in the model docs.
This user guide would just make users aware that:
LossLogger
For the moment I would not add these callbacks to the Darts library as it means we would need to maintain those as well, and I'd rather leave this to PyTorch-Lightning or the user.
Hi Dennis, sorry, LearningRateMonitor was bad naming. What I meant was the learning curve (test/val loss), not the step size.
I agree with the extra user guide. I'll make a PR as soon as I find time, with an extra page that covers the things you mentioned above.
"...rather leave this to PyTorch-Lightning"
I was really surprised that Lightning doesn't already have something like "LossLogger" to easily visualize a learning curve, which I think is crucial in ML.
Sounds great, thanks @turbotimon.
@turbotimon, I opened a dedicated issue for this (#1576) along with some more information about where to add this information and what info it should cover.
I would like to log losses to Sagemaker Experiments. But what I don't understand is how do I get the losses from the Callback "state"? Obviously TensorboardLogger picks them up from the LossLogger state, but I can't find where in the code that happens.
And if I pass in the Sagemaker "run" object to the Callback, I get a complaint that it can't be pickled (required by PL).
Can you provide some advice on how to hook up logging losses to something other than Tensorboard?
@94Sip there is now an example of a LossLogger(Callback) here: https://unit8co.github.io/darts/userguide/torch_forecasting_models.html?highlight=losslogger#example-of-custom-callback-to-store-losses
With this example you can retrieve the losses via loss_logger.val_loss or loss_logger.train_loss. I hope this helps.
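Since the callback only stores plain floats, one way to get the values into a tracker other than TensorBoard is to replay them after `fit` has finished, so that no unpicklable object has to live inside the callback. The sketch below assumes a sink that accepts `(name, value, step)`. `export_losses` and `log_metric` are hypothetical names, not part of Darts, Lightning, or SageMaker, and the real call for your tracker would go where the lambda is.

```python
from typing import Callable, Iterable


def export_losses(
    train_loss: Iterable[float],
    val_loss: Iterable[float],
    log_metric: Callable[[str, float, int], None],
) -> None:
    """Replay per-epoch losses into any sink taking (name, value, step)."""
    for epoch, value in enumerate(train_loss):
        log_metric("train_loss", float(value), epoch)
    for epoch, value in enumerate(val_loss):
        log_metric("val_loss", float(value), epoch)


# Example sink: collect records in a list instead of calling a real tracker.
records = []
export_losses([0.9, 0.5], [1.0, 0.7], lambda n, v, s: records.append((n, v, s)))
print(records)
```

In practice you would pass `loss_logger.train_loss` and `loss_logger.val_loss`. Keep in mind that `val_loss` may hold one extra leading element from the validation sanity check, so the two series can be offset by one step.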
@turbotimon When I use the above example I get the following runtime error:
RuntimeError: Only Tensors created explicitly by the user (graph leaves) support the deepcopy protocol at the moment. If you were attempting to deepcopy a module, this may be because of a torch.nn.utils.weight_norm usage, see https://github.com/pytorch/pytorch/pull/103001
I have installed: torch==2.3.1+cu121 numpy==1.26.4 darts==0.29.0 tensorboard==2.16.2
Here is my code, mostly just to prove I copy pasted the logger:
from pytorch_lightning.callbacks import Callback

class LossLogger(Callback):
    def __init__(self):
        self.train_loss = []
        self.val_loss = []

    # will automatically be called at the end of each epoch
    def on_train_epoch_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
        self.train_loss.append(float(trainer.callback_metrics["train_loss"]))

    def on_validation_epoch_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
        self.val_loss.append(float(trainer.callback_metrics["val_loss"]))

loss_logger = LossLogger()
model_params = {
    'input_chunk_length': input_size,
    'output_chunk_length': horizon,
    'n_epochs': num_epochs,
    'batch_size': batch_size,
    'likelihood': GaussianLikelihood(),
    'use_static_covariates': False,
    'add_relative_index': True,
    'pl_trainer_kwargs': {"callbacks": [loss_logger]}
}
model = TFTModel(**model_params)
@eye4got I'm not sure what happens here, but it doesn't seem to me that this is related to the LossLogger. Can you reproduce this error with a small, independent example such as this one?
from darts.datasets import WeatherDataset
from darts.models import TFTModel
from pytorch_lightning.callbacks import Callback  # needed for LossLogger below

series = WeatherDataset().load()
# predicting atmospheric pressure
target = series['p (mbar)'][:100]
# optionally, past observed rainfall (pretending to be unknown beyond index 100)
past_cov = series['rain (mm)'][:100]
# future temperatures (pretending this component is a forecast)
future_cov = series['T (degC)'][:106]

# loss logger setup
class LossLogger(Callback):
    def __init__(self):
        self.train_loss = []
        self.val_loss = []

    # will automatically be called at the end of each epoch
    def on_train_epoch_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
        self.train_loss.append(float(trainer.callback_metrics["train_loss"]))

    def on_validation_epoch_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
        self.val_loss.append(float(trainer.callback_metrics["val_loss"]))

loss_logger = LossLogger()
pl_trainer_kwargs = {"callbacks": [loss_logger]}

# by default, TFTModel is trained using a `QuantileRegression` making it a probabilistic forecasting model
model = TFTModel(
    input_chunk_length=6,
    output_chunk_length=6,
    n_epochs=5,
    nr_epochs_val_period=1,  # perform validation after every epoch
    pl_trainer_kwargs=pl_trainer_kwargs,
)
# future_covariates are mandatory for `TFTModel`
model.fit(target, past_covariates=past_cov, future_covariates=future_cov)
# TFTModel is probabilistic by definition; using `num_samples >> 1` to generate probabilistic forecasts
pred = model.predict(6, num_samples=100)
# shape : (forecast horizon, components, num_samples)
pred.all_values().shape
# (6, 1, 100)
# showing the first 3 samples for each timestamp
pred.all_values()[:,:,:3]
print(loss_logger.train_loss, loss_logger.val_loss)
tested with:
darts==0.30.0
pytorch-lightning==2.3.1
Hi @turbotimon
I have narrowed it down to some kind of interaction between the LossLogger and recreating the model. The following example will cause the error. Commenting out the pl_trainer_kwargs parameter, which links the loss logger, stops this error from being raised. Similarly, if I don't initialise the model, test the learning rate, and then reinitialise the model, the LossLogger works fine.
from darts.datasets import WeatherDataset
from darts.models import TFTModel
from pytorch_lightning.callbacks import Callback

series = WeatherDataset().load()
# predicting atmospheric pressure
target = series['p (mbar)'][:100]
# optionally, past observed rainfall (pretending to be unknown beyond index 100)
past_cov = series['rain (mm)'][:100]
# future temperatures (pretending this component is a forecast)
future_cov = series['T (degC)'][:106]

# loss logger setup
class LossLogger(Callback):
    def __init__(self):
        self.train_loss = []
        self.val_loss = []

    # will automatically be called at the end of each epoch
    def on_train_epoch_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
        self.train_loss.append(float(trainer.callback_metrics["train_loss"]))

    def on_validation_epoch_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
        self.val_loss.append(float(trainer.callback_metrics["val_loss"]))

loss_logger = LossLogger()
pl_trainer_kwargs = {"callbacks": [loss_logger]}

# by default, TFTModel is trained using a `QuantileRegression` making it a probabilistic forecasting model
model_params = {
    'input_chunk_length': 6,
    'output_chunk_length': 6,
    'n_epochs': 5,
    'nr_epochs_val_period': 1,  # perform validation after every epoch
    'pl_trainer_kwargs': pl_trainer_kwargs,
}
model = TFTModel(**model_params)
lr_finder = model.lr_find(series=target, past_covariates=past_cov, future_covariates=future_cov)
lr = lr_finder.suggestion()
model_params['optimizer_kwargs'] = {'lr': lr}
model = TFTModel(**model_params)
# future_covariates are mandatory for `TFTModel`
model.fit(target, past_covariates=past_cov, future_covariates=future_cov)
# TFTModel is probabilistic by definition; using `num_samples >> 1` to generate probabilistic forecasts
pred = model.predict(6, num_samples=100)
# shape : (forecast horizon, components, num_samples)
pred.all_values().shape
# (6, 1, 100)
# showing the first 3 samples for each timestamp
pred.all_values()[:,:,:3]
print(loss_logger.train_loss, loss_logger.val_loss)
@eye4got The solution is to create a new logger when you recreate the model:
...
model = TFTModel(**model_params)
lr_finder = model.lr_find(series=target, past_covariates=past_cov, future_covariates=future_cov)
lr = lr_finder.suggestion()
model_params['optimizer_kwargs'] = {'lr': lr}
model_params['pl_trainer_kwargs'] = {"callbacks": [LossLogger()]} # <-- New logger !!!
model = TFTModel(**model_params)
...
The model init tries to make a deep copy (copy.deepcopy) of the logger, and for some reason that fails with an already used logger (and with callbacks in general). But you probably want a new, clean logger anyway if you reinitialize the model. Otherwise you mix up the values logged during lr_find with those from your real training.
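To make "one fresh callback per model instance" hard to get wrong, the parameter dict can be built by a small helper that creates the callbacks anew on every call. This is a sketch under assumptions: `fresh_params` is a hypothetical helper (not part of Darts), the `LossLogger` here is a trimmed stand-in for the one above, and the `ImportError` fallback is only there so the snippet runs without Lightning installed.

```python
try:
    from pytorch_lightning.callbacks import Callback
except ImportError:  # fallback only so the sketch runs without Lightning
    Callback = object


class LossLogger(Callback):
    """Trimmed stand-in for the LossLogger shown earlier in the thread."""

    def __init__(self):
        super().__init__()
        self.train_loss = []
        self.val_loss = []


def fresh_params(base_params, callback_factories, **overrides):
    """Copy the model parameters and attach newly created callbacks."""
    params = dict(base_params)  # shallow copy, the base dict is left untouched
    params.update(overrides)
    trainer_kwargs = dict(params.get("pl_trainer_kwargs") or {})
    # Call each factory so every model gets its own, unused callback objects.
    trainer_kwargs["callbacks"] = [make() for make in callback_factories]
    params["pl_trainer_kwargs"] = trainer_kwargs
    return params


base = {"input_chunk_length": 6, "output_chunk_length": 6, "n_epochs": 5}
lr_params = fresh_params(base, [LossLogger])  # for the lr_find model
fit_params = fresh_params(base, [LossLogger], optimizer_kwargs={"lr": 1e-3})
```

Each dict can then be passed as `TFTModel(**lr_params)` or `TFTModel(**fit_params)`. The logger for the real training run is `fit_params["pl_trainer_kwargs"]["callbacks"][0]`, and the learning rate of `1e-3` is a placeholder for the value suggested by `lr_find`.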
Hi there!
First of all, thanks for such a wonderful library. I would like to request a feature: saving the training/validation loss for the models. For the deep learning models, it would be nice to see the training/validation loss over the epochs to be able to decide on the appropriate number of epochs. Something similar to model.history in Keras would be nice. If you have already implemented it, I would be thankful if you could guide me, but I could not locate it in the models' attributes.
Thanks,