unit8co / darts

A Python library for user-friendly forecasting and anomaly detection on time series.
https://unit8co.github.io/darts/
Apache License 2.0

Training and validation loss over the epochs #811

Closed: mehrdadfazli closed 2 years ago

mehrdadfazli commented 2 years ago

Hi there!

First of all, thanks for such a wonderful library. I would like to request a feature: saving the training/validation loss for the models. For the deep learning models, it would be nice to see the training/validation loss over the epochs to be able to decide on an appropriate number of epochs. Something similar to model.history in Keras would be great. If this is already implemented, I would be thankful if you could point me to it, but I could not locate it in the models' attributes.

Thanks,

dennisbader commented 2 years ago

Hi @mehrdadfazli, you can write a custom PyTorch Lightning callback for that:

import pytorch_lightning as pl
from pytorch_lightning.callbacks import Callback

class LossLogger(Callback):
    def __init__(self):
        self.train_loss = []
        self.val_loss = []

    # will automatically be called at the end of each epoch
    def on_train_epoch_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
        self.train_loss.append(float(trainer.callback_metrics["train_loss"]))

    def on_validation_epoch_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
        self.val_loss.append(float(trainer.callback_metrics["val_loss"]))

loss_logger = LossLogger()

model = SomeTorchForecastingModel(
    ...,
    nr_epochs_val_period=1,  # perform validation after every epoch
    pl_trainer_kwargs={"callbacks": [loss_logger]}
)

# fit must include validation set for "val_loss"
model.fit(...)

Note that this will give you one extra element in loss_logger.val_loss, as the models perform a validation sanity check before training begins.
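
As a minimal sketch (assuming matplotlib is installed), you could then plot the learning curve from the recorded lists, dropping the leading sanity-check entry from val_loss:

import matplotlib.pyplot as plt

plt.plot(loss_logger.train_loss, label="train loss")
plt.plot(loss_logger.val_loss[1:], label="val loss")  # [1:] skips the sanity-check entry
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()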

mehrdadfazli commented 2 years ago

Thank you for your prompt response @dennisbader. The callback seems like an elegant way of getting the loss. However, I get the error below when I run my code for TCNModel or RNNModel.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_13862/2799376969.py in <module>
     13     likelihood=GaussianLikelihood(),
     14     nr_epochs_val_period=1,  # perform validation after every epoch
---> 15     pl_trainer_kwargs={"callbacks": [loss_logger]}
     16 #     model_name='DeepTCN-with-covars-test',
     17 #     force_reset=True,

~/.conda/envs/darts_env/lib/python3.7/site-packages/darts/models/forecasting/forecasting_model.py in __call__(cls, *args, **kwargs)
     39     def __call__(cls, *args, **kwargs):
     40         cls.model_call = (args, kwargs)
---> 41         return super(ModelMeta, cls).__call__(*args, **kwargs)
     42 
     43 

~/.conda/envs/darts_env/lib/python3.7/site-packages/darts/utils/torch.py in decorator(self, *args, **kwargs)
     68         with fork_rng():
     69             manual_seed(self._random_instance.randint(0, high=MAX_TORCH_SEED_VALUE))
---> 70             return decorated(self, *args, **kwargs)
     71 
     72     return decorator

~/.conda/envs/darts_env/lib/python3.7/site-packages/darts/models/forecasting/tcn_model.py in __init__(self, input_chunk_length, output_chunk_length, kernel_size, num_filters, num_layers, dilation_base, weight_norm, dropout, likelihood, random_state, **kwargs)
    382         kwargs["output_chunk_length"] = output_chunk_length
    383 
--> 384         super().__init__(likelihood=likelihood, **kwargs)
    385 
    386         self.input_chunk_length = input_chunk_length

~/.conda/envs/darts_env/lib/python3.7/site-packages/darts/models/forecasting/torch_forecasting_model.py in __init__(self, likelihood, **kwargs)
   1310             The likelihood model to be used for probabilistic forecasts.
   1311         """
-> 1312         super().__init__(**kwargs)
   1313         self.likelihood = likelihood
   1314 

TypeError: __init__() got an unexpected keyword argument 'pl_trainer_kwargs'

dennisbader commented 2 years ago

You need to upgrade your darts version to 0.17.1
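
For reference, assuming a pip-based environment, the upgrade would look like:

pip install -U "darts>=0.17.1"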

mehrdadfazli commented 2 years ago

Thank you so much, Dennis. That helped a lot.

turbotimon commented 1 year ago

@dennisbader it looks like this is a frequent question and I had it too (and spent some hours figuring it out without knowing about this issue), so my suggestion is to include this in the docs.

Suggestion 1: Adding an example (similar to yours) after https://unit8co.github.io/darts/userguide/torch_forecasting_models.html#early-stop

Suggestion 2: Implementing a class LearningRateMonitor() (and including it in the docs as well, of course)

I'm happy to send you a PR for whichever you prefer. If Suggestion 2, please tell me where to put it (e.g. darts.torch.LearningRateMonitor). I prefer Suggestion 2 as it fits "Focus on simplicity and clarity for end users."

dennisbader commented 1 year ago

Hi @turbotimon, if you want a learning rate monitor, that is already implemented in PyTorch-Lightning here

PyTorch Lightning is installed with Darts, so you can simply import it and add it to your callbacks, similar to what is shown above for the LossLogger.

from pytorch_lightning.callbacks import LearningRateMonitor
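
As a minimal sketch (with SomeTorchForecastingModel again a placeholder, as in the example above), it can be passed in the same way; note that LearningRateMonitor writes to the trainer's logger (e.g. the default TensorBoard logger) rather than storing values on the callback itself:

from pytorch_lightning.callbacks import LearningRateMonitor

lr_monitor = LearningRateMonitor(logging_interval="epoch")  # log the learning rate once per epoch

model = SomeTorchForecastingModel(
    ...,
    pl_trainer_kwargs={"callbacks": [lr_monitor]},
)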

It's a good idea to make users more aware of how they can use callbacks with Darts. Personally, I would prefer a small dedicated user guide on callbacks for our TorchForecastingModel rather than adding it to the model docs (those are only getting bigger and bigger). We could then reference this guide in the model docs.

This user guide would just make users aware of how they can use PyTorch Lightning callbacks with Darts models.

For the moment I would not add these callbacks to the Darts library as it means we would need to maintain those as well, and I'd rather leave this to PyTorch-Lightning or the user.

turbotimon commented 1 year ago

Hi Dennis, sorry, LearningRateMonitor was bad naming; what I meant was the learning curve (train/val loss), not the step size.

I agree with the extra user guide. I'll make a PR as soon as I find time, with an extra page that covers the things you mentioned above.

"...rather leave this to PyTorch-Lightning"

I was really surprised that Lightning doesn't already have something like "LossLogger" to easily visualize a learning curve, something I think is crucial in ML.

dennisbader commented 1 year ago

Sounds great, thanks @turbotimon.

dennisbader commented 1 year ago

@turbotimon, I opened a dedicated issue for this (#1576) along with some more information about where to add this information and what info it should cover.

94Sip commented 1 year ago

I would like to log losses to SageMaker Experiments. But what I don't understand is how to get the losses out of the Callback "state". Obviously the TensorboardLogger picks them up from the LossLogger state, but I can't find where in the code that happens.

And if I pass the SageMaker "run" object into the Callback, I get a complaint that it can't be pickled (pickling is required by PL).

Can you provide some advice on how to hook up logging losses to something other than Tensorboard?

turbotimon commented 1 year ago

@94Sip there is now an example of a LossLogger(Callback) here: https://unit8co.github.io/darts/userguide/torch_forecasting_models.html?highlight=losslogger#example-of-custom-callback-to-store-losses

With this example you can retrieve the losses via loss_logger.val_loss or loss_logger.train_loss. I hope this helps.
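
For the SageMaker case above, a minimal sketch (my_tracker and its log_metric method are hypothetical stand-ins for your experiment tracker's API) would be to keep the non-picklable run object out of the callback entirely and forward the stored floats after training:

model.fit(...)  # trained with a validation set, as above

# the callback state is just two Python lists of floats
for epoch, loss in enumerate(loss_logger.train_loss):
    my_tracker.log_metric("train_loss", loss, step=epoch)  # hypothetical tracker API
for epoch, loss in enumerate(loss_logger.val_loss[1:]):  # [1:] skips the sanity-check entry
    my_tracker.log_metric("val_loss", loss, step=epoch)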

eye4got commented 2 months ago

@turbotimon When I use the above example I get the following runtime error:

RuntimeError: Only Tensors created explicitly by the user (graph leaves) support the deepcopy protocol at the moment. If you were attempting to deepcopy a module, this may be because of a torch.nn.utils.weight_norm usage, see https://github.com/pytorch/pytorch/pull/103001

I have installed: torch==2.3.1+cu121 numpy==1.26.4 darts==0.29.0 tensorboard==2.16.2

Here is my code, mostly just to prove I copy-pasted the logger:

from darts.models import TFTModel
from darts.utils.likelihood_models import GaussianLikelihood
from pytorch_lightning.callbacks import Callback

class LossLogger(Callback):
    def __init__(self):
        self.train_loss = []
        self.val_loss = []

    # will automatically be called at the end of each epoch
    def on_train_epoch_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
        self.train_loss.append(float(trainer.callback_metrics["train_loss"]))

    def on_validation_epoch_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
        self.val_loss.append(float(trainer.callback_metrics["val_loss"]))

loss_logger = LossLogger()

# input_size, horizon, num_epochs and batch_size are defined earlier in my code
model_params = {
    'input_chunk_length': input_size,
    'output_chunk_length': horizon,
    'n_epochs': num_epochs,
    'batch_size': batch_size,
    'likelihood': GaussianLikelihood(),
    'use_static_covariates': False,
    'add_relative_index': True,
    'pl_trainer_kwargs': {"callbacks": [loss_logger]}
}

model = TFTModel(**model_params)

turbotimon commented 2 months ago

@eye4got not sure what happens here, but it doesn't seem to me that this is related to the LossLogger. Can you reproduce the error with a small and independent example, e.g. like this?

from darts.datasets import WeatherDataset
from darts.models import TFTModel
from pytorch_lightning.callbacks import Callback
series = WeatherDataset().load()
# predicting atmospheric pressure
target = series['p (mbar)'][:100]
# optionally, past observed rainfall (pretending to be unknown beyond index 100)
past_cov = series['rain (mm)'][:100]
# future temperatures (pretending this component is a forecast)
future_cov = series['T (degC)'][:106]
# loss logger setup
class LossLogger(Callback):
    def __init__(self):
        self.train_loss = []
        self.val_loss = []
    # will automatically be called at the end of each epoch
    def on_train_epoch_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
        self.train_loss.append(float(trainer.callback_metrics["train_loss"]))
    def on_validation_epoch_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
        self.val_loss.append(float(trainer.callback_metrics["val_loss"]))
loss_logger = LossLogger()
pl_trainer_kwargs = {"callbacks": [loss_logger]}
# by default, TFTModel is trained using a `QuantileRegression` making it a probabilistic forecasting model
model = TFTModel(
    input_chunk_length=6,
    output_chunk_length=6,
    n_epochs=5,
    nr_epochs_val_period=1,  # perform validation after every epoch
    pl_trainer_kwargs=pl_trainer_kwargs,
)
# future_covariates are mandatory for `TFTModel`
model.fit(target, past_covariates=past_cov, future_covariates=future_cov)
# TFTModel is probabilistic by definition; using `num_samples >> 1` to generate probabilistic forecasts
pred = model.predict(6, num_samples=100)
# shape: (forecast horizon, components, num_samples)
pred.all_values().shape  # (6, 1, 100)
# showing the first 3 samples for each timestamp
pred.all_values()[:,:,:3]
print(loss_logger.train_loss, loss_logger.val_loss)

tested with:

darts==0.30.0
pytorch-lightning==2.3.1

eye4got commented 2 months ago

Hi @turbotimon

I have narrowed it down to some kind of interaction between the LossLogger and recreating the model. The following example will cause the error. Commenting out the pl_trainer_kwargs parameter, which attaches the loss logger, stops the error from being raised. Similarly, if I skip the initialise-the-model, test-the-learning-rate, reinitialise-the-model sequence, the LossLogger works fine.

from darts.datasets import WeatherDataset
from darts.models import TFTModel
from pytorch_lightning.callbacks import Callback

series = WeatherDataset().load()
# predicting atmospheric pressure
target = series['p (mbar)'][:100]
# optionally, past observed rainfall (pretending to be unknown beyond index 100)
past_cov = series['rain (mm)'][:100]
# future temperatures (pretending this component is a forecast)
future_cov = series['T (degC)'][:106]
# loss logger setup
class LossLogger(Callback):
    def __init__(self):
        self.train_loss = []
        self.val_loss = []
    # will automatically be called at the end of each epoch
    def on_train_epoch_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
        self.train_loss.append(float(trainer.callback_metrics["train_loss"]))
    def on_validation_epoch_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
        self.val_loss.append(float(trainer.callback_metrics["val_loss"]))
loss_logger = LossLogger()
pl_trainer_kwargs = {"callbacks": [loss_logger]}
# by default, TFTModel is trained using a `QuantileRegression` making it a probabilistic forecasting model
model_params = {
    'input_chunk_length': 6,
    'output_chunk_length': 6,
    'n_epochs': 5,
    'nr_epochs_val_period': 1,  # perform validation after every epoch
    'pl_trainer_kwargs': pl_trainer_kwargs,
}

model = TFTModel(**model_params)
lr_finder = model.lr_find(series=target, past_covariates=past_cov, future_covariates=future_cov)
lr = lr_finder.suggestion()

model_params['optimizer_kwargs'] = {'lr': lr}
model = TFTModel(**model_params)

# future_covariates are mandatory for `TFTModel`
model.fit(target, past_covariates=past_cov, future_covariates=future_cov)
# TFTModel is probabilistic by definition; using `num_samples >> 1` to generate probabilistic forecasts
pred = model.predict(6, num_samples=100)
# shape: (forecast horizon, components, num_samples)
pred.all_values().shape  # (6, 1, 100)
# showing the first 3 samples for each timestamp
pred.all_values()[:,:,:3]
print(loss_logger.train_loss, loss_logger.val_loss)

turbotimon commented 2 months ago

@eye4got The solution is to create a new logger when you recreate the model:

...
model = TFTModel(**model_params)
lr_finder = model.lr_find(series=target, past_covariates=past_cov, future_covariates=future_cov)
lr = lr_finder.suggestion()

model_params['optimizer_kwargs'] = {'lr': lr}
model_params['pl_trainer_kwargs'] = {"callbacks": [LossLogger()]} # <-- New logger !!!

model = TFTModel(**model_params)
...

The model init wants to make a deep copy (copy.deepcopy) of the logger, and for some reason that fails with an already-used logger (and with Callbacks in general). But you probably want a new, clean logger anyway if you reinitialize the model; otherwise you would mix up logged values from lr_find with those from your real training.
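
If you recreate models often, a hypothetical convenience wrapper (make_model is my name, not part of Darts) can ensure each model instance always gets its own fresh logger:

# hypothetical helper: pair every model instance with a fresh LossLogger
# (params must not already contain pl_trainer_kwargs)
def make_model(params):
    logger = LossLogger()
    return TFTModel(**params, pl_trainer_kwargs={"callbacks": [logger]}), logger

model, loss_logger = make_model(model_params)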