unit8co / darts

A python library for user-friendly forecasting and anomaly detection on time series.
https://unit8co.github.io/darts/

[Question] Validation metric different when trained model is rerun on validation set #2472

Open tniveej opened 1 month ago

tniveej commented 1 month ago

Hey guys, I'm facing a problem that's been driving me nuts. I am a beginner so please forgive me if there are any fundamental mistakes here. Any help is appreciated.

I am using the TiDE model to try to do some prediction (regression) on time series. I believe what I'm trying to do is transfer learning. I have multiple time series that I want to train the model on, and I then want to make predictions on a different set of similar time series. When I run the training, the model reports a validation MAE ≈ 0.01. However, when I make the model predict the validation set of each training series and manually calculate the MAE, it's more like MAE ≈ 0.18. The model also struggles to make any proper predictions (as I show near the end).

The nature of my data is as follows:

  1. There is one variable I would like to predict (I actually have two I would like to predict, but to simplify, I'm testing it on one first)
  2. There are a total of 2 static covariates for each time series
  3. There are 44 covariates for each time step of the series, which are input into the model as future covariates since they are known in their entirety at prediction time.

Now getting to the code. This is what I've done:


Model parameters:

device = "gpu" if torch.cuda.is_available() else "cpu"

# this setting stops training once the the validation loss has not decreased by more than 1e-5 for 10 epochs
early_stopping_args = {
    "monitor": "val_loss",
    "patience": 10,
    "min_delta": 1e-5,
    "mode": "min",
    "divergence_threshold": 0.8,
    "verbose": True,
}

# PyTorch Lightning Trainer arguments
pl_trainer_kwargs = {
    "max_epochs": 200,
    "accelerator": device,
    "callbacks": [
        EarlyStopping(
            **early_stopping_args,
        )
    ],
    "gradient_clip_val": 1,
}

# learning rate scheduler
lr_scheduler_cls = torch.optim.lr_scheduler.ExponentialLR
lr_scheduler_kwargs = {
    "gamma": 0.999,
}

#
model_args = {
    "input_chunk_length": 10,  # lookback window
    "output_chunk_length": 1,  # forecast/lookahead window
    "pl_trainer_kwargs": pl_trainer_kwargs,
    "lr_scheduler_cls": lr_scheduler_cls,
    "lr_scheduler_kwargs": lr_scheduler_kwargs,
    "likelihood": None,  # use a likelihood for probabilistic forecasts
    "save_checkpoints": True,  # checkpoint to retrieve the best performing model state,
    "force_reset": True,  # If set to True, any previously-existing model with the same name will be reset (all checkpoints will be discarded). Default: False.
    "batch_size": 32,
    "use_static_covariates": True,
    "random_state": 42,
    "hidden_size": 1024,
    "num_encoder_layers": 2,
    "num_decoder_layers": 4,
    "decoder_output_dim": 64,
    "temporal_decoder_hidden": 64,
    "dropout": 0.1,
    "use_layer_norm": True,
    "use_reversible_instance_norm": False,
    "temporal_width_past": 42,
    "temporal_width_future": 43,
}

dataloader_args = {
    "drop_last": True,
} 

Then I create the time series from a CSV, fill in the missing values using forward fill manually, and turn them into a list of TimeSeries. I then normalize the data using the default MinMax scaler. Here is an example of one plotted series:

[figure: the target series to predict, and the covariates for the same series]
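Roughly, the preparation step looks like the sketch below (simplified; the column names and "data.csv" are placeholders for my actual ones, and the train/validation split happens afterwards):

import pandas as pd
from darts import TimeSeries
from darts.dataprocessing.transformers import Scaler

df = pd.read_csv("data.csv")
df = df.ffill()  # forward-fill missing values manually

# placeholder column names: everything except time, id and target is a covariate
future_cov_cols = [c for c in df.columns if c not in ("time", "series_id", "target")]

series_list, cov_list = [], []
for _, g in df.groupby("series_id"):
    series_list.append(TimeSeries.from_dataframe(g, time_col="time", value_cols=["target"]))
    cov_list.append(TimeSeries.from_dataframe(g, time_col="time", value_cols=future_cov_cols))

# the default Scaler wraps sklearn's MinMaxScaler
target_scaler, cov_scaler = Scaler(), Scaler()
series_list = target_scaler.fit_transform(series_list)
cov_list = cov_scaler.fit_transform(cov_list)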

Next I find a learning rate. The reason it's wrapped in a function that iterates until it doesn't fail is that I reused the code from my hyperparameter tuning, and I didn't want a trial to fail just because it couldn't find a suitable learning rate:

model_tide = TiDEModel(
    **model_args,
    # log_tensorboard = True,
    model_name="Tide_best",
    loss_fn=torch.nn.MSELoss(),
)

def find_lr(model):
    max_lr = 0.1
    while True:
        try:
            lr_results = model.lr_find(
                series=train,
                future_covariates=train_cov,
                val_series=val,
                val_future_covariates=val_cov,
                dataloader_kwargs={
                    "drop_last": True,
                },
                max_samples_per_ts=200,
                min_lr=1e-08,
                max_lr=max_lr,
                verbose=True,
            )
            return lr_results.suggestion()

        except Exception:
            print("lr too big")
            max_lr = max_lr / 10

best_lr = find_lr(model_tide)
print(best_lr)

Next, I go ahead with the training:

from torchmetrics import (
    MetricCollection,
    MeanSquaredError,
    MeanAbsoluteError,
    MeanAbsolutePercentageError,
)

torch_metrics = MetricCollection(
    [MeanSquaredError(), MeanAbsoluteError(), MeanAbsolutePercentageError()]
)

pl_trainer_kwargs["callbacks"] = [
    EarlyStopping(
        **early_stopping_args,
    )
]

model_tide = TiDEModel(
    **model_args,
    model_name="Tide_best",
    log_tensorboard=True,
    loss_fn=torch.nn.MSELoss(),
    torch_metrics=torch_metrics,
    optimizer_kwargs={"lr": best_lr, "weight_decay": 0},
)

model_tide.fit(
    series=train,
    future_covariates=train_cov,
    val_series=val,
    val_future_covariates=val_cov,
    dataloader_kwargs=dataloader_args,
    # max_samples_per_ts = 150,
    verbose=True,
)

The training is wacky because it seems like there is no decrease in the loss, just fluctuation. Also, for some reason it always stops at the number of epochs set by the patience parameter of the EarlyStopping callback; 10 epochs in this case. But I think we can ignore that for now (?). Here are pictures showing the training loss and the validation loss + MAE:

[figures: training loss; validation loss and MAE]

Because I noticed that even with a low MAE the model is not able to make any meaningful predictions, I went ahead and ran predictions with the trained model on the validation sets used during training:

import numpy as np
from tqdm import tqdm
from darts.metrics import mae

mae_list = []
points = 0
loaded_model = TiDEModel.load_from_checkpoint(
    model_name="Tide_best", best=True, log_tensorboard=False
)

for i, val_ts in enumerate(tqdm(val)):
    # provide all the covariates, including the ones from the training set
    cov = train_cov[i].append(val_cov[i])
    len_pred = len(val_ts)

    predictions = loaded_model.predict(
        #using the training set as past values and predicting the validation set
        n=len_pred, series=(train[i]), future_covariates=cov, verbose=False
    )

    mean_abs_err = mae(val_ts, predictions, intersect=True) * (len_pred)
    points = points + len_pred

    mae_list.append(mean_abs_err)

    # some random timeseries to visualize
    if i == 180:
        # print(predictions.values())
        predictions.plot(label="Predictions")
        val_ts.plot(label="Actual")

print(f"Mean Absolute Error : {np.sum(mae_list)/points}")

I get MAE = 0.1881085145800395, which is way off the values obtained during training. Here is an example of a prediction made by the model (the same series from the example shown above):

[figure: predicted vs. actual values for the example series]

I've been at this for some time and I still can't figure out what's going wrong. Can someone explain to me what I'm doing wrong here?

dennisbader commented 1 month ago

Hi @tniveej. Here is some general info about why you get different results on the validation set between training and your prediction loop:

Your model uses output_chunk_length=1, meaning the model is trained to predict only 1 step in one forward pass. This means that when evaluating during training, the samples are made from all possible input (input_chunk_length=10) / output (output_chunk_length=1) windows, so the model is only evaluated on these 1-step forecasts. This would be equivalent to performing historical forecasts on the validation set with forecast_horizon=1 and stride=1. In your example you call predict() with n=len(val_ts), which is far larger than 1. The model then performs auto-regression, consuming its own predictions and future values of the covariates to generate new predictions further into the future. So the performance is expected to be worse, since your model hasn't been trained to do this.
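If you want a number that is directly comparable to the validation metric from training, you can run historical forecasts on the validation series with forecast_horizon=1 instead of a single long predict() call. Roughly (a sketch adapted to your variable names; cov_list here stands for covariates covering the train + validation range, which is an assumption on my side):

import numpy as np
from darts.metrics import mae

# 1-step-ahead forecasts over the validation series, same setting as during training
hist_fc = loaded_model.historical_forecasts(
    series=val,                  # list of validation series
    future_covariates=cov_list,  # covariates covering the train + validation range (assumption)
    forecast_horizon=1,          # matches output_chunk_length
    stride=1,
    retrain=False,
    last_points_only=True,
)
print(np.mean(mae(val, hist_fc)))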

Also some other tips:

tniveej commented 1 month ago

Hello @dennisbader . Thank you very much for the insight and tips.

From my data, I am trying to get the model to predict a single time series (I have many of them) in its entirety from a cold start (essentially with input_chunk_length = 0), using only the provided covariate data. So basically, transfer learning from the training data onto new time series. However, I realize this is not possible with the torch forecasting models in the darts library (p.s. check the edit below). Therefore, the idea was to first train a model with a short enough input_chunk_length and all the features (covariates). Then, at prediction time, I would provide an average value for the first input_chunk_length steps (e.g. the average over the training portion of the series to be predicted) and have the model auto-regressively predict the rest, in the hope that the information from the covariates would be enough to correct the model's predictions.
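Roughly what I have in mind for prediction time is something like the sketch below (train_mean and new_cov are placeholders for the per-series average and the known covariates of the new series):

import numpy as np
from darts import TimeSeries

def cold_start_predict(model, new_cov, train_mean, input_len=10):
    # seed the lookback window with a constant value (e.g. the training average),
    # then let the model auto-regress over the rest of the series using only the
    # known future covariates
    warmup = TimeSeries.from_times_and_values(
        times=new_cov.time_index[:input_len],
        values=np.full((input_len, 1), train_mean),
    )
    return model.predict(
        n=len(new_cov) - input_len,
        series=warmup,
        future_covariates=new_cov,
    )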

I realize that my model is overfitting very quickly, so I've tried a few things on top of your suggestions:

  1. I tried evaluating the validation set every 100 training steps with the early stopping adjusted accordingly (see the sketch after this list). The model still does not seem to learn much: the loss drops after a couple of steps and then stops improving, and the validation loss just fluctuates around the same value.
  2. I also tried playing around with input_chunk_length and output_chunk_length, to no avail.
  3. I then tried increasing the dropout and weight decay to further increase regularization, but got the same results.
  4. I also reduced the model complexity in the hope that it wouldn't overfit so quickly: still the same quick overfitting, only now with worse metrics.
  5. I also tried pruning the features input as future_covariates from 44 down to the 12 I felt were most important to the prediction (also based on SHAP values obtained from training a basic MLP on the same dataset). Same result.
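For point 1, the change was just in the trainer arguments, roughly like this (a sketch; val_check_interval is the PyTorch Lightning setting I used to validate every 100 training batches):

# trainer arguments with more frequent validation (sketch)
pl_trainer_kwargs = {
    "max_epochs": 200,
    "accelerator": device,
    "val_check_interval": 100,  # run validation every 100 training batches
    "callbacks": [EarlyStopping(**early_stopping_args)],
    "gradient_clip_val": 1,
}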

Do you think there is anything more I could be doing to improve the training? Or would this indicate that the covariate data just does not contain enough information to predict the target variable? Is my goal a bit too ambitious and unachievable with current methods?

Edit: I just found out from your comment in #2473 that covariate-only prediction is possible with RegressionModels. I will be trying that out next.
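For reference, my understanding of the covariate-only setup with a RegressionModel (not tested yet; the choice of regressor here is just an example):

from darts.models import RegressionModel
from sklearn.linear_model import LinearRegression

# lags=None: no lagged target values are used as input,
# only the future covariate values at the forecasted time step (lag 0)
covariate_only_model = RegressionModel(
    lags=None,
    lags_future_covariates=[0],
    model=LinearRegression(),
)
covariate_only_model.fit(series=train, future_covariates=train_cov)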