openclimatefix / PVNet

PVnet main repo
MIT License
17 stars 4 forks source link

Validation step of PVNet training - error in RMSE/MAE loss calculation #104

Closed confusedmatrix closed 9 months ago

confusedmatrix commented 9 months ago

When running PVNet training on premade data batches, I came across an issue related to calculating the validation RMSE/MAE losses in pvnet/models/base_model.py : BaseModel._calculate_val_losses method.

https://github.com/openclimatefix/PVNet/blob/adeb778bad2ceb887a2664c7f3c6f8c63636cc5d/pvnet/models/base_model.py#L373C9-L378C92

Error trace:

  File "/Users/chris/opt/miniconda3/envs/pvnet/lib/python3.9/site-packages/lightning/pytorch/strategies/strategy.py", line 403, in validation_step
    return self.lightning_module.validation_step(*args, **kwargs)
  File "/Users/chris/Dev/ocf/PVNet/pvnet/models/base_model.py", line 444, in validation_step
    losses.update(self._calculate_val_losses(y, y_hat))
  File "/Users/chris/Dev/ocf/PVNet/pvnet/models/base_model.py", line 373, in _calculate_val_losses
    common_metrics_each_step = common_metrics(predictions=y_hat, target=y)
  File "/Users/chris/opt/miniconda3/envs/pvnet/lib/python3.9/site-packages/ocf_ml_metrics/metrics/errors.py", line 28, in common_metrics
    error_dict[tag + "mae"] = np.mean(np.abs(predictions - target))
  File "/Users/chris/opt/miniconda3/envs/pvnet/lib/python3.9/site-packages/numpy/core/fromnumeric.py", line 3502, in mean
    return mean(axis=axis, dtype=dtype, out=out, **kwargs)
TypeError: mean() received an invalid combination of arguments - got (out=NoneType, dtype=NoneType, axis=NoneType, ), but expected one of:
 * (*, torch.dtype dtype)
 * (tuple of ints dim, bool keepdim, *, torch.dtype dtype)
 * (tuple of names dim, bool keepdim, *, torch.dtype dtype)

common_metrics is expecting np.ndarrays but we're passing torch tensors. Recasting the tensors using .numpy() works to avoid this error, but produces a single float result for each metric type (RMSE & MAE). We should however. be expecting a 1-day array (each value representing the metric for a specific forecast horizon). So the solution to this would be to specify the axis in the mean calculation in common_metrics to calculate the mean over the batch independently for each forecast horizon.

Additionally, there was a typo in the call to the ocf-ml-metrics library to call common_metrics which I'll create a PR for in a moment.