raspstephan / nwp-downscale


Define evaluation metrics and compute baseline scores #11

Open raspstephan opened 3 years ago

raspstephan commented 3 years ago

Steps:

We need to define how we want to evaluate our forecasts. That means clearly defining the metrics as well as the region and timeframe to be used.

@HirtM we already talked about the train/valid split and decided that using one week per month for validation sounds like a good choice. So we can just take the first 7 days of each month. We still need to define an area to be evaluated. For this, let's have a look at the radar quality map and choose a region. And we should do the evaluation for a range of lead times, maybe up to 48h?

In terms of metrics, there are a ton of different options but we should restrict it to a few in order to keep things simple. We will evaluate deterministic as well as ensemble forecasts. Additionally, we want to look at statistics that describe the "realism" of our forecasts. Here are some suggestions from my side (probably too many...):

Deterministic

Probabilistic

Realism

Finally, we should have some baselines. The easiest one is simply a bilinear interpolation of the TIGGE forecast.
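A minimal sketch of that baseline, assuming the TIGGE forecast and the radar data are xarray DataArrays on regular lat/lon grids (file names and variable names below are placeholders, not actual paths from the repo):

```python
import xarray as xr

# Placeholder file names for illustration only
tigge = xr.open_dataarray("tigge_precip.nc")   # coarse-resolution NWP forecast
radar = xr.open_dataarray("radar_precip.nc")   # high-resolution radar grid

# Bilinear interpolation of the coarse forecast onto the radar grid.
# On a regular lat/lon grid, linear interpolation along both coordinates
# is equivalent to bilinear interpolation.
baseline = tigge.interp(lat=radar.lat, lon=radar.lon, method="linear")
```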

HirtM commented 3 years ago

That sounds like a good list! I was thinking of quite similar scores. I would vote for simple cell size distributions since RDF or any other cell size methods are not trivial to interpret.

48 h lead time for evaluation sounds good. Either we cut out 2 days after each 7-day evaluation period from the training data, or we reduce the lead time for the last two of the 7 days, so that training and evaluation strictly contain different situations. But maybe we need to cut ~1 day anyway before and after each evaluation period because of correlated weather situations?

Regarding the baseline, we want to have both a CRM and another simple downscaling method?

raspstephan commented 3 years ago

I will try to organize the CRM data.

With regards to the overlap between train/valid, my intuition would be to ignore this for starters. But should this ever end up in a paper, we should do it properly. Hopefully we will have enough data to take an entire year for validation.

BTW, here is the xskillscore package, which is quite nice, especially for ensemble metrics: https://xskillscore.readthedocs.io/en/stable/index.html
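For example, assuming observations `obs` with dims (time, lat, lon) and an ensemble forecast `fcst` with dims (member, time, lat, lon) as xarray DataArrays (dimension names are assumptions), the deterministic and probabilistic scores could look roughly like this:

```python
import xskillscore as xs

# obs:  observations, dims (time, lat, lon)
# fcst: ensemble forecast, dims (member, time, lat, lon)

# Deterministic: RMSE of the ensemble mean, aggregated over the spatial dims
rmse = xs.rmse(obs, fcst.mean("member"), dim=["lat", "lon"])

# Probabilistic: CRPS over the ensemble members, aggregated over the spatial dims
crps = xs.crps_ensemble(obs, fcst, member_dim="member", dim=["lat", "lon"])
```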

HirtM commented 3 years ago

As the evaluation area, for a start we can use the whole domain wherever the radar quality is good enough. Using a threshold of -1 on the radar quality (rq) field, we get the following criterion (rq at top, selected area at bottom):

[image: radar quality map (top) and selected evaluation area (bottom)]
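A rough sketch of how such a mask could be applied before scoring, assuming the radar quality field `rq`, the radar observations and the interpolated forecast are DataArrays on the same grid (names are placeholders):

```python
# rq: radar quality field on the radar grid (placeholder name)
mask = rq > -1                          # keep only points with sufficient radar quality

radar_masked = radar.where(mask)        # masked points become NaN
forecast_masked = baseline.where(mask)  # and can be skipped when computing the metrics
```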

HirtM commented 3 years ago

Steps:

raspstephan commented 3 years ago

Great, thanks. I moved the to-do list up to the top so that it shows up in the project.

HirtM commented 3 years ago

FSS is added. I have a few questions about the code structure (using classes etc.) and how to call the whole evaluation process. Let's discuss that on Thursday.
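For reference, a minimal numpy/scipy sketch of the FSS definition (neighborhood fractions via a moving-window mean); this is just to pin down the formula, not the implementation in the repo:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def fss(fcst, obs, threshold, window):
    """Fractions skill score for a single 2D field (reference sketch).

    fcst, obs : 2D numpy arrays of precipitation
    threshold : precipitation threshold (e.g. 1 mm/h)
    window    : neighborhood size in grid points
    """
    # Binary exceedance fields
    f_bin = (fcst >= threshold).astype(float)
    o_bin = (obs >= threshold).astype(float)

    # Neighborhood fractions via a moving-window mean
    f_frac = uniform_filter(f_bin, size=window)
    o_frac = uniform_filter(o_bin, size=window)

    mse = np.mean((f_frac - o_frac) ** 2)
    mse_ref = np.mean(f_frac ** 2) + np.mean(o_frac ** 2)
    return 1 - mse / mse_ref if mse_ref > 0 else np.nan
```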

HirtM commented 3 years ago

Regarding the F1-score, I only implemented the binary version; again, different thresholds are possible. In principle, it would be possible to compute the F1-score with multiple categories, not just two, but I am not sure this is what we want.
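For the record, a minimal sketch of the binary version (thresholding precip, then the standard F1 from hits, false alarms and misses); again a reference sketch, not the repo's implementation:

```python
import numpy as np

def f1_binary(fcst, obs, threshold):
    """Binary F1-score for precipitation exceeding a threshold (sketch only)."""
    f = fcst >= threshold
    o = obs >= threshold
    tp = np.sum(f & o)    # hits
    fp = np.sum(f & ~o)   # false alarms
    fn = np.sum(~f & o)   # misses
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else np.nan
```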

HirtM commented 3 years ago

First comparisons with the 001-Upscale_valid Generator prediction showed improvements over our baseline in RMSE, FSS and F1-score (although the precip fields look a bit blurred). A proper implementation of the evaluation routine (e.g. latitude selection, ...) is still required.

[image: metric comparison of the Generator prediction against the baseline]

raspstephan commented 3 years ago

Some thoughts on the train/valid/test split now that we have 3 years of data (2018-2020).

At first I thought we could just use 2018/19 for training and 2020 for validation. But I think that manual overfitting could become an issue, especially if we use things like early stopping, and therefore it might be better to have a third dataset for testing. So my current solution is to use the first 6 days of each month in 2018/19 for validation during model training, and then only use 2020 for the external validation you have done.

The downside is that we lose 1/5 of our training data, but I think this is the more proper approach. This leaves us with 40k training samples, which is a lot, but of course they are quite unevenly distributed.
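A sketch of that split, assuming the data lives in an xarray Dataset `ds` with a `time` coordinate covering 2018-2020 (the name and the loading code are placeholders):

```python
time = ds.time
is_2020 = time.dt.year == 2020
first_week = time.dt.day <= 6                      # first 6 days of each month

test = ds.sel(time=time[is_2020])                  # 2020: external test set
valid = ds.sel(time=time[~is_2020 & first_week])   # 2018/19, days 1-6: validation
train = ds.sel(time=time[~is_2020 & ~first_week])  # 2018/19, remaining days: training
```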