raspstephan / nwp-downscale


Define evaluation metrics and compute baseline scores #11

Open raspstephan opened 3 years ago

raspstephan commented 3 years ago

Steps:

We need to define how we want to evaluate our forecasts. That means clearly defining the metrics as well as the region and timeframe to be used.

@HirtM we already talked about the train/valid split and decided that using one week per month for validation sounds like a good choice. So we can just take the first 7 days of each month. We still need to define an area to be evaluated. For this, let's have a look at the radar quality map and choose a region. And we should do the evaluation for a range of lead times, maybe up to 48h?

In terms of metrics, there are a ton of different options but we should restrict it to a few in order to keep things simple. We will evaluate deterministic as well as ensemble forecasts. Additionally, we want to look at statistics that describe the "realism" of our forecasts. Here are some suggestions from my side (probably too many...):

Deterministic

Probabilistic

Realism

Finally, we should have some baselines. The easiest one is simply a bilinear interpolation of the TIGGE forecast.
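A minimal sketch of that baseline, assuming the TIGGE forecast and the radar data are xarray DataArrays on regular lat/lon grids (file names and variable names below are placeholders, not actual paths from the repo):

```python
import xarray as xr

# Placeholder file names for illustration only
tigge = xr.open_dataarray("tigge_precip.nc")   # coarse-resolution NWP forecast
radar = xr.open_dataarray("radar_precip.nc")   # high-resolution radar grid

# Bilinear interpolation of the coarse forecast onto the radar grid.
# On a regular lat/lon grid, linear interpolation along both coordinates
# is equivalent to bilinear interpolation.
baseline = tigge.interp(lat=radar.lat, lon=radar.lon, method="linear")
```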

HirtM commented 3 years ago

That sounds like a good list! I was thinking of quite similar scores. I would vote for simple cell size distributions since RDF or any other cell size methods are not trivial to interpret.

48 h lead time for evaluation sounds good. Either we cut out 2 days after each 7-day evaluation period from the training data, or we reduce the lead time for the last two of the 7 days, so that training and evaluation strictly contain different situations. But maybe we need to cut ~1 day anyway before and after each evaluation period because of correlated weather situations?

Regarding the baseline, we want to have both a CRM and another simple downscaling method?

raspstephan commented 3 years ago

I will try to organize the CRM data.

With regards to the overlap between train/valid, my intuition would be to ignore this for starters. But should this ever end up in a paper, we should do it properly. Hopefully we will have enough data to take an entire year for validation.

BTW, here is the xskillscore package, which is quite nice, especially for ensemble metrics: https://xskillscore.readthedocs.io/en/stable/index.html
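For example, assuming observations `obs` with dims (time, lat, lon) and an ensemble forecast `fcst` with dims (member, time, lat, lon) as xarray DataArrays (dimension names are assumptions), the deterministic and probabilistic scores could look roughly like this:

```python
import xskillscore as xs

# obs:  observations, dims (time, lat, lon)
# fcst: ensemble forecast, dims (member, time, lat, lon)

# Deterministic: RMSE of the ensemble mean, aggregated over the spatial dims
rmse = xs.rmse(obs, fcst.mean("member"), dim=["lat", "lon"])

# Probabilistic: CRPS over the ensemble members, aggregated over the spatial dims
crps = xs.crps_ensemble(obs, fcst, member_dim="member", dim=["lat", "lon"])
```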

HirtM commented 3 years ago

As the evaluation area, for a start we can use the whole domain wherever the radar quality is good enough. Using a threshold of -1 on the radar quality (rq) field, we get the following criterion (rq at top, selected area at bottom):

[image: radar quality map (top) and selected evaluation area (bottom)]
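A rough sketch of how such a mask could be applied before scoring, assuming the radar quality field `rq`, the radar observations and the interpolated forecast are DataArrays on the same grid (names are placeholders):

```python
# rq: radar quality field on the radar grid (placeholder name)
mask = rq > -1                          # keep only points with sufficient radar quality

radar_masked = radar.where(mask)        # masked points become NaN
forecast_masked = baseline.where(mask)  # and can be skipped when computing the metrics
```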

HirtM commented 3 years ago

Steps:

raspstephan commented 3 years ago

Great, thanks. I moved the to-do list up to the top so that it shows up in the project.

HirtM commented 3 years ago

FSS is added. I have a few questions about the code structure (using classes etc.) and how to call the whole evaluation process. Let's discuss that on Thursday.
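For reference, a minimal numpy/scipy sketch of the FSS definition (neighborhood fractions via a moving-window mean); this is just to pin down the formula, not the implementation in the repo:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def fss(fcst, obs, threshold, window):
    """Fractions skill score for a single 2D field (reference sketch).

    fcst, obs : 2D numpy arrays of precipitation
    threshold : precipitation threshold (e.g. 1 mm/h)
    window    : neighborhood size in grid points
    """
    # Binary exceedance fields
    f_bin = (fcst >= threshold).astype(float)
    o_bin = (obs >= threshold).astype(float)

    # Neighborhood fractions via a moving-window mean
    f_frac = uniform_filter(f_bin, size=window)
    o_frac = uniform_filter(o_bin, size=window)

    mse = np.mean((f_frac - o_frac) ** 2)
    mse_ref = np.mean(f_frac ** 2) + np.mean(o_frac ** 2)
    return 1 - mse / mse_ref if mse_ref > 0 else np.nan
```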

HirtM commented 3 years ago

Regarding the F1-score, I only implemented the binary version; again, different thresholds are possible. In principle, it would be possible to compute the F1-score with multiple categories, not just two, but I am not sure this is what we want.
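For the record, a minimal sketch of the binary version (thresholding precip, then the standard F1 from hits, false alarms and misses); again a reference sketch, not the repo's implementation:

```python
import numpy as np

def f1_binary(fcst, obs, threshold):
    """Binary F1-score for precipitation exceeding a threshold (sketch only)."""
    f = fcst >= threshold
    o = obs >= threshold
    tp = np.sum(f & o)    # hits
    fp = np.sum(f & ~o)   # false alarms
    fn = np.sum(~f & o)   # misses
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else np.nan
```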

HirtM commented 3 years ago

First comparisons with the 001-Upscale_valid Generator prediction showed improvements over our baseline in RMSE, FSS and F1-score (although the precip fields look a bit blurred). A proper implementation of the evaluation routine (e.g. latitude selection, ...) is still required.

[image: metric comparison of the Generator prediction against the baseline]

raspstephan commented 3 years ago

Some thoughts on the train/valid/test split now that we have 3 years of data (2018-2020).

At first I thought we could just use 2018/19 for training and 2020 for validation. But I think that manual overfitting could become an issue, especially if we use things like early stopping, and therefore it might be better to have a third dataset for testing. So my current solution is to use the first 6 days of each month in 2018/19 for validation during model training, and then only use 2020 for the external validation you have done.

The downside is that we lose 1/5 of our training data, but I think this is the more proper approach. This leaves us with 40k training samples, which is a lot, but of course they are quite unevenly distributed.
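A sketch of that split, assuming the data lives in an xarray Dataset `ds` with a `time` coordinate covering 2018-2020 (the name and the loading code are placeholders):

```python
time = ds.time
is_2020 = time.dt.year == 2020
first_week = time.dt.day <= 6                      # first 6 days of each month

test = ds.sel(time=time[is_2020])                  # 2020: external test set
valid = ds.sel(time=time[~is_2020 & first_week])   # 2018/19, days 1-6: validation
train = ds.sel(time=time[~is_2020 & ~first_week])  # 2018/19, remaining days: training
```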