Eval: missing real generation_power values

JasonFengGit commented 7 months ago

Describe the bug

In evaluation, some of the real/expected values of generation_power are missing.

To Reproduce

Steps to reproduce the behavior:

Run python scripts/run_evaluation.py with the following testset.csv (a small test to illustrate the bug):
```
pv_id,timestamp
9531,2021-05-08 10:00:00
```

Some values are missing from the generation_power column of results.csv. Example results.csv:

,forecast_power,horizon_hour,pv_id,timestamp,generation_power
0,0.5382338261787198,0,9531,2021-05-08 10:00:00,
1,0.6805504837540712,1,9531,2021-05-08 11:00:00,
2,0.6950511506600507,2,9531,2021-05-08 12:00:00,
3,0.7507192765284325,3,9531,2021-05-08 13:00:00,
4,0.6222327619232007,4,9531,2021-05-08 14:00:00,
5,0.46010747864610435,5,9531,2021-05-08 15:00:00,
6,0.2792985706278065,6,9531,2021-05-08 16:00:00,
7,0.11883538094408863,7,9531,2021-05-08 17:00:00,0.19273080444335938
8,0.03377143967258781,8,9531,2021-05-08 18:00:00,0.05239992141723633
9,0.004003063439732276,9,9531,2021-05-08 19:00:00,0.0
10,0.0,10,9531,2021-05-08 20:00:00,0.0
11,0.0,11,9531,2021-05-08 21:00:00,0.0
12,0.0,12,9531,2021-05-08 22:00:00,0.0
13,0.0,13,9531,2021-05-08 23:00:00,0.0
14,0.0,14,9531,2021-05-09 00:00:00,0.0
15,0.0,15,9531,2021-05-09 01:00:00,0.0
16,0.0,16,9531,2021-05-09 02:00:00,0.0
17,0.0,17,9531,2021-05-09 03:00:00,0.0
18,0.0006960749166189652,18,9531,2021-05-09 04:00:00,0.0
19,0.021830932182701164,19,9531,2021-05-09 05:00:00,0.002466707944869995
20,0.04920016630787139,20,9531,2021-05-09 06:00:00,0.12896760559082032
21,0.16425460389406232,21,9531,2021-05-09 07:00:00,0.22877279663085937
22,0.2536578989915163,22,9531,2021-05-09 08:00:00,0.8414171752929688
23,0.3202140667660062,23,9531,2021-05-09 09:00:00,0.6911544189453125
24,0.6471341332970747,24,9531,2021-05-09 10:00:00,0.8355504150390625
25,0.7728203006501675,25,9531,2021-05-09 11:00:00,1.15409765625
26,0.6856276972650501,26,9531,2021-05-09 12:00:00,0.6737999877929688
27,0.7735971877911895,27,9531,2021-05-09 13:00:00,1.11731640625
28,0.6681219518935074,28,9531,2021-05-09 14:00:00,0.20179200744628906
29,0.49810158614186933,29,9531,2021-05-09 15:00:00,0.45828359985351563
30,0.3536980181332593,30,9531,2021-05-09 16:00:00,0.35039999389648435
31,0.19379396872601617,31,9531,2021-05-09 17:00:00,0.2593247985839844
32,0.05294271353381089,32,9531,2021-05-09 18:00:00,0.17835600280761718
33,0.00577927292344424,33,9531,2021-05-09 19:00:00,0.07551947784423828
34,0.0,34,9531,2021-05-09 20:00:00,3.235164058423834e-09
35,0.0,35,9531,2021-05-09 21:00:00,0.0
36,0.0,36,9531,2021-05-09 22:00:00,0.0
37,0.0,37,9531,2021-05-09 23:00:00,0.0
38,0.0,38,9531,2021-05-10 00:00:00,0.0
39,0.0,39,9531,2021-05-10 01:00:00,0.0
40,0.0,40,9531,2021-05-10 02:00:00,0.0
41,0.0,41,9531,2021-05-10 03:00:00,0.0
42,0.0016835594981394644,42,9531,2021-05-10 04:00:00,0.0
43,0.04807132423975142,43,9531,2021-05-10 05:00:00,0.01917263984680176
44,0.2019059924841576,44,9531,2021-05-10 06:00:00,0.20261639404296874
45,0.4591377241020738,45,9531,2021-05-10 07:00:00,0.33280679321289064
46,0.7547477658079034,46,9531,2021-05-10 08:00:00,0.34174200439453123
47,1.068172900817906,47,9531,2021-05-10 09:00:00,0.9841751708984375

Expected behavior

No missing values (or maybe some fallbacks to handle missing values).
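One possible fallback, sketched here only as an illustration: score the rows that do have a real value and report how many were skipped, so the gap stays visible rather than silently shrinking the sample. The function name `mae_ignoring_missing` is hypothetical and not part of `scripts/run_evaluation.py`; the column names are taken from the results.csv excerpt above.

```python
import csv
import io


def mae_ignoring_missing(results_csv_text):
    """Return (mae, n_used, n_missing) for a results.csv-style string."""
    abs_errors = []
    n_missing = 0
    for row in csv.DictReader(io.StringIO(results_csv_text)):
        actual = (row.get("generation_power") or "").strip()
        if actual == "":
            # No real value for this horizon: count it instead of scoring it.
            n_missing += 1
            continue
        abs_errors.append(abs(float(row["forecast_power"]) - float(actual)))
    mae = sum(abs_errors) / len(abs_errors) if abs_errors else float("nan")
    return mae, len(abs_errors), n_missing


# Three rows copied from the results.csv excerpt above (horizons 6 to 8).
sample = """,forecast_power,horizon_hour,pv_id,timestamp,generation_power
6,0.2792985706278065,6,9531,2021-05-08 16:00:00,
7,0.11883538094408863,7,9531,2021-05-08 17:00:00,0.19273080444335938
8,0.03377143967258781,8,9531,2021-05-08 18:00:00,0.05239992141723633
"""
mae, n_used, n_missing = mae_ignoring_missing(sample)
print(f"MAE over {n_used} rows: {mae:.4f} ({n_missing} rows had no generation_power)")
```

Reporting `n_missing` alongside the metric makes it easier to judge how much of the test set a given score actually covers.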

peterdudfield commented 7 months ago

Thanks @JasonFengGit for this

We'll have to think about how to create a new test dataset that doesn't have any missing generation values.

JasonFengGit commented 7 months ago

We could filter out timestamps with missing values, but that would introduce some biases that are hard to analyze.
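A rough sketch of what that filtering could look like, assuming a test row is kept only when every hourly step of its forecast window has a real value (the results above cover horizons 0 to 47, hence the default of 48). The helper names and the `available` set of `(pv_id, timestamp)` pairs are hypothetical, not something the repository provides.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"


def has_full_coverage(pv_id, start, available, horizon_hours=48):
    """True if a generation value exists for every hourly step of the window."""
    t0 = datetime.strptime(start, FMT)
    return all(
        (pv_id, (t0 + timedelta(hours=h)).strftime(FMT)) in available
        for h in range(horizon_hours)
    )


def filter_testset(testset, available, horizon_hours=48):
    """Keep only (pv_id, timestamp) rows whose whole window has real values."""
    return [
        (pv_id, ts)
        for pv_id, ts in testset
        if has_full_coverage(pv_id, ts, available, horizon_hours)
    ]


# Toy data: hypothetical availability for three consecutive hours only.
available = {
    (9531, "2021-05-08 17:00:00"),
    (9531, "2021-05-08 18:00:00"),
    (9531, "2021-05-08 19:00:00"),
}
testset = [(9531, "2021-05-08 10:00:00"), (9531, "2021-05-08 17:00:00")]
kept = filter_testset(testset, available, horizon_hours=3)
print(kept)
```

With a 3-hour window, the 10:00 row is dropped because nothing is available at 10:00, while the 17:00 row is kept.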

peterdudfield commented 7 months ago

> We could filter out timestamps with missing values, but that would introduce some hard to explain bias.

I think we could filter out the missing ones and introduce new ones. As long as we then do some analysis on the new test set and check that it's not biased, it should be OK.

Which biases were you thinking about?

JasonFengGit commented 7 months ago

For example, the missing values might be due to similar reasons and could share some patterns that are either easier or harder to predict, thereby making the evaluation biased.

peterdudfield commented 7 months ago

> For example, the missing values might be due to similar reasons and could share some patterns that are either easier or harder to predict, thereby making the evaluation biased.

Ah, I see. From what I've seen, they are normally quite random, as they are all random PV panels throughout the UK. But we can check this.
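One simple way to check this, as a sketch only: compute the fraction of missing values per hour of day. A roughly flat rate would be consistent with random gaps, while a spike at particular hours would hint at the kind of shared pattern mentioned above. The function name and the input shape (timestamp string plus value or `None`) are assumptions for illustration; the same grouping could be done by `pv_id` instead.

```python
from collections import defaultdict
from datetime import datetime


def missing_rate_by_hour(rows):
    """rows: iterable of (timestamp_str, generation_power_or_None).

    Returns {hour_of_day: fraction of rows with no generation value}.
    """
    total = defaultdict(int)
    missing = defaultdict(int)
    for ts, value in rows:
        hour = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").hour
        total[hour] += 1
        if value is None:
            missing[hour] += 1
    return {hour: missing[hour] / total[hour] for hour in sorted(total)}


# Four rows taken from the results.csv excerpt in this issue.
rows = [
    ("2021-05-08 16:00:00", None),
    ("2021-05-09 16:00:00", 0.35039999389648435),
    ("2021-05-08 17:00:00", 0.19273080444335938),
    ("2021-05-09 17:00:00", 0.2593247985839844),
]
rates = missing_rate_by_hour(rows)
print(rates)
```

Four rows are far too few to conclude anything; the point is only the shape of the check, which would need to run over the full test set.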

JasonFengGit commented 7 months ago

Oh OK! That would make it easier.

zakwatts commented 6 months ago

@JasonFengGit Nice spot! Thanks for this.

openclimatefix / open-source-quartz-solar-forecast