openclimatefix / nowcasting_dataset

Prepare batches of data for training machine learning solar electricity nowcasting models
https://nowcasting-dataset.readthedocs.io/en/stable/
MIT License

Which Capacity should we use? #692

Open peterdudfield opened 2 years ago

peterdudfield commented 2 years ago

Which capacity of the PV system should we use? It is useful to normalise the PV data ready for ML, and using the capacity makes sense. Importantly, training and prediction must use the same value. Hence #691.

The metadata for pvoutput.org and Passiv provides capacity values. Note that the PV capacity can degrade over time, so a static value might not be so good.

1. Use the maximum values of the training set

Pros:

Cons:

2. Use metadata

Pros:

Cons:

3. Hybrid 1

4. Hybrid 2

peterdudfield commented 2 years ago

@JackKelly and @jacobbieker, I would be interested to hear your thoughts.

jacobbieker commented 2 years ago

I kind of like Hybrid 2 as the option, so that we try to trust the labelled values, but if the data is wrong, we correct for it.

JackKelly commented 2 years ago

If I remember correctly, @dantravers has thought about this issue! I'd be keen to hear his thoughts!

I don't trust the metadata very much :slightly_smiling_face:. Not least because we don't know if the metadata "PV capacity" is the DC capacity; or the AC capacity; or the planned capacity; or the capacity of the grid connection :slightly_smiling_face:.

So I'd lean towards using a "robust" statistical method for inferring the max from the entire timeseries (not just the training set, but the entire dataset), e.g. take the 99th percentile (i.e. ignore outliers) as the "max", and then clip the power data at that "max" (to guarantee that the normalised value never exceeds 1).
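A minimal sketch of that approach, assuming the power timeseries for a single PV system is a pandas Series (the function name and the 0.99 default are illustrative, not anything already in nowcasting_dataset):

```python
import pandas as pd


def normalise_pv_power(pv_power: pd.Series, quantile: float = 0.99) -> pd.Series:
    """Normalise one PV system's power timeseries to [0, 1].

    The "capacity" is taken as the `quantile` (e.g. 99th percentile) of the
    whole timeseries, which ignores outlier spikes.
    """
    capacity = pv_power.quantile(quantile)
    # Clip first, then divide, so blips above the inferred capacity map to 1.
    return pv_power.clip(upper=capacity) / capacity
```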

On the topic of degradation over time: one way to address this is to give the ML model the age of the PV system (if we have it), rather than try to manually correct for the degradation by re-computing the PV capacity every, say, year. (One edge-case where we might actually want to re-compute the "max" is large PV farms, where they might install, say, 50% of the farm one month, and the other 50% another month.)

As a separate issue, I think we also need to come up with a set of algorithms for identifying "dud" PV data (e.g. generating at night. Or generating very little on a sunny day). I'm pretty sure that "bad" PV data is hurting us in multiple ways!
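For example, one very rough check for the "generating at night" case might look like the sketch below (fixed clock hours stand in for a proper sun-elevation calculation, e.g. via pvlib; the name and threshold are just illustrative):

```python
import pandas as pd


def has_night_generation(pv_power: pd.Series, threshold_fraction: float = 0.01) -> bool:
    """Flag a PV timeseries that reports non-trivial generation at night.

    This uses fixed clock hours (23:00-03:00) as a stand-in for "night";
    a real check would compute sun elevation for the system's location.
    """
    night = pv_power[pv_power.index.hour.isin([23, 0, 1, 2, 3])]
    return bool((night > threshold_fraction * pv_power.max()).any())
```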

peterdudfield commented 2 years ago

Happy to come up with a statistical method. Just need to make sure it is robust for when we are collecting more data for prediction: we don't want one newly collected value to force us to retrain the entire model. But perhaps we can clean up the re-training so it's more like an iterative process, so maybe we will be OK.

I like the 99th percentile; that gets rid of any blips in the data.

dantravers commented 2 years ago

You're right, @JackKelly - I've thought about this a decent amount!
I was using capacity for slightly different purposes - more on the post-analysis of results - so I was more interested in preserving the actual installed capacity (to get true yield values). In the ML forecast model, if you don't mind what the normalisation is, then the true capacity matters less (as long as your range is [0, 1]).

But roughly: I found there were a number of systems which had crazy high individual half hours, with values way above what was physically possible. To remediate this I looked at the right-side tail for outliers by taking a high percentile (Y% - I used about 99.9%) and then checking whether the max outturn value was more than x% (I used 20%, but up for discussion) higher than that percentile value.
Case 1: If this wasn't the case, then I was happy to believe the right-side tail was valid. In this case we could normalize by the max outturn.
Case 2: If this was the case, then I threw away the erroneous values. In our situation, we could cap at the Y%.
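A rough sketch of that heuristic (parameter names and defaults are illustrative; Y% = 99.9th percentile, x% = 20%):

```python
import pandas as pd


def infer_capacity(pv_power: pd.Series,
                   y_quantile: float = 0.999,
                   tolerance: float = 0.20) -> float:
    """Infer a normalisation capacity for one PV system.

    Case 1: if the max outturn is within `tolerance` of the `y_quantile`
    value, trust the right-side tail and use the max outturn.
    Case 2: otherwise treat the tail as erroneous and cap at the percentile.
    """
    high = pv_power.quantile(y_quantile)
    max_outturn = pv_power.max()
    if max_outturn <= (1 + tolerance) * high:
        return max_outturn  # Case 1: right-side tail looks valid
    return high  # Case 2: outliers above the percentile get capped
```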

One option would be to do Case 2 all the time and cap at Y%, which I think is OK, but if you look at the distribution of outturn values, very often there is interesting stuff happening at the top of the range, which I would be loath to lose unless we had to.

Note: I'm not sure if this will have any bearing on the ML, but the actual installed capacity values will be quite different from what we are calculating here - and it will depend on the orientation of the systems. I.e. a north-facing system will have a higher installed capacity than a south-facing one for the same observed generation figures.

peterdudfield commented 2 years ago

Perhaps a good way to do this would be to have two fields:

- the nameplate capacity (from the metadata)
- a 'calculated_installed_capacity' (from the statistical method above)

Then we store both and use the 'calculated_installed_capacity' for ML tasks.
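A hypothetical sketch of what those two columns could look like in a database table (SQLAlchemy used purely for illustration; the table and column names are not the actual nowcasting_dataset schema):

```python
from sqlalchemy import Column, Float, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class PVSystem(Base):
    """Illustrative PV system table storing both capacity values."""

    __tablename__ = "pv_system"

    id = Column(Integer, primary_key=True)
    provider = Column(String)  # e.g. "pvoutput.org" or "Passiv"
    nameplate_capacity_kw = Column(Float)  # capacity as reported in the metadata
    calculated_installed_capacity_kw = Column(Float)  # e.g. 99th percentile of observed power
```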

dantravers commented 2 years ago

Yes, we should definitely store the nameplate capacity, and we may want to add multiple other fields as they are cheap to store (max_outturn, %_outturn, calculated_capacity, etc) so we can understand and maybe use the intermediate values.


peterdudfield commented 2 years ago

I would be tempted to keep the database clean, so we store the nameplate capacity and our ML capacity. The other things we can get by analysing the data.

I realise it was not obvious from above that I was talking about the database - sorry about that.