openclimatefix / pv-solar-farm-forecasting

Forecasting for individual solar farms
MIT License

Add Plots of Raw and Resampled Data #1

Open jacobbieker opened 1 year ago

jacobbieker commented 1 year ago

Make plots of the raw data, resampled data, and other individual PV sites to see if the solar farms act the same as our other PV datasets

Detailed Description

This could go into a one-pager to give a quick overview of the dataset

Possible Implementation

Just a few simple plots of

peterdudfield commented 1 year ago

@jacobbieker this looks super good and clear.

Bonus: the one page could be a quick Markdown document with the plots saved as images and with a bit of text around them

vrym2 commented 1 year ago

From the ECR register provided by @jacobbieker, a metadata file can be downloaded that contains nearly 1000 postcodes of different PV sites and their attributes. I assume the PV outputs of these sites can be extracted from pvoutput.org? @jacobbieker @peterdudfield

jacobbieker commented 1 year ago

No, the only output is the file I shared with you. These sites won't be on pvoutput.org; it's just data from that provider.

vrym2 commented 1 year ago

ukpowernetworks provides an API for all their datasets, including the one for which @jacobbieker sent me the link to the xlsx file. Here is the link to the dataset we need for the site @jacobbieker sent me.

Using the API link, I have written a simple script to check the status of the API response and to get the first record of the dataset.
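
As an illustration of the kind of check described above, here is a minimal sketch, not the script from the PR. The function name `fetch_first_record`, the `records_key` default of `"results"`, and the injectable `opener` are all assumptions made for this example; the real endpoint URL and JSON layout should be taken from the provider's API documentation.

```python
import json
from urllib.request import urlopen


def fetch_first_record(url, records_key="results", opener=urlopen):
    """Return (HTTP status, first record or None) for a JSON endpoint.

    `records_key` is an assumed name for the list of records in the payload.
    `opener` can be swapped for a stub so the logic can be tested offline.
    """
    with opener(url) as response:
        status = response.status
        payload = json.loads(response.read().decode("utf-8"))
    records = payload.get(records_key, [])
    return status, (records[0] if records else None)
```

Calling `fetch_first_record(url)` with the dataset's API link would return the status code together with the first record, or `None` if the list is empty.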

vrym2 commented 1 year ago

image

image

@jacobbieker Plots of the raw data and of the data resampled to 5-minute intervals, for the first 100 entries. The resampling takes the mean over each 5-minute interval.

I believe no data has been lost, as there is little difference between the two plots.
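
For reference, mean-based resampling to 5-minute bins can be sketched with pandas as follows. The timestamps and values are synthetic, and `resample_5min_mean` is a hypothetical helper name, not the function in the PR.

```python
import pandas as pd


def resample_5min_mean(series: pd.Series) -> pd.Series:
    """Average all readings that fall inside each 5-minute bin."""
    return series.sort_index().resample("5min").mean()


# Synthetic, irregularly spaced readings (not the real farm data).
raw = pd.Series(
    [1.0, 3.0, 5.0, 7.0],
    index=pd.to_datetime(
        [
            "2023-01-11 15:31:11",
            "2023-01-11 15:34:23",
            "2023-01-11 15:36:02",
            "2023-01-11 15:47:40",
        ]
    ),
)
resampled = resample_5min_mean(raw)
```

Bins that contain no readings come out as NaN, so the resampled series has a regular index but may still have gaps.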

jacobbieker commented 1 year ago

Good job! Is there a chance you can plot the errors? So plot the top one divided by the bottom one?
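
One possible way to compute such a ratio, sketched under the assumption that each raw reading is compared with the mean of the 5-minute bin it falls into (the thread does not specify how the two series are aligned). `raw_to_resampled_ratio` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def raw_to_resampled_ratio(raw: pd.Series, freq: str = "5min") -> pd.Series:
    """Divide each raw reading by the mean of the bin it belongs to."""
    raw = raw.sort_index()
    binned = raw.resample(freq).mean()
    # Look up, for every raw timestamp, the mean of the bin it falls into.
    bin_mean = binned.reindex(raw.index.floor(freq)).to_numpy()
    # A zero bin mean would give a division by zero, so mark it as NaN.
    bin_mean = np.where(bin_mean == 0, np.nan, bin_mean)
    return raw / bin_mean


# Synthetic readings (not the real farm data).
example = pd.Series(
    [1.0, 3.0, 5.0],
    index=pd.to_datetime(
        ["2023-01-11 15:31:11", "2023-01-11 15:34:23", "2023-01-11 15:36:02"]
    ),
)
ratio = raw_to_resampled_ratio(example)
```

A ratio close to 1 everywhere would indicate that the resampling changes the signal very little.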

vrym2 commented 1 year ago

Yes, we can do that, and as per your suggestion we can do it for each day in the entire time series. But bear in mind that even after resampling, the 5-minute intervals are not contiguous!

jacobbieker commented 1 year ago

Yes, so if the intervals are not contiguous, then I think we should fill the gaps by interpolating between the points that are not missing, so that the data is continuous.
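
A minimal sketch of that gap filling, assuming the series has a datetime index and that time-weighted linear interpolation is acceptable. `fill_gaps` is a hypothetical helper and the data is synthetic.

```python
import pandas as pd


def fill_gaps(series: pd.Series, freq: str = "5min") -> pd.Series:
    """Put the data on a regular grid, then interpolate over empty bins."""
    regular = series.sort_index().resample(freq).mean()
    # method="time" weights the interpolation by the actual time distance.
    return regular.interpolate(method="time")


# Synthetic readings with a missing 15:40 bin.
gappy = pd.Series(
    [5.0, 7.0],
    index=pd.to_datetime(["2023-01-11 15:36:02", "2023-01-11 15:47:40"]),
)
filled = fill_gaps(gappy)
```

Here the empty 15:40 bin is filled with the value halfway between its two neighbours.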

vrym2 commented 1 year ago

@jacobbieker I think the Bad Data column in this dataset is not continuous data but rather categorical data. Out of the first thousand entries I have checked, there are only 26 unique Bad Data values. What does that mean?

jacobbieker commented 1 year ago

I think it means only those 26 timesteps are bad, not that the data values are bad. This resampling should make it continuous overall.

vrym2 commented 1 year ago

No, no, I should have put it more clearly. What I meant was that within those first 1000 timesteps the Bad Data values are repeated. I simply checked df['bad_data'].unique() for those first 1000 timesteps and got 26 values, meaning the Bad_data values are not random.
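
The check described above can be sketched like this. The column name `bad_data` comes from the comment; the values are made up for illustration.

```python
import pandas as pd

# Made-up values standing in for the real column.
df = pd.DataFrame({"bad_data": [0.0, -1.5, -1.5, 0.0, -3.0, -1.5]})

first_rows = df["bad_data"].head(1000)
n_unique = first_rows.nunique()      # number of distinct values
counts = first_rows.value_counts()   # how often each value repeats
```

A small `n_unique` relative to the number of rows, together with large counts per value, is what suggests the values are discrete rather than continuous.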

peterdudfield commented 1 year ago

Is bad data something that is negative? Could you just remove that data? It would be good to get a count of them, i.e. what % of the data they make up. We might want to remove a few hours around the bad data to be sure.
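
A sketch of both suggestions, assuming for illustration that "bad" means negative and that a 2-hour margin on each side is wanted; neither is confirmed in the thread. The helper names are hypothetical and the data is synthetic.

```python
import numpy as np
import pandas as pd


def bad_data_percentage(is_bad: pd.Series) -> float:
    """Share of rows flagged as bad, in percent."""
    return 100.0 * float(is_bad.mean())


def drop_around_bad(series, is_bad, margin=pd.Timedelta(hours=2)):
    """Drop every row within `margin` of a row flagged as bad."""
    bad_times = series.index[is_bad.to_numpy()]
    keep = np.ones(len(series), dtype=bool)
    for t in bad_times:
        keep &= ~((series.index >= t - margin) & (series.index <= t + margin))
    return series[keep]


# Synthetic hourly series with one negative (assumed bad) value at 05:00.
values = [1.0] * 10
values[5] = -1.0
hourly = pd.Series(values, index=pd.date_range("2023-01-11", periods=10, freq="60min"))
is_bad = hourly < 0
pct_bad = bad_data_percentage(is_bad)
cleaned = drop_around_bad(hourly, is_bad)
```

With one flagged reading out of ten, `pct_bad` is 10 and the rows from 03:00 to 07:00 are dropped.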

vrym2 commented 1 year ago

@peterdudfield @jacobbieker Bad Data is one of the two columns in the csv file we got from Dan by email (ocf-email - link). The entire column consists of only negative values and zeros, but the negative values are repetitive, meaning they are discrete/categorical values, not random ones. It is possible to interpolate them in the missing time steps, but not by conventional statistical methods, for example with the mean or median.

jacobbieker commented 1 year ago

Okay, if that's the case, then I would first drop the data with the bad data label, and then interpolate between the remaining points to a 5-minutely timeseries. Not with mean or median, just an interpolation as a first pass.
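
A rough sketch of that first pass, assuming bad rows carry the literal string "Bad Data" in the value column and that timestamps are unique. `drop_bad_and_interpolate` is a hypothetical name, not the function in the PR.

```python
import pandas as pd


def drop_bad_and_interpolate(values, freq="5min", bad_label="Bad Data"):
    """Drop labelled rows, then interpolate the rest onto a regular grid."""
    # Remove rows carrying the bad label, then coerce the rest to numbers.
    kept = values[values.astype(str).str.strip() != bad_label]
    kept = pd.to_numeric(kept, errors="coerce").dropna().sort_index()
    # Regular grid lying fully inside the span of the remaining data.
    grid = pd.date_range(
        kept.index.min().ceil(freq), kept.index.max().floor(freq), freq=freq
    )
    # Interpolate on the union of raw and grid timestamps, keep only the grid.
    combined = kept.reindex(kept.index.union(grid)).interpolate(method="time")
    return combined.reindex(grid)


# Synthetic rows, one of them labelled as bad.
sample = pd.Series(
    [0.0, "Bad Data", -6.0, -6.0],
    index=pd.to_datetime(
        [
            "2023-01-11 15:31:00",
            "2023-01-11 15:33:00",
            "2023-01-11 15:37:00",
            "2023-01-11 15:41:00",
        ]
    ),
)
on_grid = drop_bad_and_interpolate(sample)
```

No mean or median is taken here: each grid value comes from linear interpolation in time between the two nearest remaining readings.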

vrym2 commented 1 year ago

Here are the plots of the original raw data and the resampled data. The resampling to 5-minute intervals is done by grouping the timesteps that fall within each 5-minute interval (e.g. 11/01/2023 15:31:11 and 11/01/2023 15:34:23) and taking their minimum value.

image

image

jacobbieker commented 1 year ago

Okay, could you do it without taking the minimum, just doing an interpolation of the values to 5 minutely values?

vrym2 commented 1 year ago

Interpolation into 5 minutely intervals has been done! Please check the PR

jacobbieker commented 1 year ago

Okay, I've left some more comments on there. If we choose not to resample, then in general the raw data shouldn't be touched, other than to remove the bad data values, as we want to keep the raw data as close as possible to what is coming out of the solar farm. For resampling, the comments are in the PR, but generally we shouldn't round values to the resample period we want, as that will mess up the interpolation.
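
To illustrate the point about rounding, here is a small synthetic comparison. It reads "round values to the resample period" as snapping timestamps onto the 5-minute grid before interpolating, which is an interpretation rather than something stated in the thread. The signal is a made-up ramp whose value equals the minutes elapsed since 15:30.

```python
import pandas as pd

start = pd.Timestamp("2023-01-11 15:30:00")
minutes = [1.5, 4.0, 8.5, 11.0]
# Synthetic ramp: the value equals the minutes elapsed since `start`.
ramp = pd.Series(minutes, index=start + pd.to_timedelta(minutes, unit="min"))

grid = pd.date_range("2023-01-11 15:35", "2023-01-11 15:40", freq="5min")

# Option 1: snap timestamps onto the grid first, which shifts readings in time.
rounded = ramp.copy()
rounded.index = ramp.index.round("5min")
rounded = rounded.groupby(level=0).mean().reindex(grid)

# Option 2: keep raw timestamps, interpolate in time, then sample the grid.
interpolated = (
    ramp.reindex(ramp.index.union(grid)).interpolate(method="time").reindex(grid)
)
```

On this ramp the true values at 15:35 and 15:40 are 5 and 10. Option 2 recovers them, while Option 1 gives 4.0 and 9.75 because the readings were moved to the wrong times.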

vrym2 commented 1 year ago

Please check the PR. I have written an interpolation function and tested it.

jacobbieker commented 1 year ago

Cool, nice job, could you share the output plots?

vrym2 commented 1 year ago

These are the plots from a randomly selected date.

image

image

peterdudfield commented 1 year ago

What's the y-axis on these? 'bad data' seems an odd label.

vrym2 commented 1 year ago

These are the negative data values we got from the client in this email. There are some entries marked as Bad Data alongside the negative values. Those negative values are supposed to represent a delta change, if I'm right!

peterdudfield commented 1 year ago

It would be great to get a y-axis with kW on it. Do you think that is possible? You might have to add those delta values up to make kW.
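
Adding the deltas up within each day could be sketched as below. Whether the running total is actually in kW is an open question in this thread, so this only shows the mechanics. `daily_cumulative` is a hypothetical name and the numbers are synthetic.

```python
import pandas as pd


def daily_cumulative(deltas: pd.Series) -> pd.Series:
    """Running sum of the delta values, restarted at every midnight."""
    deltas = deltas.sort_index()
    return deltas.groupby(deltas.index.normalize()).cumsum()


# Synthetic deltas over two days.
deltas = pd.Series(
    [-1.0, -2.0, -3.0, -4.0],
    index=pd.to_datetime(
        [
            "2023-01-11 10:00",
            "2023-01-11 11:00",
            "2023-01-12 10:00",
            "2023-01-12 11:00",
        ]
    ),
)
running = daily_cumulative(deltas)
```

Because every delta is zero or negative, a running total built this way can only decrease during the day.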

peterdudfield commented 1 year ago

I would have thought Bad_data is a flag of 0 or 1, depending on whether the data is good or not. People use this to mark bad data without deleting it.

vrym2 commented 1 year ago

Here are plots of the cumulative sum of these values over a single day, before and after interpolation. I asked Dan to ask them for more information on how to interpret the data they provided.

image

image

jacobbieker commented 1 year ago

It does seem that the values are only 0 or negative. We might need to ask how this delta becomes kW, as I would have assumed there would need to be some positive values in the data, and there aren't any. Unless the 0 values are the max capacity or something, and it's going down from there?

peterdudfield commented 1 year ago

Yea, perhaps we need to get on a call with them, something about this data is all over the place

peterdudfield commented 1 year ago

It's not just some weird days, is it? Are there normal days?

Do we have any site metadata to help us here? Like what's the capacity of the site?

vrym2 commented 1 year ago

Yes, UKPN provides metadata in their database. It is accessed through the API and is downloadable in JSON format. I have written a function that requests it through a URL; check this PR. We can also manually download the metadata from here.

jacobbieker commented 1 year ago

The capacity is 39MW or 41.053MVA, and we have the location. There aren't any normal days, in the sense that there are no positive values in the CSV we got.
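
As a side note on the two capacity figures: the ratio of real power (MW) to apparent power (MVA) is the power factor, so the quoted numbers imply a value of roughly 0.95. A quick check, using only the figures above:

```python
real_power_mw = 39.0         # capacity quoted above
apparent_power_mva = 41.053  # rating quoted above

# Power factor is real power divided by apparent power.
power_factor = real_power_mw / apparent_power_mva
print(f"implied power factor: {power_factor:.3f}")
```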