Open jacobbieker opened 1 year ago
@jacobbieker this looks super good and clear.
Bonus: the one page could be a quick Markdown document with the plots saved as images and with a bit of text around them
From the ECR register provided by @jacobbieker, a metadata file can be downloaded that contains nearly 1000 postcodes of different PV sites and their attributes. I assume the PV outputs of these sites can be extracted from pvoutput.org? @jacobbieker @peterdudfield
No, the only output is the file I shared with you. These sites won't be on pvoutput.org; it's just data from that provider.
ukpowernetworks provides an API for all their datasets, including the one for which @jacobbieker sent me the link to the xlsx file. Here is the link to the dataset we need for the site @jacobbieker sent me.
Using the API link, I have written a simple script to check the status of api_response and to get the first record of the dataset.
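A minimal sketch of that kind of status check, using only the Python standard library. The endpoint URL is left as a parameter, and both the function name first_record and the assumption that records sit under a "results" key are illustrative, not taken from the script in the PR.

```python
import json
from urllib.request import urlopen


def first_record(url, opener=urlopen, records_key="results"):
    """Return (HTTP status, first record or None) for a JSON API endpoint.

    `records_key` is an assumption about the response layout; adjust it
    to whatever the dataset API actually returns.
    """
    with opener(url) as resp:
        status = resp.status
        payload = json.loads(resp.read().decode("utf-8"))
    # Accept either a bare list of records or a dict wrapping that list.
    records = payload.get(records_key, []) if isinstance(payload, dict) else payload
    return status, (records[0] if records else None)


# Usage (needs network access and the real dataset URL):
# status, record = first_record("<dataset API url>")
```

The opener argument only exists so the function can be exercised without a network connection.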
@jacobbieker Here are plots of the raw data and the resampled 5-minute interval data for the first 100 entries. The resampling takes the mean of each 5-minute interval.
I believe no data has been lost, as there is little difference between the two plots.
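For reference, a dependency-free sketch of mean resampling into 5-minute bins. It assumes the data is a list of (timestamp, value) pairs; the name resample_mean is hypothetical and this is not the code from the PR.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def resample_mean(samples, minutes=5):
    """Average (timestamp, value) samples within fixed-width time bins.

    Returns a sorted list of (bin_start, mean_value); empty bins are omitted.
    """
    step = timedelta(minutes=minutes)
    bins = defaultdict(list)
    for ts, value in samples:
        # Floor each timestamp to the start of its bin.
        midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
        bin_start = midnight + ((ts - midnight) // step) * step
        bins[bin_start].append(value)
    return sorted((start, mean(vals)) for start, vals in bins.items())


raw = [
    (datetime(2023, 1, 11, 15, 31, 11), -2.0),
    (datetime(2023, 1, 11, 15, 34, 23), -4.0),
    (datetime(2023, 1, 11, 15, 36, 0), -1.0),
]
resampled = resample_mean(raw)
```

Because empty bins are omitted, the output is only contiguous where the raw data is.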
Good job! Is there a chance you can plot the errors? So plot the top one divided by the bottom one?
Yes, we can do that, and as per your suggestion we can do that for each day in the entire time series, but bear in mind that even after resampling, the 5 min intervals are not contiguous!
Yes, so if the intervals are not contiguous, then we should fill them, I think, with interpolation between the points that are not missing, so that the data is continuous.
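Before filling anything, it may help to see where the gaps are. A small sketch, assuming the resampled data is available as a collection of 5-minute bin start times; missing_bins is a hypothetical helper name.

```python
from datetime import datetime, timedelta


def missing_bins(bin_starts, minutes=5):
    """List the expected bin start times that are absent between first and last."""
    step = timedelta(minutes=minutes)
    present = set(bin_starts)
    if not present:
        return []
    current, last = min(present), max(present)
    missing = []
    while current <= last:
        if current not in present:
            missing.append(current)
        current += step
    return missing


bins = [
    datetime(2023, 1, 11, 15, 30),
    datetime(2023, 1, 11, 15, 35),
    datetime(2023, 1, 11, 15, 50),
]
gaps = missing_bins(bins)
```

The length of each run of missing bins is a reasonable guide to whether interpolating across it is sensible.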
@jacobbieker I think the Bad Data column in this dataset is categorical rather than continuous: out of the first thousand entries I checked, there are only 26 unique Bad Data values. What does that mean?
I think it means only those 26 timesteps are bad, not that the data values are bad. This resampling should make it continuous overall.
No, no, I should have put it more clearly. What I meant was that, for those first 1000 timesteps, the Bad Data values are repeated. I simply checked df['bad_data'].unique() for those first 1000 timesteps and got 26 values, meaning those Bad_data values are not random.
Is bad-data something that is negative? Could you just remove that data? It would be good to get a count of them, i.e. what % of them. Might want to remove a few hours around the bad data to be sure.
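A rough sketch of both steps, counting the bad entries and dropping a margin around them. It assumes bad rows carry the literal string "Bad Data" in place of a number, which the thread has not confirmed; the function names and the one-hour default margin are placeholders.

```python
from datetime import datetime, timedelta

BAD_MARKER = "Bad Data"  # assumption: how bad rows are labelled in the CSV


def bad_fraction(rows):
    """Fraction of (timestamp, value) rows whose value is the bad marker."""
    if not rows:
        return 0.0
    return sum(1 for _, value in rows if value == BAD_MARKER) / len(rows)


def drop_bad_with_margin(rows, margin=timedelta(hours=1)):
    """Drop bad rows and any row within `margin` of a bad timestamp."""
    bad_times = [ts for ts, value in rows if value == BAD_MARKER]
    return [
        (ts, value)
        for ts, value in rows
        if value != BAD_MARKER and all(abs(ts - bad) > margin for bad in bad_times)
    ]


rows = [
    (datetime(2023, 1, 11, 10, 0), -1.0),
    (datetime(2023, 1, 11, 12, 0), BAD_MARKER),
    (datetime(2023, 1, 11, 12, 30), -2.0),
    (datetime(2023, 1, 11, 14, 0), -3.0),
]
share = bad_fraction(rows)
cleaned = drop_bad_with_margin(rows)
```

The margin check compares every row against every bad timestamp, which is fine for a first look but slow on a long series with many bad entries.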
@peterdudfield @jacobbieker
Bad Data is one of the two columns in the csv file we got from Dan by email (ocf-email - link). The entire column consists only of negative values and zeros, but these negative values repeat, meaning they are discrete/categorical rather than random. It is possible to interpolate them at the missing time steps, but not with conventional statistical methods such as the mean or median.
Okay, if that's the case, then I would first drop the data with the bad data label, and then interpolate between the remaining points to a 5-minutely time series, not with the mean or median, but just a plain interpolation as a first pass.
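A sketch of what that first pass could look like, assuming linear interpolation and that bad rows are non-numeric (for example the string "Bad Data"). It interpolates from the original, unrounded timestamps onto the 5-minute grid. The name interpolate_to_grid is hypothetical and this is not the function from the PR.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def interpolate_to_grid(samples, minutes=5):
    """Linearly interpolate (timestamp, value) samples onto a regular grid.

    Non-numeric values (assumed to be bad-data labels) are dropped first.
    """
    samples = sorted((ts, v) for ts, v in samples if isinstance(v, (int, float)))
    if len(samples) < 2:
        return []
    step = timedelta(minutes=minutes)
    times = [ts for ts, _ in samples]
    values = [v for _, v in samples]
    # First grid point at or after the first sample (ceiling to the step).
    midnight = times[0].replace(hour=0, minute=0, second=0, microsecond=0)
    grid_time = midnight + -((midnight - times[0]) // step) * step
    result = []
    while grid_time <= times[-1]:
        i = bisect_left(times, grid_time)
        if times[i] == grid_time:
            result.append((grid_time, values[i]))
        else:
            # Weight by elapsed time between the two neighbouring samples.
            weight = (grid_time - times[i - 1]) / (times[i] - times[i - 1])
            result.append((grid_time, values[i - 1] + weight * (values[i] - values[i - 1])))
        grid_time += step
    return result


points = [
    (datetime(2023, 1, 11, 15, 31), -2.0),
    (datetime(2023, 1, 11, 15, 33), "Bad Data"),
    (datetime(2023, 1, 11, 15, 39), -10.0),
]
on_grid = interpolate_to_grid(points)
```

Note that this interpolates across gaps of any length, so long outages would need to be masked separately.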
Here are the plots of the original raw data and the resampled data. The 5-minutely resampling groups the timesteps that fall within each 5-minute interval (e.g. 11/01/2023 15:31:11 and 11/01/2023 15:34:23) and takes their minimum value.
Okay, could you do it without taking the minimum, just doing an interpolation of the values to 5 minutely values?
Interpolation into 5 minutely intervals has been done! Please check the PR
Okay, I've left some more comments on there. If we choose to not resample, then in general the raw data shouldn't be touched, other than to remove the bad data values, as we want to keep the raw data as close to what is coming out of the solar farm as possible. And with resampling, the comments are in the PR, but generally, shouldn't round values to the resample period we want, as that will mess up the interpolation.
Please check the PR. I have written an interpolation function and tested it.
Cool, nice job, could you share the output plots?
These are the plots from a randomly selected date
What's the y-axis on these? 'bad data' seems an odd label.
It is the negative data values we got from the client in this email. Some entries are marked as Bad Data alongside the negative values; those negative values are supposed to represent the delta change, if I'm right!
It would be great to get a y-axis with kW on it. Do you think that is possible? You might have to add those delta values up to make kW.
I would have thought Bad_data is a flag of 0 or 1, depending on whether the data is good or not. People use this to mark bad data without deleting it.
Here is the plot of the cumulative sum over time of these values for a single day, before and after interpolation. I asked Dan to ask them for more information on how we can interpret the data they provided.
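For reference, a sketch of a per-day cumulative sum, assuming the values really are deltas stored as (timestamp, delta) pairs. Whether the running total corresponds to kW is still an open question with the provider; daily_cumulative is a hypothetical name.

```python
from collections import defaultdict
from datetime import datetime
from itertools import accumulate


def daily_cumulative(samples):
    """Running total of delta values, restarted at the beginning of each day.

    Returns {date: [(timestamp, cumulative_value), ...]}.
    """
    by_day = defaultdict(list)
    for ts, delta in sorted(samples):
        by_day[ts.date()].append((ts, delta))
    result = {}
    for day, rows in by_day.items():
        totals = accumulate(delta for _, delta in rows)
        result[day] = [(ts, total) for (ts, _), total in zip(rows, totals)]
    return result


deltas = [
    (datetime(2023, 1, 11, 10, 0), -1.0),
    (datetime(2023, 1, 11, 10, 5), -2.0),
    (datetime(2023, 1, 12, 9, 0), 0.0),
    (datetime(2023, 1, 12, 9, 5), -3.0),
]
cumulative = daily_cumulative(deltas)
```

Restarting the total each day is a choice made here for plotting single days; a running total over the whole series would drop the grouping step.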
It does seem that the values are only 0 or negative. We might need to ask how this delta becomes kW, as I would have assumed there would need to be some positive values in the data, which there aren't. Unless the 0 values are the max capacity or something, and it's going down from there?
Yea, perhaps we need to get on a call with them, something about this data is all over the place
It's not just some weird days, is it? Are there normal days?
Do we have any site metadata to help us here? Like, what's the capacity of the site?
Yes, UKPN provides metadata in their database. It is accessed via the API and is downloadable in JSON format. I have written a function to request it through a URL link (check this PR), and we can also download the metadata manually from here.
The capacity is 39 MW or 41.053 MVA, and we have the location. There aren't any normal days, in the sense that there are no positive values in the CSV we got.
Make plots of the raw data, resampled data, and other individual PV sites to see if the solar farms act the same as our other PV datasets
Detailed Description
This could go into a one-pager to give a quick overview of the dataset
Possible Implementation
Just a few simple plots of