michaelgambold opened 2 years ago
The information that we want to show is the breakdown of green vs fossil fuels. Some sample data for a given timestamp would be the following (ignoring the missing data issue for the moment).
// 1 minute data/raw data resolution
{
coal: 100,
gas: 200,
solar: 50,
hydro: 150
}
// we would expect the following
{
fossil: 300,
green: 200
}
This part is easy, as it is just a matter of grouping by the fuel's type ("green" vs "fossil") and performing sum(value).
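As a minimal sketch of that grouping at the raw resolution (using the data_fact / fuel_dimension schema that appears in the queries further below; illustrative only):
select df.timestamp,
       fd.type as "fuelType", -- 'green' | 'fossil' | 'unknown'
       sum(df.value) as total
from data_fact as df
join fuel_dimension as fd on df.fuel_id = fd.id
group by df.timestamp, fd.type
order by df.timestamp, fd.type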
When we deal with aggregated data there can be data gaps. This means that for a given series/fuel (say coal and gas) the min(timestamp) may have different values when looking at a certain period (say 1 hour). E.g. the min(timestamp) values for the previously mentioned fuels could be 10:00 and 10:15. This means you cannot simply add the aggregated series values together, as the timestamps are different.
Although we could do some fuzzy logic to work out whether we should add them together, this would need to be aware of which period is selected so each value gets added to the correct bucket. I.e. for a 1 hour period any minute in that hour gets added to the hour, for a 15 minute period anything from minute 0 to 15 goes into the first quarter, and so on.
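In other words, the bucketing rule truncates each timestamp to the start of the selected period before summing. A minimal sketch of the idea in SQL, assuming PostgreSQL's date_trunc (a 15 minute bucket would need extra date arithmetic):
-- bucket every reading into the hour it falls in, then sum per bucket
select date_trunc('hour', df.timestamp) as bucket,
       sum(df.value) as total
from data_fact as df
group by bucket
order by bucket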
It would be better if this was handled in the backend. However, this gets difficult when dealing with time periods that are not "raw". When looking at an individual series we can simply take the average of the fuel's value over that period. We want the sum of those averages, but the same problem of aligning timestamps exists.
To make it more resilient we can sum(value) for the given period. This would avoid the issue of the timestamps not aligning, but the values would get huge the larger the period (e.g. 1 day).
What we can do is convert the sum to a percentage, i.e. sum(green values) / sum(all values), and do the same for the "fossil" and "unknown" fuel types. With the sample data above that gives green = 200 / 500 = 40% and fossil = 300 / 500 = 60%. Regardless of the period and data gaps this gives an accurate green vs fossil vs unknown breakdown.
Below is a sample SQL statement to get the totals for hourly period data. It uses subqueries to get the green, fossil, unknown and overall sums. However, on sample data this takes 4.6 seconds to run for 769 timestamps / 30,150 records. The data would then need to be mapped and the percentages calculated (i.e. fossilSum / allSum).
select
min(timestamp) as timestamp,
df.region_id as "regionId",
df.power_id as "powerId",
(
select sum(value)
from data_fact as sdf
join date_dimension as sdd on sdf.date_id = sdd.id
join time_dimension as std on sdf.time_id = std.id
join fuel_dimension as sfd on sdf.fuel_id = sfd.id
where sdf.region_id = df.region_id
and sdf.power_id = df.power_id
and sdd.year = dd.year
and sdd.month_number = dd.month_number
and sdd.day_of_month = dd.day_of_month
and std.hour = td.hour
and sfd.type = 'green'
) as "greenSum",
(
select sum(value)
from data_fact as sdf
join date_dimension as sdd on sdf.date_id = sdd.id
join time_dimension as std on sdf.time_id = std.id
join fuel_dimension as sfd on sdf.fuel_id = sfd.id
where sdf.region_id = df.region_id
and sdf.power_id = df.power_id
and sdd.year = dd.year
and sdd.month_number = dd.month_number
and sdd.day_of_month = dd.day_of_month
and std.hour = td.hour
and sfd.type = 'fossil'
) as "fossilSum",
(
select sum(value)
from data_fact as sdf
join date_dimension as sdd on sdf.date_id = sdd.id
join time_dimension as std on sdf.time_id = std.id
join fuel_dimension as sfd on sdf.fuel_id = sfd.id
where sdf.region_id = df.region_id
and sdf.power_id = df.power_id
and sdd.year = dd.year
and sdd.month_number = dd.month_number
and sdd.day_of_month = dd.day_of_month
and std.hour = td.hour
and sfd.type = 'unknown'
) as "unknownSum",
(
select sum(value)
from data_fact as sdf
join date_dimension as sdd on sdf.date_id = sdd.id
join time_dimension as std on sdf.time_id = std.id
join fuel_dimension as sfd on sdf.fuel_id = sfd.id
where sdf.region_id = df.region_id
and sdf.power_id = df.power_id
and sdd.year = dd.year
and sdd.month_number = dd.month_number
and sdd.day_of_month = dd.day_of_month
and std.hour = td.hour
) as "allSum"
from data_fact as df
join date_dimension as dd on df.date_id = dd.id
join time_dimension as td on df.time_id = td.id
group by dd.year, dd.month_number, dd.day_of_month, td.hour, df.region_id, df.power_id
order by dd.year, dd.month_number, dd.day_of_month, td.hour, df.region_id, df.power_id
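For completeness, a sketch of that percentage mapping step (in practice it would happen in application code after the query runs). hourly_totals below is a hypothetical view standing in for the query above, and the float cast / nullif guard are PostgreSQL assumptions:
select timestamp,
       "regionId",
       "powerId",
       -- nullif avoids division by zero when a bucket has no data at all
       "greenSum"::float / nullif("allSum", 0) as "greenPct",
       "fossilSum"::float / nullif("allSum", 0) as "fossilPct",
       "unknownSum"::float / nullif("allSum", 0) as "unknownPct"
from hourly_totals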
After testing, a performance improvement can be made by not performing a subquery for the "allSum" data, as this is what the parent query already aggregates (i.e. there is no need to re-query it).
In my local dev environment this resulted in performance improvements (timings for the given periods are noted after the updated query below):
select
min(timestamp) as timestamp,
df.region_id as "regionId",
df.power_id as "powerId",
(
select sum(value)
from data_fact as sdf
join date_dimension as sdd on sdf.date_id = sdd.id
join time_dimension as std on sdf.time_id = std.id
join fuel_dimension as sfd on sdf.fuel_id = sfd.id
where sdf.region_id = df.region_id
and sdf.power_id = df.power_id
and sdd.year = dd.year
and sdd.month_number = dd.month_number
and sdd.day_of_month = dd.day_of_month
and std.hour = td.hour
and sfd.type = 'green'
) as "greenSum",
(
select sum(value)
from data_fact as sdf
join date_dimension as sdd on sdf.date_id = sdd.id
join time_dimension as std on sdf.time_id = std.id
join fuel_dimension as sfd on sdf.fuel_id = sfd.id
where sdf.region_id = df.region_id
and sdf.power_id = df.power_id
and sdd.year = dd.year
and sdd.month_number = dd.month_number
and sdd.day_of_month = dd.day_of_month
and std.hour = td.hour
and sfd.type = 'fossil'
) as "fossilSum",
(
select sum(value)
from data_fact as sdf
join date_dimension as sdd on sdf.date_id = sdd.id
join time_dimension as std on sdf.time_id = std.id
join fuel_dimension as sfd on sdf.fuel_id = sfd.id
where sdf.region_id = df.region_id
and sdf.power_id = df.power_id
and sdd.year = dd.year
and sdd.month_number = dd.month_number
and sdd.day_of_month = dd.day_of_month
and std.hour = td.hour
and sfd.type = 'unknown'
) as "unknownSum",
sum(value) as "allSum" -- sum here instead of a subquery
from data_fact as df
join date_dimension as dd on df.date_id = dd.id
join time_dimension as td on df.time_id = td.id
group by dd.year, dd.month_number, dd.day_of_month, td.hour, df.region_id, df.power_id
order by dd.year, dd.month_number, dd.day_of_month, td.hour, df.region_id, df.power_id
With regards to timing, the slowest period to get the data for is the 5 minute period. This takes about 20 seconds when getting all regions & power for a 2 week period. When you query only the generation for a single region (which will be the primary use case) it is < 3 seconds in my local environment.
I tried different queries (using WITH statements and joins) but crashed Beekeeper Studio doing this, so a better query has not been found.
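For reference, a single-pass alternative to the correlated subqueries would be conditional aggregation. This is only a sketch: it assumes PostgreSQL's FILTER clause and is untested against this dataset.
-- one scan of data_fact instead of three correlated subqueries per group
select min(df.timestamp) as timestamp,
       df.region_id as "regionId",
       df.power_id as "powerId",
       sum(df.value) filter (where fd.type = 'green') as "greenSum",
       sum(df.value) filter (where fd.type = 'fossil') as "fossilSum",
       sum(df.value) filter (where fd.type = 'unknown') as "unknownSum",
       sum(df.value) as "allSum"
from data_fact as df
join date_dimension as dd on df.date_id = dd.id
join time_dimension as td on df.time_id = td.id
join fuel_dimension as fd on df.fuel_id = fd.id
group by dd.year, dd.month_number, dd.day_of_month, td.hour, df.region_id, df.power_id
order by dd.year, dd.month_number, dd.day_of_month, td.hour, df.region_id, df.power_id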
Addressed in PR https://github.com/michaelgambold/energy-supply-demand/pull/29
After deploying to prod the performance was atrocious. The 5 minute grouping (the slowest of the queries before) took 1 minute and 20 seconds to complete.
PR https://github.com/michaelgambold/energy-supply-demand/pull/31 has been created to add extra indexes to the DB to try and address this
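The actual index definitions are in the PR; the statements below are purely an illustration of the kind of composite index that helps correlated subqueries like these (index names and column orders are assumptions, not copied from PR #31):
-- hypothetical examples only; see PR #31 for the real definitions
create index idx_data_fact_region_power_date on data_fact (region_id, power_id, date_id);
create index idx_data_fact_fuel_id on data_fact (fuel_id);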
Further improvements have been made to the queries by removing the need to join to the date_dimension table. Also, a weird bug when calculating the overall sum across fossil/green/unknown was fixed by using the same style of subquery as the green sum to get the sum of everything.
select
min(timestamp) as timestamp,
df.region_id as "regionId",
df.power_id as "powerId",
(
select sum(value)
from data_fact as sdf
join time_dimension as std on sdf.time_id = std.id
join fuel_dimension as sfd on sdf.fuel_id = sfd.id
where sdf.region_id = df.region_id
and sdf.power_id = df.power_id
and sdf.date_id = df.date_id
and std.hour = td.hour
and std.quarter_of_hour = td.quarter_of_hour
and sfd.type = 'green'
) as "greenSum",
(
select sum(value)
from data_fact as sdf
join time_dimension as std on sdf.time_id = std.id
join fuel_dimension as sfd on sdf.fuel_id = sfd.id
where sdf.region_id = df.region_id
and sdf.power_id = df.power_id
and sdf.date_id = df.date_id
and std.hour = td.hour
and std.quarter_of_hour = td.quarter_of_hour
and sfd.type = 'fossil'
) as "fossilSum",
(
select sum(value)
from data_fact as sdf
join time_dimension as std on sdf.time_id = std.id
join fuel_dimension as sfd on sdf.fuel_id = sfd.id
where sdf.region_id = df.region_id
and sdf.power_id = df.power_id
and sdf.date_id = df.date_id
and std.hour = td.hour
and std.quarter_of_hour = td.quarter_of_hour
and sfd.type = 'unknown'
) as "unknownSum",
(
select sum(value)
from data_fact as sdf
join time_dimension as std on sdf.time_id = std.id
join fuel_dimension as sfd on sdf.fuel_id = sfd.id
where sdf.region_id = df.region_id
and sdf.power_id = df.power_id
and sdf.date_id = df.date_id
and std.hour = td.hour
and std.quarter_of_hour = td.quarter_of_hour
) as "sum"
from data_fact as df
join time_dimension as td on df.time_id = td.id
group by df.date_id, td.hour, td.quarter_of_hour, df.region_id, df.power_id
order by min(timestamp), df.region_id, df.power_id
Locked so people can see the history of the data selection issues.
When performing the rollup of fuel types (green, fossil, unknown) there may be data points that do not align with one another. This is caused by the way the rollups work when there is missing data.
As an example, there can be a data point for "small solar" at 07:30. With a 6 hour period selected there should be data points at 04:00 and 10:00 (i.e. 6 hours apart). However, as there is missing/no data for small solar from 04:00 to 07:30, the min(timestamp) from the aggregation function returns 07:30 instead of 04:00.
This also means the rollup number is not accurate when summing the generation data in the UI to get the green vs fossil chart. To fix this, the data will need to be queried through the API so the calculation can be done correctly there, as in the sketch below.
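One way to make that calculation robust to gaps is to bucket on the period start rather than min(timestamp), so a series that starts late still lands in the same bucket as the other fuels. A sketch with a hard-coded 6 hour bucket, assuming PostgreSQL (21600 is 6 hours in seconds; buckets here are aligned to the Unix epoch, so aligning to the query range start would need an extra offset):
-- align each reading to the start of its 6 hour bucket instead of relying
-- on min(timestamp), so a gappy series still rolls up with the others
select to_timestamp(floor(extract(epoch from df.timestamp) / 21600) * 21600) as bucket,
       fd.type as "fuelType",
       sum(df.value) as total
from data_fact as df
join fuel_dimension as fd on df.fuel_id = fd.id
group by bucket, "fuelType"
order by bucket, "fuelType"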