michaelgambold opened 2 years ago
The information that we want to show is the breakdown of green vs fossil fuels. Some sample data for a given timestamp would be the following (ignoring the missing data issue for the moment).
// 1 minute data/raw data resolution
{
coal: 100,
gas: 200,
solar: 50,
hydro: 150
}
// we would expect the following
{
fossil: 300,
green: 200
}
This part is easy, as it is just a matter of grouping by the fuel's type ("green" vs "fossil") and performing sum(value).
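As a minimal sketch of that grouping at the raw resolution (using the data_fact / fuel_dimension schema that appears in the queries further below; illustrative only):
select df.timestamp,
       fd.type as "fuelType", -- 'green' | 'fossil' | 'unknown'
       sum(df.value) as total
from data_fact as df
join fuel_dimension as fd on df.fuel_id = fd.id
group by df.timestamp, fd.type
order by df.timestamp, fd.type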
When we deal with aggregated data there can be data gaps. This means that for a given series/fuel (say coal and gas) the min(timestamp) may have different values when looking at a certain period (say 1 hour). E.g. the min(timestamp) values for the previously mentioned fuels could be 10:00 and 10:15. This means you cannot simply add the aggregated series values together, as the timestamps are different.
Although we could do some fuzzy logic to work out whether we should add them together, this would need to be aware of which period is selected so each value gets added to the correct bucket. I.e. for a 1 hour period any minute in that hour gets added to the hour, for a 15 minute period anything from minute 0 to 15 goes into the first quarter, and so on.
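In other words, the bucketing rule truncates each timestamp to the start of the selected period before summing. A minimal sketch of the idea in SQL, assuming PostgreSQL's date_trunc (a 15 minute bucket would need extra date arithmetic):
-- bucket every reading into the hour it falls in, then sum per bucket
select date_trunc('hour', df.timestamp) as bucket,
       sum(df.value) as total
from data_fact as df
group by bucket
order by bucket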
It would be better if this was handled in the backend. However, this gets difficult when dealing with time periods that are not "raw". When looking at an individual series we can simply take the average of the fuel's value over that period. We want the sum of those averages, but the same problem of aligning timestamps exists.
To make it more resilient we can sum(value) for the given period. This would avoid the issue of the timestamps not aligning, but the values would get huge the larger the period (e.g. 1 day).
What we can do is convert the sum to a percentage, i.e. sum(green values) / sum(all values), and do the same for the "fossil" and "unknown" fuel types. With the sample data above that gives green = 200 / 500 = 40% and fossil = 300 / 500 = 60%. Regardless of the period and data gaps this gives an accurate green vs fossil vs unknown breakdown.
Below is a sample SQL statement to get the totals for hourly period data. It uses subqueries to get the green, fossil, unknown and overall sums. However, on sample data this takes 4.6 seconds to run for 769 timestamps / 30,150 records. The data would then need to be mapped and the percentages calculated (i.e. fossilSum / allSum).
select
min(timestamp) as timestamp,
df.region_id as "regionId",
df.power_id as "powerId",
(
select sum(value)
from data_fact as sdf
join date_dimension as sdd on sdf.date_id = sdd.id
join time_dimension as std on sdf.time_id = std.id
join fuel_dimension as sfd on sdf.fuel_id = sfd.id
where sdf.region_id = df.region_id
and sdf.power_id = df.power_id
and sdd.year = dd.year
and sdd.month_number = dd.month_number
and sdd.day_of_month = dd.day_of_month
and std.hour = td.hour
and sfd.type = 'green'
) as "greenSum",
(
select sum(value)
from data_fact as sdf
join date_dimension as sdd on sdf.date_id = sdd.id
join time_dimension as std on sdf.time_id = std.id
join fuel_dimension as sfd on sdf.fuel_id = sfd.id
where sdf.region_id = df.region_id
and sdf.power_id = df.power_id
and sdd.year = dd.year
and sdd.month_number = dd.month_number
and sdd.day_of_month = dd.day_of_month
and std.hour = td.hour
and sfd.type = 'fossil'
) as "fossilSum",
(
select sum(value)
from data_fact as sdf
join date_dimension as sdd on sdf.date_id = sdd.id
join time_dimension as std on sdf.time_id = std.id
join fuel_dimension as sfd on sdf.fuel_id = sfd.id
where sdf.region_id = df.region_id
and sdf.power_id = df.power_id
and sdd.year = dd.year
and sdd.month_number = dd.month_number
and sdd.day_of_month = dd.day_of_month
and std.hour = td.hour
and sfd.type = 'unknown'
) as "unknownSum",
(
select sum(value)
from data_fact as sdf
join date_dimension as sdd on sdf.date_id = sdd.id
join time_dimension as std on sdf.time_id = std.id
join fuel_dimension as sfd on sdf.fuel_id = sfd.id
where sdf.region_id = df.region_id
and sdf.power_id = df.power_id
and sdd.year = dd.year
and sdd.month_number = dd.month_number
and sdd.day_of_month = dd.day_of_month
and std.hour = td.hour
) as "allSum"
from data_fact as df
join date_dimension as dd on df.date_id = dd.id
join time_dimension as td on df.time_id = td.id
group by dd.year, dd.month_number, dd.day_of_month, td.hour, df.region_id, df.power_id
order by dd.year, dd.month_number, dd.day_of_month, td.hour, df.region_id, df.power_id
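For completeness, a sketch of that percentage mapping step (in practice it would happen in application code after the query runs). hourly_totals below is a hypothetical view standing in for the query above, and the float cast / nullif guard are PostgreSQL assumptions:
select timestamp,
       "regionId",
       "powerId",
       -- nullif avoids division by zero when a bucket has no data at all
       "greenSum"::float / nullif("allSum", 0) as "greenPct",
       "fossilSum"::float / nullif("allSum", 0) as "fossilPct",
       "unknownSum"::float / nullif("allSum", 0) as "unknownPct"
from hourly_totals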
After testing, a performance improvement can be made by not performing a subquery for the "allSum" data, as this is what the parent query already aggregates (i.e. there is no need to re-query it).
In my local dev environment this resulted in performance improvements (timings for the given periods are noted after the updated query below):
select
min(timestamp) as timestamp,
df.region_id as "regionId",
df.power_id as "powerId",
(
select sum(value)
from data_fact as sdf
join date_dimension as sdd on sdf.date_id = sdd.id
join time_dimension as std on sdf.time_id = std.id
join fuel_dimension as sfd on sdf.fuel_id = sfd.id
where sdf.region_id = df.region_id
and sdf.power_id = df.power_id
and sdd.year = dd.year
and sdd.month_number = dd.month_number
and sdd.day_of_month = dd.day_of_month
and std.hour = td.hour
and sfd.type = 'green'
) as "greenSum",
(
select sum(value)
from data_fact as sdf
join date_dimension as sdd on sdf.date_id = sdd.id
join time_dimension as std on sdf.time_id = std.id
join fuel_dimension as sfd on sdf.fuel_id = sfd.id
where sdf.region_id = df.region_id
and sdf.power_id = df.power_id
and sdd.year = dd.year
and sdd.month_number = dd.month_number
and sdd.day_of_month = dd.day_of_month
and std.hour = td.hour
and sfd.type = 'fossil'
) as "fossilSum",
(
select sum(value)
from data_fact as sdf
join date_dimension as sdd on sdf.date_id = sdd.id
join time_dimension as std on sdf.time_id = std.id
join fuel_dimension as sfd on sdf.fuel_id = sfd.id
where sdf.region_id = df.region_id
and sdf.power_id = df.power_id
and sdd.year = dd.year
and sdd.month_number = dd.month_number
and sdd.day_of_month = dd.day_of_month
and std.hour = td.hour
and sfd.type = 'unknown'
) as "unknownSum",
sum(value) as "allSum" -- sum here instead of a subquery
from data_fact as df
join date_dimension as dd on df.date_id = dd.id
join time_dimension as td on df.time_id = td.id
group by dd.year, dd.month_number, dd.day_of_month, td.hour, df.region_id, df.power_id
order by dd.year, dd.month_number, dd.day_of_month, td.hour, df.region_id, df.power_id
With regards to timing, the slowest period to get the data for is the 5 minute period. This takes about 20 seconds when getting all regions & power for a 2 week period. When you query only the generation for a single region (which will be the primary use case) it is < 3 seconds in my local environment.
I tried different queries (using WITH statements and joins) but crashed Beekeeper Studio doing this, so a better query has not been found.
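For reference, a single-pass alternative to the correlated subqueries would be conditional aggregation. This is only a sketch: it assumes PostgreSQL's FILTER clause and is untested against this dataset.
-- one scan of data_fact instead of three correlated subqueries per group
select min(df.timestamp) as timestamp,
       df.region_id as "regionId",
       df.power_id as "powerId",
       sum(df.value) filter (where fd.type = 'green') as "greenSum",
       sum(df.value) filter (where fd.type = 'fossil') as "fossilSum",
       sum(df.value) filter (where fd.type = 'unknown') as "unknownSum",
       sum(df.value) as "allSum"
from data_fact as df
join date_dimension as dd on df.date_id = dd.id
join time_dimension as td on df.time_id = td.id
join fuel_dimension as fd on df.fuel_id = fd.id
group by dd.year, dd.month_number, dd.day_of_month, td.hour, df.region_id, df.power_id
order by dd.year, dd.month_number, dd.day_of_month, td.hour, df.region_id, df.power_id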
Addressed in PR https://github.com/michaelgambold/energy-supply-demand/pull/29
After deploying to prod the performance was atrocious. The 5 minute grouping (the slowest of the queries before) took 1 minute and 20 seconds to complete.
PR https://github.com/michaelgambold/energy-supply-demand/pull/31 has been created to add extra indexes to the DB to try and address this
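The actual index definitions are in the PR; the statements below are purely an illustration of the kind of composite index that helps correlated subqueries like these (index names and column orders are assumptions, not copied from PR #31):
-- hypothetical examples only; see PR #31 for the real definitions
create index idx_data_fact_region_power_date on data_fact (region_id, power_id, date_id);
create index idx_data_fact_fuel_id on data_fact (fuel_id);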
Further improvements have been made to the queries by removing the need to join to the date_dimension table. Also, a weird bug when calculating the overall sum across fossil/green/unknown was fixed by using the same style of subquery as the green sum to get the sum of everything.
select
min(timestamp) as timestamp,
df.region_id as "regionId",
df.power_id as "powerId",
(
select sum(value)
from data_fact as sdf
join time_dimension as std on sdf.time_id = std.id
join fuel_dimension as sfd on sdf.fuel_id = sfd.id
where sdf.region_id = df.region_id
and sdf.power_id = df.power_id
and sdf.date_id = df.date_id
and std.hour = td.hour
and std.quarter_of_hour = td.quarter_of_hour
and sfd.type = 'green'
) as "greenSum",
(
select sum(value)
from data_fact as sdf
join time_dimension as std on sdf.time_id = std.id
join fuel_dimension as sfd on sdf.fuel_id = sfd.id
where sdf.region_id = df.region_id
and sdf.power_id = df.power_id
and sdf.date_id = df.date_id
and std.hour = td.hour
and std.quarter_of_hour = td.quarter_of_hour
and sfd.type = 'fossil'
) as "fossilSum",
(
select sum(value)
from data_fact as sdf
join time_dimension as std on sdf.time_id = std.id
join fuel_dimension as sfd on sdf.fuel_id = sfd.id
where sdf.region_id = df.region_id
and sdf.power_id = df.power_id
and sdf.date_id = df.date_id
and std.hour = td.hour
and std.quarter_of_hour = td.quarter_of_hour
and sfd.type = 'unknown'
) as "unknownSum",
(
select sum(value)
from data_fact as sdf
join time_dimension as std on sdf.time_id = std.id
join fuel_dimension as sfd on sdf.fuel_id = sfd.id
where sdf.region_id = df.region_id
and sdf.power_id = df.power_id
and sdf.date_id = df.date_id
and std.hour = td.hour
and std.quarter_of_hour = td.quarter_of_hour
) as "sum"
from data_fact as df
join time_dimension as td on df.time_id = td.id
group by df.date_id, td.hour, td.quarter_of_hour, df.region_id, df.power_id
order by min(timestamp), df.region_id, df.power_id
Locked so people can see the history of the data selection issues.
When performing the rollup of fuel types (green, fossil, unknown) there may be data points that do not align with one another. This is caused by the way the rollups work when there is missing data.
As an example, there can be a data point for "small solar" at 07:30. With a 6 hour period selected there should be data points at 04:00 and 10:00 (i.e. 6 hours apart). However, as there is missing/no data for small solar from 04:00 to 07:30, the min(timestamp) from the aggregation function returns 07:30 instead of 04:00.
This also means the rollup number is not accurate when summing the generation data in the UI to get the green vs fossil chart. To fix this, the data will need to be queried through the API so the calculation can be done correctly there, as in the sketch below.
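One way to make that calculation robust to gaps is to bucket on the period start rather than min(timestamp), so a series that starts late still lands in the same bucket as the other fuels. A sketch with a hard-coded 6 hour bucket, assuming PostgreSQL (21600 is 6 hours in seconds; buckets here are aligned to the Unix epoch, so aligning to the query range start would need an extra offset):
-- align each reading to the start of its 6 hour bucket instead of relying
-- on min(timestamp), so a gappy series still rolls up with the others
select to_timestamp(floor(extract(epoch from df.timestamp) / 21600) * 21600) as bucket,
       fd.type as "fuelType",
       sum(df.value) as total
from data_fact as df
join fuel_dimension as fd on df.fuel_id = fd.id
group by bucket, "fuelType"
order by bucket, "fuelType"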