openclimatefix / nowcasting_datamodel

Datamodel for the nowcasting project

Speed up forecast value read table #116

Open peterdudfield opened 1 year ago

peterdudfield commented 1 year ago

Detailed Description

Currently it takes ~6 seconds per GSP to read the forecast value table when looking at an X-hour forecast.

The query at the moment is:

SELECT DISTINCT ON (forecast_value.target_time)
    forecast_value.created_utc AS forecast_value_created_utc,
    forecast_value.uuid AS forecast_value_uuid,
    forecast_value.target_time AS forecast_value_target_time,
    forecast_value.expected_power_generation_megawatts AS forecast_value_expected_power_generation_megawatts,
    forecast_value.forecast_id AS forecast_value_forecast_id
FROM forecast_value
JOIN forecast ON forecast.id = forecast_value.forecast_id
JOIN location ON location.id = forecast.location_id
WHERE location.gsp_id = 105
  AND forecast_value.target_time >= '2022-10-30'
  AND forecast_value.created_utc >= '2022-10-29'
  AND forecast_value.target_time - forecast_value.created_utc >= interval '240 minute'
  AND forecast_value.created_utc < '2022-10-31 13:01'
  AND forecast.created_utc >= '2022-10-30'
ORDER BY forecast_value.target_time, forecast_value.created_utc DESC;
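
Before optimising, it may help to run the query under EXPLAIN (ANALYZE, BUFFERS) - a standard PostgreSQL diagnostic, not something shown in the original issue - to see whether the time goes on disk reads, the joins, or the sort. A trimmed sketch of the same query:

-- Shows actual timings per plan node, plus shared buffer hits (cache)
-- vs reads (disk), which also explains why a repeated run is faster.
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT ON (forecast_value.target_time)
    forecast_value.target_time,
    forecast_value.expected_power_generation_megawatts
FROM forecast_value
JOIN forecast ON forecast.id = forecast_value.forecast_id
JOIN location ON location.id = forecast.location_id
WHERE location.gsp_id = 105
  AND forecast_value.target_time >= '2022-10-30'
  AND forecast_value.created_utc >= '2022-10-29'
  AND forecast_value.created_utc < '2022-10-31 13:01'
ORDER BY forecast_value.target_time, forecast_value.created_utc DESC;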

We have already partitioned this table, but there are about 36 million rows per month. This means that by the end of the month it gets a bit slow: ~6 seconds per query. Note that if you run the same query again it is quicker, because the relevant pages are then in cache.
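
For context, a minimal sketch of what month-based range partitioning looks like, inferred from the partition name forecast_value_2022_10 used later in this thread; the column list and the partition key here are assumptions, not the actual schema:

-- Hypothetical illustration only: the real forecast_value table has
-- more columns, and the actual partition key is not stated in this issue.
CREATE TABLE forecast_value_sketch (
    uuid uuid NOT NULL,
    forecast_id integer NOT NULL,
    target_time timestamp NOT NULL,
    created_utc timestamp NOT NULL,
    expected_power_generation_megawatts real
) PARTITION BY RANGE (target_time);

CREATE TABLE forecast_value_sketch_2022_10
    PARTITION OF forecast_value_sketch
    FOR VALUES FROM ('2022-10-01') TO ('2022-11-01');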

Context

Possible Implementation

peterdudfield commented 1 year ago

Test results

These tests were run on different GSP locations, with no datetime interval:

gsp_id  time [s]
35      15
99      10
189     15
250     12
259     6
121     18
200     18
150     12
2       2
4       2
6       1
8       2

mean = 7

Another run, also with no datetime interval:

gsp_id  time [s]
67      10
72      2
267     20
135     15
259     6
301     10
316     4
100     8
10      19
1       3
3       2
5       2
7       2

mean = 8

I tried to make an index, but it took > 10 minutes:

CREATE INDEX forecast_value_2022_10_datetime_interval_id
    ON forecast_value_2022_10 (datetime_interval_id);
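
One way to make index builds less disruptive (a standard PostgreSQL option, not mentioned in the issue) is CREATE INDEX CONCURRENTLY, which avoids blocking writes while the index builds, at the cost of a slower build:

-- Does not block INSERT/UPDATE/DELETE on the table while building.
-- Note: it cannot be run inside a transaction block.
CREATE INDEX CONCURRENTLY forecast_value_2022_10_datetime_interval_id
    ON forecast_value_2022_10 (datetime_interval_id);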

EXPLAIN ANALYZE of the query, now with the datetime_interval join included:

EXPLAIN ANALYZE
SELECT DISTINCT ON (forecast_value.target_time)
    forecast_value.created_utc AS forecast_value_created_utc,
    forecast_value.uuid AS forecast_value_uuid,
    forecast_value.target_time AS forecast_value_target_time,
    forecast_value.expected_power_generation_megawatts AS forecast_value_expected_power_generation_megawatts,
    forecast_value.forecast_id AS forecast_value_forecast_id
FROM forecast_value
JOIN forecast ON forecast.id = forecast_value.forecast_id
JOIN location ON location.id = forecast.location_id
JOIN datetime_interval ON forecast_value.datetime_interval_id = datetime_interval.id
WHERE location.gsp_id = 211
  -- AND datetime_interval.start_datetime_utc >= '2022-10-25'
  -- AND datetime_interval.end_datetime_utc <= '2022-10-26'
  AND forecast_value.target_time >= '2022-10-25'
  AND forecast_value.target_time <= '2022-10-26'
  AND forecast_value.created_utc >= '2022-10-25'
  AND forecast_value.target_time - forecast_value.created_utc >= interval '240 minute'
  AND forecast_value.created_utc < '2022-10-26 13:01'
  AND forecast.created_utc >= '2022-10-25'
ORDER BY forecast_value.target_time, forecast_value.created_utc DESC;
peterdudfield commented 1 year ago

Could also add an index on created_utc:

CREATE INDEX forecast_value_2022_10_created_utc
    ON forecast_value_2022_10 (created_utc DESC);

It took 20 minutes, but maybe it could have been quicker.
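
Since DISTINCT ON picks the newest created_utc per target_time, and the ORDER BY is target_time ASC, created_utc DESC, a composite index matching exactly that sort order might let the planner return rows pre-sorted and skip the sort step entirely. A sketch, not tried in this issue:

-- Matches the ORDER BY (target_time ASC, created_utc DESC), so the
-- planner may satisfy the ordering for DISTINCT ON from the index.
CREATE INDEX CONCURRENTLY forecast_value_2022_10_target_created
    ON forecast_value_2022_10 (target_time, created_utc DESC);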

peterdudfield commented 1 year ago

Other options:

  1. Try changing double precision to real - https://www.postgresql.org/docs/current/datatype-numeric.html - done
  2. Look into CLUSTER on an index (see the example after this list)
  3. Create a table that only stores the last week of data, then read from that - https://github.com/openclimatefix/nowcasting_datamodel/issues/124
  4. CREATE INDEX idx_forecast_created_utc ON forecast(created_utc) (THIS SEEMS TO HELP). An index was also added on forecast_value.created_utc; they both probably help.
  5. Reorder the query to filter on sub-tables first, rather than joining and then filtering (NOT MUCH DIFFERENCE; see the sketch after this list)
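
For option 2, CLUSTER physically rewrites the table in index order, which can cut random I/O for range scans; it takes an exclusive lock on the table while it runs, so it would need a maintenance window. Example only, using the created_utc index created above:

CLUSTER forecast_value_2022_10 USING forecast_value_2022_10_created_utc;

For option 5, one possible reading of the filter-first rewrite is to resolve the location and forecast filters in a subquery before touching forecast_value; the values below reuse the gsp_id = 105 example from the first query:

SELECT DISTINCT ON (fv.target_time)
    fv.created_utc,
    fv.uuid,
    fv.target_time,
    fv.expected_power_generation_megawatts,
    fv.forecast_id
FROM forecast_value fv
JOIN (
    -- Resolve the small forecast/location side first.
    SELECT forecast.id
    FROM forecast
    JOIN location ON location.id = forecast.location_id
    WHERE location.gsp_id = 105
      AND forecast.created_utc >= '2022-10-30'
) recent_forecasts ON recent_forecasts.id = fv.forecast_id
WHERE fv.target_time >= '2022-10-30'
  AND fv.created_utc >= '2022-10-29'
  AND fv.created_utc < '2022-10-31 13:01'
  AND fv.target_time - fv.created_utc >= interval '240 minute'
ORDER BY fv.target_time, fv.created_utc DESC;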
peterdudfield commented 1 year ago

Would be good to benchmark against production to check the changes are having the desired effect.

peterdudfield commented 1 year ago

Option 3 above should mean there is at most 7 days of data to read. This should make it fast to load. Let's review in 1 week's time.
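
A minimal sketch of how option 3 might look, with hypothetical names (the real design is tracked in https://github.com/openclimatefix/nowcasting_datamodel/issues/124): keep a small copy of recent rows and trim it on a schedule, so reads never scan the full month.

-- Hypothetical: a small unpartitioned table with the same shape.
CREATE TABLE forecast_value_last_seven_days
    (LIKE forecast_value INCLUDING ALL);

-- Run on a schedule (e.g. daily) to keep at most 7 days of data.
DELETE FROM forecast_value_last_seven_days
WHERE created_utc < now() - interval '7 days';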