openclimatefix / nowcasting_datamodel

Datamodel for the nowcasting project

Speed up forecast value read table #116

Open peterdudfield opened 1 year ago

peterdudfield commented 1 year ago

Detailed Description

Currently it takes ~6 seconds per GSP to read the forecast value table when looking at an X-hour forecast.

The query at the moment is:

SELECT DISTINCT ON (forecast_value.target_time)
    forecast_value.created_utc AS forecast_value_created_utc,
    forecast_value.uuid AS forecast_value_uuid,
    forecast_value.target_time AS forecast_value_target_time,
    forecast_value.expected_power_generation_megawatts AS forecast_value_expected_power_generation_megawatts,
    forecast_value.forecast_id AS forecast_value_forecast_id
FROM forecast_value
JOIN forecast ON forecast.id = forecast_value.forecast_id
JOIN location ON location.id = forecast.location_id
WHERE location.gsp_id = 105
  AND forecast_value.target_time >= '2022-10-30'
  AND forecast_value.created_utc >= '2022-10-29'
  AND forecast_value.target_time - forecast_value.created_utc >= interval '240 minute'
  AND forecast_value.created_utc < '2022-10-31 13:01'
  AND forecast.created_utc >= '2022-10-30'
ORDER BY forecast_value.target_time, forecast_value.created_utc DESC;
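
Before optimising, it may help to run the query under EXPLAIN (ANALYZE, BUFFERS) - a standard PostgreSQL diagnostic, not something shown in the original issue - to see whether the time goes on disk reads, the joins, or the sort. A trimmed sketch of the same query:

-- Shows actual timings per plan node, plus shared buffer hits (cache)
-- vs reads (disk), which also explains why a repeated run is faster.
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT ON (forecast_value.target_time)
    forecast_value.target_time,
    forecast_value.expected_power_generation_megawatts
FROM forecast_value
JOIN forecast ON forecast.id = forecast_value.forecast_id
JOIN location ON location.id = forecast.location_id
WHERE location.gsp_id = 105
  AND forecast_value.target_time >= '2022-10-30'
  AND forecast_value.created_utc >= '2022-10-29'
  AND forecast_value.created_utc < '2022-10-31 13:01'
ORDER BY forecast_value.target_time, forecast_value.created_utc DESC;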

We have already partitioned this table, but there are about 36 million rows per month. This means that by the end of the month it gets a bit slow: ~6 seconds per query. Note that if you run the same query again it is quicker, because the relevant pages are then in cache.
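
For context, a minimal sketch of what month-based range partitioning looks like, inferred from the partition name forecast_value_2022_10 used later in this thread; the column list and the partition key here are assumptions, not the actual schema:

-- Hypothetical illustration only: the real forecast_value table has
-- more columns, and the actual partition key is not stated in this issue.
CREATE TABLE forecast_value_sketch (
    uuid uuid NOT NULL,
    forecast_id integer NOT NULL,
    target_time timestamp NOT NULL,
    created_utc timestamp NOT NULL,
    expected_power_generation_megawatts real
) PARTITION BY RANGE (target_time);

CREATE TABLE forecast_value_sketch_2022_10
    PARTITION OF forecast_value_sketch
    FOR VALUES FROM ('2022-10-01') TO ('2022-11-01');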

Context

Possible Implementation

peterdudfield commented 1 year ago

Test results

These tests were run on different GSP locations, with no datetime interval:

gsp_id  time [s]
35      15
99      10
189     15
250     12
259     6
121     18
200     18
150     12
2       2
4       2
6       1
8       2

mean = 7

Another run, also with no datetime interval:

gsp_id  time [s]
67      10
72      2
267     20
135     15
259     6
301     10
316     4
100     8
10      19
1       3
3       2
5       2
7       2

mean = 8

I tried to make an index, but it took > 10 minutes:

CREATE INDEX forecast_value_2022_10_datetime_interval_id
    ON forecast_value_2022_10 (datetime_interval_id);
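
One way to make index builds less disruptive (a standard PostgreSQL option, not mentioned in the issue) is CREATE INDEX CONCURRENTLY, which avoids blocking writes while the index builds, at the cost of a slower build:

-- Does not block INSERT/UPDATE/DELETE on the table while building.
-- Note: it cannot be run inside a transaction block.
CREATE INDEX CONCURRENTLY forecast_value_2022_10_datetime_interval_id
    ON forecast_value_2022_10 (datetime_interval_id);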

EXPLAIN ANALYZE of the query, now with the datetime_interval join included:

EXPLAIN ANALYZE
SELECT DISTINCT ON (forecast_value.target_time)
    forecast_value.created_utc AS forecast_value_created_utc,
    forecast_value.uuid AS forecast_value_uuid,
    forecast_value.target_time AS forecast_value_target_time,
    forecast_value.expected_power_generation_megawatts AS forecast_value_expected_power_generation_megawatts,
    forecast_value.forecast_id AS forecast_value_forecast_id
FROM forecast_value
JOIN forecast ON forecast.id = forecast_value.forecast_id
JOIN location ON location.id = forecast.location_id
JOIN datetime_interval ON forecast_value.datetime_interval_id = datetime_interval.id
WHERE location.gsp_id = 211
  -- AND datetime_interval.start_datetime_utc >= '2022-10-25'
  -- AND datetime_interval.end_datetime_utc <= '2022-10-26'
  AND forecast_value.target_time >= '2022-10-25'
  AND forecast_value.target_time <= '2022-10-26'
  AND forecast_value.created_utc >= '2022-10-25'
  AND forecast_value.target_time - forecast_value.created_utc >= interval '240 minute'
  AND forecast_value.created_utc < '2022-10-26 13:01'
  AND forecast.created_utc >= '2022-10-25'
ORDER BY forecast_value.target_time, forecast_value.created_utc DESC;
peterdudfield commented 1 year ago

Could also add an index on created_utc:

CREATE INDEX forecast_value_2022_10_created_utc
    ON forecast_value_2022_10 (created_utc DESC);

It took 20 minutes, but maybe it could have been quicker.
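
Since DISTINCT ON picks the newest created_utc per target_time, and the ORDER BY is target_time ASC, created_utc DESC, a composite index matching exactly that sort order might let the planner return rows pre-sorted and skip the sort step entirely. A sketch, not tried in this issue:

-- Matches the ORDER BY (target_time ASC, created_utc DESC), so the
-- planner may satisfy the ordering for DISTINCT ON from the index.
CREATE INDEX CONCURRENTLY forecast_value_2022_10_target_created
    ON forecast_value_2022_10 (target_time, created_utc DESC);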

peterdudfield commented 1 year ago

Other options:

  1. Try changing double precision to real - https://www.postgresql.org/docs/current/datatype-numeric.html - done
  2. Look into CLUSTER on an index (see the example after this list)
  3. Create a table that only stores the last week of data, then read from that - https://github.com/openclimatefix/nowcasting_datamodel/issues/124
  4. CREATE INDEX idx_forecast_created_utc ON forecast(created_utc) (THIS SEEMS TO HELP). An index was also added on forecast_value.created_utc; they both probably help.
  5. Reorder the query to filter on sub-tables first, rather than joining and then filtering (NOT MUCH DIFFERENCE; see the sketch after this list)
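
For option 2, CLUSTER physically rewrites the table in index order, which can cut random I/O for range scans; it takes an exclusive lock on the table while it runs, so it would need a maintenance window. Example only, using the created_utc index created above:

CLUSTER forecast_value_2022_10 USING forecast_value_2022_10_created_utc;

For option 5, one possible reading of the filter-first rewrite is to resolve the location and forecast filters in a subquery before touching forecast_value; the values below reuse the gsp_id = 105 example from the first query:

SELECT DISTINCT ON (fv.target_time)
    fv.created_utc,
    fv.uuid,
    fv.target_time,
    fv.expected_power_generation_megawatts,
    fv.forecast_id
FROM forecast_value fv
JOIN (
    -- Resolve the small forecast/location side first.
    SELECT forecast.id
    FROM forecast
    JOIN location ON location.id = forecast.location_id
    WHERE location.gsp_id = 105
      AND forecast.created_utc >= '2022-10-30'
) recent_forecasts ON recent_forecasts.id = fv.forecast_id
WHERE fv.target_time >= '2022-10-30'
  AND fv.created_utc >= '2022-10-29'
  AND fv.created_utc < '2022-10-31 13:01'
  AND fv.target_time - fv.created_utc >= interval '240 minute'
ORDER BY fv.target_time, fv.created_utc DESC;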
peterdudfield commented 1 year ago

Would be good to benchmark against production to check the changes are having the desired effect.

peterdudfield commented 1 year ago

Option 3 above should mean there is at most 7 days of data to read. This should make it fast to load. Let's review in 1 week's time.
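
A minimal sketch of how option 3 might look, with hypothetical names (the real design is tracked in https://github.com/openclimatefix/nowcasting_datamodel/issues/124): keep a small copy of recent rows and trim it on a schedule, so reads never scan the full month.

-- Hypothetical: a small unpartitioned table with the same shape.
CREATE TABLE forecast_value_last_seven_days
    (LIKE forecast_value INCLUDING ALL);

-- Run on a schedule (e.g. daily) to keep at most 7 days of data.
DELETE FROM forecast_value_last_seven_days
WHERE created_utc < now() - interval '7 days';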