openclimatefix / pv-site-api

Site specific API for PV forecasting

Create desc index #98

Open peterdudfield opened 1 year ago

peterdudfield commented 1 year ago

It might be useful to add an index that is descending in time, because we (often) load the most recent data.

I've tried adding

```sql
create index ix_generation_start_utc_desc
    on generation (start_utc desc);
```

It took 24 seconds to create.
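One thing worth checking here (a generic sketch using the table above): Postgres can walk an ordinary ascending b-tree backwards, so a plain index on `start_utc` may already serve `ORDER BY start_utc DESC`, and the DESC index may not change anything.

```sql
-- Compare plans before and after creating the DESC index.
-- If the plan already shows "Index Scan Backward" using an
-- ascending index on start_utc, the DESC index is redundant.
EXPLAIN ANALYZE
SELECT *
FROM generation
ORDER BY start_utc DESC
LIMIT 100;
```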

peterdudfield commented 1 year ago

The above didn't really help.

I also tried

```sql
create index ix_forecasts_timestamp_utc_desc
    on forecasts (timestamp_utc desc);
```

but it didn't really help either.

peterdudfield commented 1 year ago

Trying

```sql
create index ix_forecast_values_start_utc_horizon_minutes
    on forecast_values (start_utc, horizon_minutes);
```

It took 1.26 minutes to create.

peterdudfield commented 1 year ago

Want to try clustering on horizon_minutes:

```sql
create index ix_forecast_values_horizon_minutes
    on forecast_values (horizon_minutes);

ALTER TABLE forecast_values
    CLUSTER ON ix_forecast_values_horizon_minutes;
```

This took 1.11 seconds to run.

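Worth noting, and it may explain why this only took 1.11 seconds: `ALTER TABLE ... CLUSTER ON` only records which index to cluster on; the table is physically reordered only when `CLUSTER` itself is run, and the ordering is not maintained for rows inserted afterwards. A sketch of the actual rewrite:

```sql
-- Physically rewrite the table in index order. This takes an
-- ACCESS EXCLUSIVE lock, so the table is unavailable while it runs,
-- and it must be re-run periodically as new rows arrive unordered.
CLUSTER forecast_values USING ix_forecast_values_horizon_minutes;

-- Refresh planner statistics after the rewrite.
ANALYZE forecast_values;
```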
simlmx commented 1 year ago

Can you share the query you are trying to optimize?

peterdudfield commented 1 year ago

The past forecasts one.

peterdudfield commented 1 year ago

I thought we could partition the table into horizon_minutes=0 and horizon_minutes>0
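A lighter-weight alternative that may be worth comparing before partitioning (a sketch; the index name is hypothetical) is a partial index covering only the `horizon_minutes = 0` rows:

```sql
-- Partial index: much smaller than a full index, and the planner
-- will use it for queries whose WHERE clause implies
-- horizon_minutes = 0 (as the query below does).
CREATE INDEX ix_forecast_values_h0_start_utc
    ON forecast_values (start_utc)
    WHERE horizon_minutes = 0;
```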

peterdudfield commented 1 year ago
```sql
SELECT DISTINCT ON (forecasts.site_uuid, forecasts.timestamp_utc) *
FROM forecasts
JOIN forecast_values ON forecasts.forecast_uuid = forecast_values.forecast_uuid
WHERE forecasts.site_uuid IN ('04d67e7f-7bcf-4b55-8ced-14d311e0a5f5',
                              '0bd3dc11-8160-46a0-9ff7-36e125750aeb')
  AND forecasts.timestamp_utc >= '2023-06-15'
  AND forecasts.timestamp_utc < '2023-06-16 15:45'
  AND forecast_values.horizon_minutes = 0
  AND forecast_values.start_utc >= '2023-06-15'
  AND forecast_values.start_utc < '2023-06-16 15:45'
ORDER BY forecasts.site_uuid, forecasts.timestamp_utc;
```
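For this query shape, `DISTINCT ON (site_uuid, timestamp_utc) ... ORDER BY site_uuid, timestamp_utc` can be served by a matching composite index on `forecasts`, letting the planner avoid a sort (a sketch; the index name is hypothetical, and an existing index may already cover this):

```sql
-- Composite index matching the DISTINCT ON / ORDER BY columns,
-- so the distinct-on-first-row-per-group pattern can use an
-- ordered index scan instead of sorting.
CREATE INDEX ix_forecasts_site_uuid_timestamp_utc
    ON forecasts (site_uuid, timestamp_utc);
```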
simlmx commented 1 year ago

You mean the 2 queries called when this is called? https://github.com/openclimatefix/pv-site-api/blob/411277ea1585bb4a41982f363561afece86ee6d1/pv_site_api/_db_helpers.py#L138

If so, I guess the query on past forecasts is the slow one? It's worth noting that the data returned by this query never changes (because it's in the past). As time goes by, the only thing that changes is the latest forecast (which is faster to query; that's the second of our two queries).

My point is that instead of calling both queries systematically, we could call the first query once (slow, but it can be done async, and it's typically not what the user is looking for) and then update it with the latest forecasts (which are fast) as they become available. We could even keep the data in the browser cache so that if they refresh it's already there. And if they close their window and come back 30 minutes later, we could just query the missing bit, which would be faster.

I'm not sure it's an easy problem to fix otherwise.

peterdudfield commented 1 year ago

I agree, they might not want that immediately.

I'm going to try some partitioning, and see if it helps

simlmx commented 1 year ago

Cool! It could be a good idea to make a separate toy database filled with a lot of (random) data and do tests there, to remove possible external factors, and probably iterate faster.

peterdudfield commented 1 year ago

Partitioning the table could be done like this:

```sql
-- drop table forecast_values_new2;

CREATE TABLE forecast_values_temp
    (like forecast_values including all);

alter table forecast_values_temp
    drop constraint forecast_values_temp_pkey,
    add primary key (forecast_value_uuid, horizon_minutes);

create table forecast_values_new
    (like forecast_values_temp including all)
partition by range (horizon_minutes);

CREATE TABLE forecast_values_0 PARTITION OF forecast_values_new
    FOR VALUES FROM (0) TO (1);

CREATE TABLE forecast_values_g0 PARTITION OF forecast_values_new
    FOR VALUES FROM (1) TO (10000000);

-- This can take some time > 10
INSERT INTO forecast_values_new
SELECT * FROM forecast_values;

ALTER TABLE forecast_values RENAME TO forecast_values_old;
ALTER TABLE forecast_values_new RENAME TO forecast_values;
```
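Once the partitioned table is in place, it's worth verifying that partition pruning actually kicks in for the query shape above (a sketch):

```sql
-- The plan should show a scan on the forecast_values_0 partition
-- only, with forecast_values_g0 pruned away, when the query
-- filters on horizon_minutes = 0.
EXPLAIN
SELECT *
FROM forecast_values
WHERE horizon_minutes = 0
  AND start_utc >= '2023-06-15';
```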

And then reflect it in the datamodel.

I want to try

```sql
alter table forecast_values
    drop constraint forecast_values_pkey,
    add primary key (forecast_value_uuid, horizon_minutes);
```

first.

simlmx commented 1 year ago

It sounds dangerous to try those in the Dev/Prod databases. You could end up in a state that alembic doesn't recognize by mistake, no?

peterdudfield commented 1 year ago
```sql
alter table forecast_values
    drop constraint forecast_values_pkey,
    add primary key (forecast_value_uuid, horizon_minutes);
```

This didn't work, I'll roll it back.

I'm doing it in 'dev', and the above means I can always roll back, never deleting anything.

peterdudfield commented 1 year ago

I made a partition table and did the renaming; I'll have to leave it over the weekend to see whether it speeds things up or not.

peterdudfield commented 10 months ago

I rolled the partition back.