Speed up read function - Githubissues

peterdudfield commented 1 year ago

Currently pulling one forecast takes 30 seconds on API call - http://pvsite-development-api-sites.eu-west-1.elasticbeanstalk.com/docs#/default/get_pv_forecast_sites_pv_forecast__site_uuid__get

SQL is

SELECT DISTINCT ON (forecasts.site_uuid, datetime_intervals.start_utc) *
FROM forecast_values 
JOIN datetime_intervals ON datetime_intervals.datetime_interval_uuid = forecast_values.datetime_interval_uuid 
JOIN forecasts ON forecasts.forecast_uuid = forecast_values.forecast_uuid
WHERE datetime_intervals.start_utc >= '2023-02-02'
AND forecast_values.created_utc >= '2023-02-02'
AND forecasts.site_uuid IN ('725a8670-d012-474d-b901-1179f43e7182')
order by forecasts.site_uuid, datetime_intervals.start_utc, forecast_values.created_utc desc

which only takes ~1 seconds on pgAdmin

The slowness might be due to swagger docs

peterdudfield commented 1 year ago

Ive added some logging in api 0.0.21 to see what is taking a long time

simlmx commented 1 year ago

Running your query directly in psql takes about 500ms, which is still awfully slow. This is not surprising given that no indexes have been defined on the database. This is the EXPLAIN ANALYZE output:

                                                                            QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=16917.77..20807.38 rows=32695 width=146) (actual time=304.545..333.378 rows=9332 loops=1)
   ->  Gather Merge  (cost=16917.77..20725.64 rows=32695 width=146) (actual time=304.544..327.944 rows=38976 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=15917.74..15951.80 rows=13623 width=146) (actual time=284.618..287.307 rows=12992 loops=3)
               Sort Key: datetime_intervals.start_utc, forecast_values.created_utc DESC
               Sort Method: quicksort  Memory: 3886kB
               Worker 0:  Sort Method: external merge  Disk: 2440kB
               Worker 1:  Sort Method: quicksort  Memory: 3010kB
               ->  Hash Join  (cost=518.44..14982.27 rows=13623 width=146) (actual time=33.265..258.976 rows=12992 loops=3)
                     Hash Cond: (forecast_values.datetime_interval_uuid = datetime_intervals.datetime_interval_uuid)
                     ->  Hash Join  (cost=97.62..14515.31 rows=17575 width=106) (actual time=17.753..228.061 rows=12992 loops=3)
                           Hash Cond: (forecast_values.forecast_uuid = forecasts.forecast_uuid)
                           ->  Parallel Seq Scan on forecast_values  (cost=0.00..13633.72 rows=298355 width=60) (actual time=13.703..134.736 rows=221755 loops=3)
                                 Filter: (created_utc >= '2023-02-02 00:00:00'::timestamp without time zone)
                                 Rows Removed by Filter: 30528
                           ->  Hash  (cost=94.47..94.47 rows=252 width=46) (actual time=2.076..2.077 rows=231 loops=3)
                                 Buckets: 1024  Batches: 1  Memory Usage: 26kB
                                 ->  Seq Scan on forecasts  (cost=0.00..94.47 rows=252 width=46) (actual time=0.027..1.879 rows=231 loops=3)
                                       Filter: (site_uuid = '725a8670-d012-474d-b901-1179f43e7182'::uuid)
                                       Rows Removed by Filter: 3712
                     ->  Hash  (cost=287.27..287.27 rows=10683 width=40) (actual time=12.546..12.547 rows=10668 loops=3)
                           Buckets: 16384  Batches: 1  Memory Usage: 879kB
                           ->  Seq Scan on datetime_intervals  (cost=0.00..287.27 rows=10683 width=40) (actual time=0.872..4.645 rows=10668 loops=3)
                                 Filter: (start_utc >= '2023-02-02 00:00:00'::timestamp without time zone)
                                 Rows Removed by Filter: 3017
 Planning Time: 0.348 ms
 Execution Time: 334.118 ms

I'm not a pro at interpreting those but in general "Seq scan" are bad. Adding indexes on these should fix that:

forecast_values.created_utc
datetime_intervals.start_utc
forecasts.site_uuid

Those basically correspond to the WHERE clause.

Indexes are probably needed on most of the ForeignKeys.

simlmx commented 1 year ago

Ah I just realized I misread your post! Sorry! The index stuff is still valid though I think, but won't fix your 30 seconds issue!

peterdudfield commented 1 year ago

Once all the datetimes are in 5 mins blocks, this will speed up

openclimatefix / pv-site-datamodel

Speed up read function #43