open-contracting / kingfisher-summarize

Creates SQL tables that summarize the OCDS data in collections from Kingfisher Process
https://kingfisher-summarize.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

002 is very slow because it scans tables #92

Closed jpmckinney closed 4 years ago

jpmckinney commented 4 years ago

TLDR: Don't use a select id from selected_collections subquery. Instead, get that result in one query, and substitute it into the big query.

SET search_path = view_data_zambia_may_1260_1262, public;
EXPLAIN
select 
    r.id::bigint * 10 AS id,
    'release' AS release_type,
    r.id AS table_id,
    collection_id,
    ocid,
    release_id,
    data_id,
    package_data_id,
    coalesce(pd.data ->> 'version', '1.0') AS package_version,
    -- convert_to_timestamp(d.data ->> 'date') release_date,
    d.data ->> 'date' release_date,
    d.data -> 'tag' release_tag,
    d.data ->> 'language' release_language
from 
    release AS r
join
    package_data pd on pd.id = r.package_data_id
join
    data d on d.id = r.data_id
join
    collection c on c.id = r.collection_id
where
    collection_id in (select id from selected_collections);

Produces:

                                                        QUERY PLAN                                                        
--------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=18280108.61..57993583.62 rows=207953552 width=206)
   Workers Planned: 4
   ->  Hash Join  (cost=18279108.61..37197228.42 rows=51988388 width=206)
         Hash Cond: (c.id = selected_collections.id)
         ->  Hash Join  (cost=18279041.24..35760755.01 rows=51988388 width=1320)
               Hash Cond: (r.collection_id = c.id)
               ->  Parallel Hash Join  (cost=18278989.01..35623378.94 rows=51988388 width=1316)
                     Hash Cond: (r.data_id = d.id)
                     ->  Parallel Hash Left Join  (cost=2870294.06..9217201.49 rows=51988388 width=623)
                           Hash Cond: (r.package_data_id = pd.id)
                           ->  Parallel Seq Scan on release r  (cost=0.00..4594240.88 rows=51988388 width=62)
                           ->  Parallel Hash  (cost=2284679.25..2284679.25 rows=6908625 width=565)
                                 ->  Parallel Seq Scan on package_data pd  (cost=0.00..2284679.25 rows=6908625 width=565)
                     ->  Parallel Hash  (cost=12401589.64..12401589.64 rows=29665464 width=697)
                           ->  Parallel Seq Scan on data d  (cost=0.00..12401589.64 rows=29665464 width=697)
               ->  Hash  (cost=43.77..43.77 rows=677 width=4)
                     ->  Seq Scan on collection c  (cost=0.00..43.77 rows=677 width=4)
         ->  Hash  (cost=35.50..35.50 rows=2550 width=4)
               ->  Seq Scan on selected_collections  (cost=0.00..35.50 rows=2550 width=4)
(19 rows)

However, using explicit values:

EXPLAIN
select 
    r.id::bigint * 10 AS id,
    'release' AS release_type,
    r.id AS table_id,
    collection_id,
    ocid,
    release_id,
    data_id,
    package_data_id,
    coalesce(pd.data ->> 'version', '1.0') AS package_version,
    convert_to_timestamp(d.data ->> 'date') release_date,
    d.data -> 'tag' release_tag,
    d.data ->> 'language' release_language
from 
    release AS r
left join
    package_data pd on pd.id = r.package_data_id
join
    data d on d.id = r.data_id
join
    collection c on c.id = r.collection_id
where
    collection_id in (534);

Produces:

                                                               QUERY PLAN                                                               
----------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=1001.85..4084550.32 rows=540679 width=206)
   ->  Index Only Scan using collection_pkey on collection c  (cost=0.28..8.29 rows=1 width=4)
         Index Cond: (id = 534)
   ->  Gather  (cost=1001.58..4065618.26 rows=540679 width=1316)
         Workers Planned: 4
         ->  Nested Loop  (cost=1.58..4010550.36 rows=135170 width=1316)
               ->  Nested Loop Left Join  (cost=1.01..2875396.96 rows=135170 width=623)
                     ->  Parallel Index Scan using release_collection_id_idx on release r  (cost=0.57..1762299.66 rows=135170 width=62)
                           Index Cond: (collection_id = 534)
                     ->  Index Scan using package_data_pkey on package_data pd  (cost=0.44..8.23 rows=1 width=565)
                           Index Cond: (id = r.package_data_id)
               ->  Index Scan using data_pkey on data d  (cost=0.57..8.40 rows=1 width=697)
                     Index Cond: (id = r.data_id)
(13 rows)

In a recent run, 97% of the time was spent doing these table scans:

[I 200303 05:03:55 refresh_views:50] running script: 001-functions
[I 200303 05:03:55 refresh_views:59] running time: 0.01604395406320691s
[I 200303 05:03:55 refresh_views:50] running script: 002-tmp_release_and_parties
[I 200303 05:40:45 refresh_views:59] running time: 2210.547520673019s
[I 200303 05:40:45 refresh_views:50] running script: 003-buyer-procuringentity-tenderer
[I 200303 05:40:48 refresh_views:59] running time: 2.7189306360669434s
[I 200303 05:40:48 refresh_views:50] running script: 004-planning
[I 200303 05:40:49 refresh_views:59] running time: 0.7118882929207757s
[I 200303 05:40:49 refresh_views:50] running script: 005-tender
[I 200303 05:40:54 refresh_views:59] running time: 5.3206335889408365s
[I 200303 05:40:54 refresh_views:50] running script: 006-awards
[I 200303 05:41:10 refresh_views:59] running time: 15.346484966925345s
[I 200303 05:41:10 refresh_views:50] running script: 007-contracts
[I 200303 05:41:32 refresh_views:59] running time: 22.279831449035555s
[I 200303 05:41:32 refresh_views:50] running script: 008-release
[I 200303 05:41:45 refresh_views:59] running time: 13.291127728065476s
[I 200303 05:41:45 refresh_views:61] total running time: 2270.234646508936s

Resource: https://thoughtbot.com/blog/reading-an-explain-analyze-query-plan

kindly commented 4 years ago

I am not sure the problem is really the select id from selected_collections subquery: it just equates to another join, which should only slow things down marginally.

I think it is due to the query planner getting things very wrong.

Having looked into this for a while, I ran ANALYZE on the database. I also set random_page_cost to 2.0 for the database, which is the advice given for SSDs.
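Concretely, that tuning amounts to two statements; this is a sketch, where the database name is a placeholder (random_page_cost can also be set per session or in postgresql.conf):

```sql
-- Re-collect planner statistics so row estimates are accurate.
ANALYZE;

-- Lower the planner's random-read cost estimate from the default of 4.0;
-- 2.0 is a common recommendation for SSD-backed storage.
-- ('kingfisher' is a placeholder database name.)
ALTER DATABASE kingfisher SET random_page_cost = 2.0;
```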

With the above changes, the query plan is now:

 Nested Loop  (cost=1.85..3879803.29 rows=303894 width=207)
   ->  Nested Loop Left Join  (cost=1.28..2512221.68 rows=303894 width=638)
         ->  Nested Loop  (cost=0.85..1192288.65 rows=303894 width=63)
               ->  Nested Loop  (cost=0.28..5.31 rows=1 width=8)
                     ->  Seq Scan on selected_collections  (cost=0.00..1.01 rows=1 width=4)
                     ->  Index Only Scan using collection_pkey on collection c  (cost=0.28..4.29 rows=1 width=4)
                           Index Cond: (id = selected_collections.id)
               ->  Index Scan using release_collection_id_idx on release r  (cost=0.57..1185010.83 rows=727251 width=63)
                     Index Cond: (collection_id = c.id)
         ->  Index Scan using package_data_pkey on package_data pd  (cost=0.44..4.34 rows=1 width=579)
               Index Cond: (id = r.package_data_id)
   ->  Index Scan using data_pkey on data d  (cost=0.57..4.48 rows=1 width=696)
         Index Cond: (id = r.data_id)
(13 rows)

The only seq scan is on selected_collections, which is fine as it only has 1 row in it!

The run is now:

(.ve) ocdskfp@process1:~/ocdskingfisherviews$ python ocdskingfisher-views-cli refresh-views collection_534 
[I 200316 13:01:50 refresh_views:39] running script: 001-functions
[I 200316 13:01:50 refresh_views:47] running time: 0.023357978090643883s
[I 200316 13:01:50 refresh_views:39] running script: 002-tmp_release_and_parties
[I 200316 13:02:13 refresh_views:47] running time: 23.020595516078174s
[I 200316 13:02:13 refresh_views:39] running script: 003-buyer-procuringentity-tenderer
[I 200316 13:02:16 refresh_views:47] running time: 2.988911972846836s
[I 200316 13:02:16 refresh_views:39] running script: 004-planning
[I 200316 13:02:17 refresh_views:47] running time: 0.7676320909522474s
[I 200316 13:02:17 refresh_views:39] running script: 005-tender
[I 200316 13:02:22 refresh_views:47] running time: 5.147628736216575s
[I 200316 13:02:22 refresh_views:39] running script: 006-awards
[I 200316 13:02:37 refresh_views:47] running time: 14.928037466015667s
[I 200316 13:02:37 refresh_views:39] running script: 007-contracts
[I 200316 13:02:59 refresh_views:47] running time: 21.61525326874107s
[I 200316 13:02:59 refresh_views:39] running script: 008-release
[I 200316 13:03:12 refresh_views:47] running time: 13.56462255306542s
[I 200316 13:03:12 refresh_views:49] total running time: 82.05779977981001s

Which is much better.

My reservation with hard-coding the collection_id into the queries is that it would require some kind of templating or PL/pgSQL, making the scripts more complicated.
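For illustration, a minimal PL/pgSQL sketch of what that inlining could look like (the tmp_release table name is hypothetical, and the real scripts would need this pattern wrapped around every query that filters on collection_id):

```sql
-- Hypothetical sketch: build the id list once, then splice it into the
-- query text so the planner sees literal values instead of a subquery.
DO $$
DECLARE
    ids text;
BEGIN
    SELECT string_agg(id::text, ', ') INTO ids FROM selected_collections;
    EXECUTE format(
        'CREATE TABLE tmp_release AS
         SELECT * FROM release WHERE collection_id IN (%s)', ids);
END
$$;
```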

The above changes should also make other queries faster.

jpmckinney commented 4 years ago

Let’s do your change, since it’s suggested in PostgreSQL docs.

I knew it was a planning issue, but didn’t know which variables could be safely changed without worsening performance elsewhere.

jpmckinney commented 4 years ago

If you’ve already made the change, please add it to Salt.

Also, in what scenarios is it appropriate to run ANALYZE on the database?

jpmckinney commented 4 years ago

We should also add a note in the docs about this tuning for SSDs.

kindly commented 4 years ago

Added it to Salt.

ANALYZE is run by autovacuum, but since the selected_collections table is so small, it does not appear to be analyzed, as autovacuum skips small tables.

I was thinking of running ANALYZE selected_collections in the views SQL, but that seemed like an odd thing to do.
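If we did go that route, it would be a single statement near the top of the script, before the big queries run, so the planner knows the table holds only a handful of rows:

```sql
-- Refresh statistics for just this one table.
ANALYZE selected_collections;
```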

Will add it to docs before this is closed.

robredpath commented 4 years ago

Improve performance of 002 (kf-views)

robredpath commented 4 years ago

Salt work is in https://github.com/open-contracting/deploy/commit/54be5983f060b4ebffcd823bcee12292c904569d .

My reading of the conversation above is that the only remaining TODOs are:

Incidentally, I had a look at the random_page_cost stuff while looking at this issue, and I think we could set it even lower in the future; the difference between sequential and random reads for NVMe is even smaller than for a regular SSD.

jpmckinney commented 4 years ago

Re: docs, let's instead add a comment to the Salt state to explain how we arrived at that page cost, with a link to https://www.postgresql.org/docs/current/runtime-config-query.html#RUNTIME-CONFIG-QUERY-CONSTANTS and any other relevant links.

Performance reported in https://github.com/open-contracting/kingfisher-views/issues/92#issuecomment-599508625 is much improved. All good in that respect.

robredpath commented 4 years ago

@jpmckinney will add a quick update to the Views docs to suggest that anyone running their own Views consider this if they encounter similar issues.

jpmckinney commented 4 years ago

Done!

jpmckinney commented 4 years ago

Thanks to #93, we have timing logs, and I see that we have bad performance again from this query, even though random_page_cost is 2 (try with the schema in the issue description, or with dominican_republic).

I ran SET random_page_cost = 1; (the minimum value that makes sense) before the query, and I still get a bad plan. So, I think we need to use another option.