open-contracting / kingfisher-summarize

Creates SQL tables that summarize the OCDS data in collections from Kingfisher Process
https://kingfisher-summarize.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
3 stars 8 forks source link

New table: Additional fields #213

Closed duncandewhurst closed 3 years ago

duncandewhurst commented 3 years ago

The additional fields query from the data quality feedback notebook is very slow for large collections with many additional fields, e.g. collection 2337 in view_data_collection_2337_2338 from the Kyrgyz Republic has ~600k releases and >100 additional fields.

As I temporary workaround, I added a WHERE random() < 0.01 condition to the first CTE to limit the query to a sample approx ~1% of the dataset and generated a query plan.

Looking at the query plan, the slowest step is a parallel seq scan on release_check. If I understood correctly, it relates to https://github.com/open-contracting/kingfisher-summarize/blob/8681551c8418aee4c34e6ee9dbbb757027f062bb/sql/middle/release_summary.sql#L67

However, all of the columns in the JOIN are already in an index so I'm unsure how to optimise this further.

                                     Table "public.release_check"
         Column          |  Type   | Collation | Nullable |                  Default                  
-------------------------+---------+-----------+----------+-------------------------------------------
 id                      | integer |           | not null | nextval('release_check_id_seq'::regclass)
 release_id              | integer |           | not null | 
 override_schema_version | text    |           | not null | 
 cove_output             | jsonb   |           | not null | 
Indexes:
    "release_check_pkey" PRIMARY KEY, btree (id)
    "release_check_release_id_idx" btree (release_id)
    "unique_release_check_release_id_and_more" UNIQUE CONSTRAINT, btree (release_id, override_schema_version)
Foreign-key constraints:
    "fk_release_check_release_id" FOREIGN KEY (release_id) REFERENCES release(id)

Although they don't look too costly in the query plan, presumably the CROSS JOINs in the subsequent CTEs would also be costly when the query is run against the whole dataset.

Would it be possible to add an additional fields table to Kingfisher Summarize based on this query? Its output looks like this:

collection_id release_type schema_version path field count examples
2337 release https://standard.open-contracting.org/1.1/en/release-package-schema.json   links 6260  
2337 release https://standard.open-contracting.org/1.1/en/release-package-schema.json /links next 6260 ["http://ocds.zakupki.gov.kg/api/tendering?offset=10194"]
2337 release https://standard.open-contracting.org/1.1/en/release-package-schema.json /links prev 6260 ["http://ocds.zakupki.gov.kg/api/tendering?offset=97299"]
2337 release https://standard.open-contracting.org/1.1/en/release-package-schema.json /releases complaints 86  
2337 release https://standard.open-contracting.org/1.1/en/release-package-schema.json /releases/awards/value empty 20073 [false, false]
2337 release https://standard.open-contracting.org/1.1/en/release-package-schema.json /releases/bids/details priceProposal 12248  
2337 release https://standard.open-contracting.org/1.1/en/release-package-schema.json /releases/bids/details/priceProposal id 68239 ["126755023", "126755022", "126755021"]
2337 release https://standard.open-contracting.org/1.1/en/release-package-schema.json /releases/bids/details/priceProposal relatedItem 68239 ["245574450", "245574450", "245574450"]
2337 release https://standard.open-contracting.org/1.1/en/release-package-schema.json /releases/bids/details/priceProposal relatedLot 68239 ["175765676", "175765676", "175765676"]

cc @odscrachel

jpmckinney commented 3 years ago

According to the query plan, almost all the time (87%) is spent in the counts CTE - which is expected because a CROSS JOIN on a JSON field is incredibly slow. There are some things to tidy, but I don't think time can be diminished.

I also note that even on this 1% sample, that CTE has:

I don't think 100% is a possibility.

duncandewhurst commented 3 years ago

In that case, best not to add it to Kingfisher Summarize, as 100% is preferable for smaller collections.