New table: Additional fields

duncandewhurst commented 3 years ago

The additional fields query from the data quality feedback notebook is very slow for large collections with many additional fields, e.g. collection 2337 in view_data_collection_2337_2338 from the Kyrgyz Republic has ~600k releases and >100 additional fields.

As I temporary workaround, I added a WHERE random() < 0.01 condition to the first CTE to limit the query to a sample approx ~1% of the dataset and generated a query plan.

Looking at the query plan, the slowest step is a parallel seq scan on release_check. If I understood correctly, it relates to https://github.com/open-contracting/kingfisher-summarize/blob/8681551c8418aee4c34e6ee9dbbb757027f062bb/sql/middle/release_summary.sql#L67

However, all of the columns in the JOIN are already in an index so I'm unsure how to optimise this further.

                                     Table "public.release_check"
         Column          |  Type   | Collation | Nullable |                  Default                  
-------------------------+---------+-----------+----------+-------------------------------------------
 id                      | integer |           | not null | nextval('release_check_id_seq'::regclass)
 release_id              | integer |           | not null | 
 override_schema_version | text    |           | not null | 
 cove_output             | jsonb   |           | not null | 
Indexes:
    "release_check_pkey" PRIMARY KEY, btree (id)
    "release_check_release_id_idx" btree (release_id)
    "unique_release_check_release_id_and_more" UNIQUE CONSTRAINT, btree (release_id, override_schema_version)
Foreign-key constraints:
    "fk_release_check_release_id" FOREIGN KEY (release_id) REFERENCES release(id)

Although they don't look too costly in the query plan, presumably the CROSS JOINs in the subsequent CTEs would also be costly when the query is run against the whole dataset.

Would it be possible to add an additional fields table to Kingfisher Summarize based on this query? Its output looks like this:

collection_id	release_type	schema_version	path	field	count	examples
2337	release	https://standard.open-contracting.org/1.1/en/release-package-schema.json		links	6260
2337	release	https://standard.open-contracting.org/1.1/en/release-package-schema.json	/links	next	6260	["http://ocds.zakupki.gov.kg/api/tendering?offset=10194"]
2337	release	https://standard.open-contracting.org/1.1/en/release-package-schema.json	/links	prev	6260	["http://ocds.zakupki.gov.kg/api/tendering?offset=97299"]
2337	release	https://standard.open-contracting.org/1.1/en/release-package-schema.json	/releases	complaints	86
2337	release	https://standard.open-contracting.org/1.1/en/release-package-schema.json	/releases/awards/value	empty	20073	[false, false]
2337	release	https://standard.open-contracting.org/1.1/en/release-package-schema.json	/releases/bids/details	priceProposal	12248
2337	release	https://standard.open-contracting.org/1.1/en/release-package-schema.json	/releases/bids/details/priceProposal	id	68239	["126755023", "126755022", "126755021"]
2337	release	https://standard.open-contracting.org/1.1/en/release-package-schema.json	/releases/bids/details/priceProposal	relatedItem	68239	["245574450", "245574450", "245574450"]
2337	release	https://standard.open-contracting.org/1.1/en/release-package-schema.json	/releases/bids/details/priceProposal	relatedLot	68239	["175765676", "175765676", "175765676"]

cc @odscrachel

jpmckinney commented 3 years ago

According to the query plan, almost all the time (87%) is spent in the counts CTE - which is expected because a CROSS JOIN on a JSON field is incredibly slow. There are some things to tidy, but I don't think time can be diminished.

I also note that even on this 1% sample, that CTE has:

read 41 GB from disk
found 94 GB in the cache
read/written 31 GB of temporary data

I don't think 100% is a possibility.

duncandewhurst commented 3 years ago

In that case, best not to add it to Kingfisher Summarize, as 100% is preferable for smaller collections.

open-contracting / kingfisher-summarize

New table: Additional fields #213