Closed duncandewhurst closed 3 years ago
According to the query plan, almost all the time (87%) is spent in the `counts` CTE, which is expected because a CROSS JOIN on a JSON field is incredibly slow. There are some things to tidy, but I don't think the running time can be reduced.
I also note that even on this 1% sample, that CTE has:
I don't think 100% is a possibility.
In that case, best not to add it to Kingfisher Summarize, as 100% is preferable for smaller collections.
The additional fields query from the data quality feedback notebook is very slow for large collections with many additional fields, e.g. collection `2337` in `view_data_collection_2337_2338` from the Kyrgyz Republic has ~600k releases and >100 additional fields.

As a temporary workaround, I added a `WHERE random() < 0.01` condition to the first CTE to limit the query to a sample of approximately 1% of the dataset and generated a query plan.

Looking at the query plan, the slowest step is a parallel seq scan on `release_check`. If I understood correctly, it relates to https://github.com/open-contracting/kingfisher-summarize/blob/8681551c8418aee4c34e6ee9dbbb757027f062bb/sql/middle/release_summary.sql#L67

However, all of the columns in the JOIN are already in an index, so I'm unsure how to optimise this further.
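For reference, the sampling workaround can be sketched on a toy table. This is only an illustration under assumptions: the table `release_sample_demo` and its contents are hypothetical stand-ins, not the Kingfisher schema or the notebook's actual query. It shows the `random() < 0.01` filter feeding a CROSS JOIN over the JSON keys, plus PostgreSQL's built-in `TABLESAMPLE` clause as a possible alternative when the first CTE reads from a single table.

```sql
-- Hypothetical toy table standing in for the real release data.
CREATE TEMP TABLE release_sample_demo (
    id serial PRIMARY KEY,
    data jsonb
);

-- 100,000 fake releases, each with an "ocid" and one of five "extra_N" fields.
INSERT INTO release_sample_demo (data)
SELECT jsonb_build_object(
    'ocid', 'ocds-' || n::text,
    'extra_' || (n % 5)::text, n
)
FROM generate_series(1, 100000) AS n;

-- Row-level filter, as in the workaround: each row is kept with
-- probability 0.01, but the scan still reads every row.
WITH sampled AS (
    SELECT data
    FROM release_sample_demo
    WHERE random() < 0.01
)
SELECT field_name, count(*) AS sampled_releases
FROM sampled
CROSS JOIN jsonb_object_keys(data) AS field_name
GROUP BY field_name
ORDER BY field_name;

-- Possible alternative: TABLESAMPLE BERNOULLI (1) also keeps about 1% of rows.
-- It applies to tables only, so it may not fit a CTE that joins several tables.
SELECT field_name, count(*) AS sampled_releases
FROM release_sample_demo TABLESAMPLE BERNOULLI (1)
CROSS JOIN jsonb_object_keys(data) AS field_name
GROUP BY field_name
ORDER BY field_name;
```

Counts from either query are estimates for roughly 1% of rows and vary between runs, so they would need scaling (and some caution for rare fields) before comparing against full-collection figures.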
Although they don't look too costly in the query plan, presumably the CROSS JOINs in the subsequent CTEs would also be costly when the query is run against the whole dataset.
Would it be possible to add an additional fields table to Kingfisher Summarize based on this query? Its output looks like this:
cc @odscrachel