uhh-lt / dats

Discourse Analysis Tool Suite
Apache License 2.0
17 stars 2 forks source link

timeline analysis SQL query is dead slow #470

Open fynnos opened 2 days ago

fynnos commented 2 days ago

the following query (generated by SQLAlchemy from python code) is dead slow for projects with large amounts of documents (and spans), like 10k+ documents. For 50k+ it never finishes and just burns Postgres CPU.

SELECT array_remove(array_agg(distinct(sourcedocument.id)), NULL) AS sdoc_ids, EXTRACT(year FROM anon_2.date) AS anon_1 
 FROM sourcedocument JOIN (SELECT sourcedocument.id AS id, sourcedocumentmetadata_1.date_value AS date, array_remove(array_agg(distinct(code.id)), NULL) AS "TA_CODE_ID_LIST" 
 FROM sourcedocument JOIN sourcedocumentmetadata AS sourcedocumentmetadata_1 ON sourcedocument.id = sourcedocumentmetadata_1.source_document_id AND sourcedocumentmetadata_1.project_metadata_id = 2234 AND sourcedocumentmetadata_1.date_value IS NOT NULL LEFT OUTER JOIN annotationdocument ON sourcedocument.id = annotationdocument.source_document_id, spanannotation LEFT OUTER JOIN code ON code.id = spanannotation.code_id 
 WHERE sourcedocument.project_id = 87 GROUP BY sourcedocument.id, sourcedocumentmetadata_1.date_value) AS anon_2 ON sourcedocument.id = anon_2.id 
 WHERE anon_2."TA_CODE_ID_LIST" @> ARRAY[9621]::INTEGER[] GROUP BY EXTRACT(year FROM anon_2.date) ORDER BY sdoc_ids DESC