trustification / trustify


slow query during cve ingestion #669

Open JimFuller-RedHat opened 1 month ago

JimFuller-RedHat commented 1 month ago

During some trustify processes (CVE ingestion, reads of the REST API) we are hitting an unusually slow query many times (1155 times).

| 1155 | 295.72740518008635 | SELECT COUNT(*) AS num_items FROM (SELECT "qualified_purl"."id", "qualified_purl"."versioned_purl_id", "qualified_purl"."qualifiers" FROM "qualified_purl" LEFT JOIN "versioned_purl" ON "qualified_purl"."versioned_purl_id" = "versioned_purl"."id" WHERE "versioned_purl"."base_purl_id" IN (SELECT "base_purl"."id" FROM "base_purl")) AS "sub_query" |

It is deceptively simple, with a rather more complicated query plan than expected.

SELECT COUNT(*) AS num_items FROM (SELECT "qualified_purl"."id", "qualified_purl"."versioned_purl_id", "qualified_purl"."qualifiers" FROM "qualified_purl" LEFT JOIN "versioned_purl" ON "qualified_purl"."versioned_purl_id" = "versioned_purl"."id" WHERE "versioned_purl"."base_purl_id" IN (SELECT "base_purl"."id" FROM "base_purl")) AS "sub_query" 

Here is the query plan:

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN                                                                                                                                                                       |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Finalize Aggregate  (cost=14100.77..14100.78 rows=1 width=8) (actual time=88.354..92.882 rows=1 loops=1)                                                                         |
|   ->  Gather  (cost=14100.55..14100.76 rows=2 width=8) (actual time=86.971..92.876 rows=3 loops=1)                                                                               |
|         Workers Planned: 2                                                                                                                                                       |
|         Workers Launched: 2                                                                                                                                                      |
|         ->  Partial Aggregate  (cost=13100.55..13100.56 rows=1 width=8) (actual time=84.596..84.599 rows=1 loops=3)                                                              |
|               ->  Parallel Hash Join  (cost=7801.17..12759.04 rows=136607 width=0) (actual time=33.212..81.760 rows=109286 loops=3)                                              |
|                     Hash Cond: (versioned_purl.base_purl_id = base_purl.id)                                                                                                      |
|                     ->  Parallel Hash Join  (cost=6683.01..11282.26 rows=136607 width=16) (actual time=29.573..62.632 rows=109286 loops=3)                                       |
|                           Hash Cond: (qualified_purl.versioned_purl_id = versioned_purl.id)                                                                                      |
|                           ->  Parallel Index Only Scan using by_pvid on qualified_purl  (cost=0.42..4241.08 rows=136607 width=16) (actual time=0.026..5.534 rows=109286 loops=3) |
|                                 Heap Fetches: 0                                                                                                                                  |
|                           ->  Parallel Hash  (cost=4666.15..4666.15 rows=161315 width=32) (actual time=28.856..28.857 rows=91412 loops=3)                                        |
|                                 Buckets: 524288  Batches: 1  Memory Usage: 21312kB                                                                                               |
|                                 ->  Parallel Seq Scan on versioned_purl  (cost=0.00..4666.15 rows=161315 width=32) (actual time=0.011..7.506 rows=91412 loops=3)                 |
|                     ->  Parallel Hash  (cost=806.44..806.44 rows=24938 width=16) (actual time=3.329..3.329 rows=19950 loops=3)                                                   |
|                           Buckets: 65536  Batches: 1  Memory Usage: 3360kB                                                                                                       |
|                           ->  Parallel Index Only Scan using package_pkey on base_purl  (cost=0.41..806.44 rows=24938 width=16) (actual time=0.020..0.990 rows=19950 loops=3)    |
|                                 Heap Fetches: 0                                                                                                                                  |
| Planning Time: 0.436 ms                                                                                                                                                          |
| Execution Time: 92.927 ms                                                                                                                                                        |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

The parallel seq scan looks suspect, though we probably want to nudge the query planner to not aggregate ... investigating.
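
One way to test that suspicion and nudge the planner for diagnosis (a debugging sketch, not something tried in this thread, and not a production fix) is to disable sequential scans for the current session and compare plans:

-- session-local planner nudge, for diagnosis only
SET enable_seqscan = off;
EXPLAIN ANALYZE
SELECT COUNT(*) AS num_items
FROM (SELECT "qualified_purl"."id", "qualified_purl"."versioned_purl_id", "qualified_purl"."qualifiers"
      FROM "qualified_purl"
      LEFT JOIN "versioned_purl" ON "qualified_purl"."versioned_purl_id" = "versioned_purl"."id"
      WHERE "versioned_purl"."base_purl_id" IN (SELECT "base_purl"."id" FROM "base_purl")) AS "sub_query";
-- restore the default afterwards
RESET enable_seqscan;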

ctron commented 1 month ago

Where is that query coming from? I doubt that number is actually required.

JimFuller-RedHat commented 1 month ago

Yes, currently trying to figure out why a count is issued during ingestion - teasing out at-scale testing versus a single ingestion ...

JimFuller-RedHat commented 1 month ago

This query is significantly faster (10x) if we drop the WHERE condition:

SELECT COUNT(*) AS num_items FROM (SELECT "qualified_purl"."id", "qualified_purl"."versioned_purl_id", "qualified_purl"."qualifiers" FROM "qualified_purl" LEFT JOIN "versioned_purl" ON "qualified_purl"."versioned_purl_id" = "versioned_purl"."id") AS "sub_query"

Can we not make the assumption that everything in versioned_purl has an entry in base_purl?
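
If that assumption holds, i.e. if versioned_purl.base_purl_id is a NOT NULL foreign key into base_purl (an assumption made here for illustration, not confirmed in this thread), the IN (SELECT ...) filter is redundant and the count collapses to a plain inner join:

-- sketch only: equivalent to the original count if base_purl_id is a NOT NULL FK referencing base_purl(id)
SELECT COUNT(*) AS num_items
FROM "qualified_purl"
JOIN "versioned_purl" ON "qualified_purl"."versioned_purl_id" = "versioned_purl"."id";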

JimFuller-RedHat commented 2 weeks ago

Probably invoked during use of the REST API - found the source of this ... it is a classic Postgres count issue buried in the sea_orm paginator:

sea-orm-1.0.0/src/executor/paginator.rs#68
    pub async fn num_items(&self) -> Result<u64, DbErr> {
        let builder = self.db.get_database_backend();
        let stmt = SelectStatement::new()
            .expr(Expr::cust("COUNT(*) AS num_items"))
            .from_subquery(
                self.query
                    .clone()
                    .reset_limit()
                    .reset_offset()
                    .clear_order_by()
                    .to_owned(),
                Alias::new("sub_query"),
            )
            .to_owned();
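        // ... remainder of num_items() elided here; it builds and executes this COUNT(*) statement against the connection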

We could rewrite this SQL differently and/or add some indexes, but that will not result in an order-of-magnitude change in performance, e.g. I tested and got about 20% better (not worth the trade-off ... more indexes, more complexity). Historically the way to fix counts is to use estimates (at scale this often becomes the only way).
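
For illustration, the estimate approach reads the planner's statistics instead of scanning rows; a minimal sketch (approximate by design, and only as fresh as the last autovacuum/ANALYZE of the table) would be:

-- fast approximate row count from planner statistics
SELECT reltuples::bigint AS estimated_num_items
FROM pg_class
WHERE relname = 'qualified_purl';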

FWIW, a good (though old) article on pg counts: https://www.citusdata.com/blog/2016/10/12/count-performance/

JimFuller-RedHat commented 2 weeks ago

Per conversations with @ctron: we should optimise calls to any service that does not need counts, so keeping this issue open for now.