vocdoni / vocdoni-node

A set of libraries and tools for the Vocdoni decentralized backend infrastructure, the main ground of our universally verifiable, privacy-centric and scalable digital voting protocol
GNU Affero General Public License v3.0

bug: api endpoints dealing with blocks are too slow in stg #1382

Closed: altergui closed this issue 1 week ago

altergui commented 1 week ago

since merging #1329, performance regressed significantly. this was not noticed in dev (<10k blocks), only in stage (>1M blocks indexed), so it must be due to the table size. could you take a look @mvdan?

https://github.com/vocdoni/vocdoni-node/blob/b34b5bb661c5f5c393c7f857177d742a90582cae/vochain/indexer/queries/blocks.sql#L24-L46

altergui commented 1 week ago

curl -s "https://api-stg.vocdoni.net/v2/chain/blocks?limit=1" is affected

but fetching one specific block, like curl -s "https://api-stg.vocdoni.net/v2/chain/blocks/1" is not

mvdan commented 1 week ago

With the backup sql file you provided, I wrote a small benchmark to measure how slow the query is, mimicking your REST query with ?limit=1:

func BenchmarkTODO(b *testing.B) {
    app := vochain.TestBaseApplication(b)

    idx, err := New(app, Options{DataDir: b.TempDir(), ExpectBackupRestore: true})
    qt.Assert(b, err, qt.IsNil)
    err = idx.RestoreBackup("testdata/backup.sql")
    qt.Assert(b, err, qt.IsNil)

    b.ReportAllocs()
    b.ResetTimer()

    for i := 0; i < b.N; i++ {
        blocks, total, err := idx.BlockList(1, 0, "", "", "")
        qt.Assert(b, err, qt.IsNil)
        qt.Assert(b, blocks, qt.HasLen, 1)
        qt.Assert(b, blocks[0].TxCount, qt.Equals, int64(0))
        qt.Assert(b, total, qt.Equals, uint64(1635162))
    }
}

So there is good news and bad news. Let's start with the good news - simply removing the COUNT(*) OVER() AS total_count from the query gives you a near-infinite speed-up, from ~1.5s to ~0.02ms on my laptop:

       │       old        │                 new                 │
       │      sec/op      │   sec/op     vs base                │
TODO-8   1476149.90µ ± 4%   26.43µ ± 4%  -100.00% (p=0.002 n=6)
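The shape of the change can be sketched against a toy in-memory SQLite table (the schema here is invented for illustration; the real query is the blocks.sql linked above). The window function forces SQLite to visit every row even though only one is returned, while the plain query can stop after a single index step:

```python
import sqlite3

# Simplified, hypothetical schema standing in for the real blocks table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE blocks (height INTEGER PRIMARY KEY, hash BLOB)")
db.executemany("INSERT INTO blocks (height, hash) VALUES (?, ?)",
               [(i, b"") for i in range(1, 1001)])

# Before: COUNT(*) OVER() is computed over the full result set,
# so every row is visited despite the LIMIT 1.
slow = db.execute(
    "SELECT height, COUNT(*) OVER() AS total_count "
    "FROM blocks ORDER BY height DESC LIMIT 1").fetchone()

# After: without the window count, the LIMIT stops the scan immediately.
fast = db.execute(
    "SELECT height FROM blocks ORDER BY height DESC LIMIT 1").fetchone()

print(slow)  # (1000, 1000)
print(fast)  # (1000,)
```

Both queries return the latest block, but only the first one pays for counting the whole table.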

The bad news is that counting over an entire table in sqlite will always be slow for huge tables. Even in the best-case scenario, where we just count all rows via SELECT COUNT(*) FROM blocks, I still measure a runtime of about 7.5ms, or about 300x slower than the 0.02ms for getting a single block as above. This scales with the size of the table - which is why you only noticed the slow-down in staging, which has about 1.6 million indexed blocks.

And counting over the entire table scales even worse with more complex queries like yours, because the WHERE, the GROUP BY, and the LEFT JOIN all have to apply to every row being counted. So not only are you scanning the entire table (which already takes about 300x as long), but you're also doing extra work for every row in it, which takes you from milliseconds to seconds - another 1000x slow-down.

You basically cannot have a filtering (or searching) SQL query that also counts the total number of results: for SQL to compute that count, it needs to visit the entire table. So any LIMIT or OFFSET parameters are not saving any work - the query still has to churn CPU scanning the whole table.
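SQLite's own planner makes this visible via EXPLAIN QUERY PLAN. A minimal sketch (invented table and columns, just to show the two plan shapes): a point lookup by primary key is reported as a SEARCH (an index seek that can stop early), while a filtered count is reported as a SCAN (a full table walk that no outer LIMIT/OFFSET can avoid):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE blocks (height INTEGER PRIMARY KEY, chain_id TEXT)")

def plan(query):
    # Collect the human-readable detail strings of the query plan.
    return [row[-1] for row in db.execute("EXPLAIN QUERY PLAN " + query)]

# Point lookup: the plan contains SEARCH (index seek on the primary key).
print(plan("SELECT * FROM blocks WHERE height = 1"))

# Filtered count: the plan contains SCAN (every row must be visited).
print(plan("SELECT COUNT(*) FROM blocks WHERE chain_id LIKE 'voc%'"))
```

The exact wording of the plan strings varies between SQLite versions, but the SEARCH-vs-SCAN distinction is stable.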

I think the simplest solution for this query is to drop the COUNT(*) OVER(). I understand it's probably wanted for the sake of providing a total number of pages for the pagination in the UI, and this sort of hack is probably OK for queries on processes or entities, where the total size of the table is manageable (on the order of thousands of entries). But for blocks, where we very easily get into the millions, it's just not feasible.
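If the UI still needs to know whether a next page exists, one common pattern (a general sketch, not necessarily what the eventual fix does; names invented) is to fetch limit+1 rows and report a boolean instead of a total - the cost stays proportional to the page size, with no counting anywhere:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE blocks (height INTEGER PRIMARY KEY)")
db.executemany("INSERT INTO blocks (height) VALUES (?)",
               [(i,) for i in range(1, 26)])  # 25 rows

def block_page(limit, offset):
    # Ask for one extra row: if we get it, a next page exists.
    rows = db.execute(
        "SELECT height FROM blocks ORDER BY height DESC LIMIT ? OFFSET ?",
        (limit + 1, offset)).fetchall()
    return rows[:limit], len(rows) > limit

page, has_more = block_page(10, 20)
print(len(page), has_more)  # 5 False
```

This trades the exact page count for a "has more" flag, which is usually enough for next/previous navigation.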

altergui commented 1 week ago

thanks! fixed in c3cf08ada328710fb6d0b597de700ece301cfa10