ncruces / go-sqlite3

Go bindings to SQLite using wazero
https://pkg.go.dev/github.com/ncruces/go-sqlite3
MIT License
499 stars 16 forks source link

Investigate wazevo amd64 performance #61

Closed ncruces closed 8 months ago

ncruces commented 8 months ago

Now that #60 is fixed, and all tests pass, investigate performance.

$ cd vfs/tests/speedtest1
$ go test -bench=. -benchtime=100x --memdb
goos: darwin
goarch: amd64
pkg: github.com/ncruces/go-sqlite3/vfs/tests/speedtest1
cpu: Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz

On wazevo (#60) branch:

Benchmark_speedtest1-12              100         100241318 ns/op
PASS
-- Speedtest1 for SQLite 3.45.1 2024-01-30 16:01:20 e876e51a0ed5c5b3126f52e53204
 100 - 50000 INSERTs into table with no index......................    0.127s
 110 - 50000 ordered INSERTS with one index/PK.....................    0.188s
 120 - 50000 unordered INSERTS with one index/PK...................    0.217s
 130 - 25 SELECTS, numeric BETWEEN, unindexed......................    0.279s
 140 - 10 SELECTS, LIKE, unindexed.................................    0.372s
 142 - 10 SELECTS w/ORDER BY, unindexed............................    0.528s
 145 - 10 SELECTS w/ORDER BY and LIMIT, unindexed..................    0.284s
 150 - CREATE INDEX five times.....................................    0.404s
 160 - 10000 SELECTS, numeric BETWEEN, indexed.....................    0.172s
 161 - 10000 SELECTS, numeric BETWEEN, PK..........................    0.172s
 170 - 10000 SELECTS, text BETWEEN, indexed........................    0.270s
 180 - 50000 INSERTS with three indexes............................    0.323s
 190 - DELETE and REFILL one table.................................    0.317s
 200 - VACUUM......................................................    0.351s
 210 - ALTER TABLE ADD COLUMN, and query...........................    0.016s
 230 - 10000 UPDATES, numeric BETWEEN, indexed.....................    0.191s
 240 - 50000 UPDATES of individual rows............................    0.154s
 250 - One big UPDATE of the whole 50000-row table.................    0.057s
 260 - Query added column after filling............................    0.014s
 270 - 10000 DELETEs, numeric BETWEEN, indexed.....................    0.224s
 280 - 50000 DELETEs of individual rows............................    0.158s
 290 - Refill two 50000-row tables using REPLACE...................    0.632s
 300 - Refill a 50000-row table using (b&1)==(a&1).................    0.341s
 310 - 10000 four-ways joins.......................................    0.448s
 320 - subquery in result set......................................    1.868s
 400 - 70000 REPLACE ops on an IPK.................................    0.217s
 410 - 70000 SELECTS on an IPK.....................................    0.131s
 500 - 70000 REPLACE on TEXT PK....................................    0.261s
 510 - 70000 SELECTS on a TEXT PK..................................    0.248s
 520 - 70000 SELECT DISTINCT.......................................    0.152s
 980 - PRAGMA integrity_check......................................    0.760s
 990 - ANALYZE.....................................................    0.146s
       TOTAL.......................................................   10.022s
ok      github.com/ncruces/go-sqlite3/vfs/tests/speedtest1      12.510s

On main branch:

Benchmark_speedtest1-12              100          90132651 ns/op
PASS
-- Speedtest1 for SQLite 3.45.1 2024-01-30 16:01:20 e876e51a0ed5c5b3126f52e53204
 100 - 50000 INSERTs into table with no index......................    0.142s
 110 - 50000 ordered INSERTS with one index/PK.....................    0.183s
 120 - 50000 unordered INSERTS with one index/PK...................    0.241s
 130 - 25 SELECTS, numeric BETWEEN, unindexed......................    0.152s
 140 - 10 SELECTS, LIKE, unindexed.................................    0.355s
 142 - 10 SELECTS w/ORDER BY, unindexed............................    0.545s
 145 - 10 SELECTS w/ORDER BY and LIMIT, unindexed..................    0.274s
 150 - CREATE INDEX five times.....................................    0.344s
 160 - 10000 SELECTS, numeric BETWEEN, indexed.....................    0.159s
 161 - 10000 SELECTS, numeric BETWEEN, PK..........................    0.165s
 170 - 10000 SELECTS, text BETWEEN, indexed........................    0.305s
 180 - 50000 INSERTS with three indexes............................    0.295s
 190 - DELETE and REFILL one table.................................    0.326s
 200 - VACUUM......................................................    0.348s
 210 - ALTER TABLE ADD COLUMN, and query...........................    0.010s
 230 - 10000 UPDATES, numeric BETWEEN, indexed.....................    0.139s
 240 - 50000 UPDATES of individual rows............................    0.144s
 250 - One big UPDATE of the whole 50000-row table.................    0.039s
 260 - Query added column after filling............................    0.009s
 270 - 10000 DELETEs, numeric BETWEEN, indexed.....................    0.243s
 280 - 50000 DELETEs of individual rows............................    0.154s
 290 - Refill two 50000-row tables using REPLACE...................    0.643s
 300 - Refill a 50000-row table using (b&1)==(a&1).................    0.324s
 310 - 10000 four-ways joins.......................................    0.500s
 320 - subquery in result set......................................    1.062s
 400 - 70000 REPLACE ops on an IPK.................................    0.219s
 410 - 70000 SELECTS on an IPK.....................................    0.134s
 500 - 70000 REPLACE on TEXT PK....................................    0.293s
 510 - 70000 SELECTS on a TEXT PK..................................    0.302s
 520 - 70000 SELECT DISTINCT.......................................    0.161s
 980 - PRAGMA integrity_check......................................    0.705s
 990 - ANALYZE.....................................................    0.096s
       TOTAL.......................................................    9.011s
ok      github.com/ncruces/go-sqlite3/vfs/tests/speedtest1      10.032s
ncruces commented 8 months ago
$ go tool pprof profile.out 

File: speedtest1.test
Type: cpu
Time: Feb 20, 2024 at 3:51pm (WET)
Duration: 10.38s, Total samples = 9.50s (91.53%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top

Showing nodes accounting for 9.41s, 99.05% of 9.50s total
Dropped 31 nodes (cum <= 0.05s)
Showing top 10 nodes out of 21
      flat  flat%   sum%        cum   cum%
     9.01s 94.84% 94.84%      9.01s 94.84%  runtime._ExternalCode
     0.34s  3.58% 98.42%      0.34s  3.58%  runtime.memmove
     0.06s  0.63% 99.05%      0.06s  0.63%  runtime.pthread_cond_signal
         0     0% 99.05%      0.17s  1.79%  github.com/ncruces/go-sqlite3/vfs/tests/speedtest1.Benchmark_speedtest1
         0     0% 99.05%      0.17s  1.79%  github.com/tetratelabs/wazero.(*runtime).InstantiateModule
         0     0% 99.05%      0.17s  1.79%  github.com/tetratelabs/wazero/internal/engine/wazevo.(*callEngine).Call
         0     0% 99.05%      0.17s  1.79%  github.com/tetratelabs/wazero/internal/engine/wazevo.(*callEngine).callWithStack
         0     0% 99.05%      0.17s  1.79%  github.com/tetratelabs/wazero/internal/wasm.(*MemoryInstance).Grow
         0     0% 99.05%      9.01s 94.84%  runtime._System
         0     0% 99.05%      0.06s  0.63%  runtime.gopreempt_m

profile.out.gz

mathetake commented 8 months ago

confirmed the same result on arm64. thank you for this

ncruces commented 8 months ago

Haven't been able to get a decent wzprof out of this. For some reason, it seems I can't keep debug info which it requires.

I'm trying to tweak the build.sh script, but I'm afraid that if I avoid wasm-opt altogether, it invalidates results, as I'm guessing optimized WASM is harder to optimize.

mathetake commented 8 months ago

wazevo has perfmap option to get basic block level perf https://github.com/tetratelabs/wazero/blob/c16735db52fc82b1ee64d4cf169d7a015fab3341/internal/engine/wazevo/wazevoapi/perfmap_enabled.go#L5 so I will use this to find the bottleneck

mathetake commented 8 months ago
diff --git a/vfs/tests/speedtest1/speedtest1_test.go b/vfs/tests/speedtest1/speedtest1_test.go
index efecdaf..1d96835 100644
--- a/vfs/tests/speedtest1/speedtest1_test.go
+++ b/vfs/tests/speedtest1/speedtest1_test.go
@@ -17,6 +17,7 @@ import (
        _ "embed"

        "github.com/tetratelabs/wazero"
+       "github.com/tetratelabs/wazero/experimental/opt"
        "github.com/tetratelabs/wazero/imports/wasi_snapshot_preview1"

        "github.com/ncruces/go-sqlite3/internal/util"
@@ -38,7 +39,7 @@ func TestMain(m *testing.M) {
        initFlags()

        ctx := context.Background()
-       rt = wazero.NewRuntime(ctx)
+       rt = wazero.NewRuntimeWithConfig(ctx, opt.NewRuntimeConfigOptimizingCompiler())
        wasi_snapshot_preview1.MustInstantiate(ctx, rt)
        env := vfs.ExportHostFunctions(rt.NewHostModuleBuilder("env"))
        _, err := env.Instantiate(ctx)

basically you haven't enabled optimizing compiler in speedtest1 bench and with this I see the drastic perf gain 👯

goos: darwin
goarch: arm64
pkg: github.com/ncruces/go-sqlite3/vfs/tests/speedtest1
               │   old.txt   │               new.txt               │
               │   sec/op    │   sec/op     vs base                │
_speedtest1-10   74.74m ± 1%   51.17m ± 0%  -31.54% (p=0.000 n=10)
mathetake commented 8 months ago

just for my information, I took the perf data and gave me some insights image

ncruces commented 8 months ago

Seems this is a non issue on arm64, where wazevo significantly improves performance.

It's still a problem on amd64 where numbers are mixed (certain operations improve, others are are worse).

mathetake commented 8 months ago

took the function granularity perf and like .sqlite3VdbeExec seems a bottle neck

amd64

image

image

arm64

image

image

mathetake commented 8 months ago

ok I identified the bottleneck in regalloc, which potentially could improve any kind of programs, will work on it soon

mathetake commented 8 months ago

with this https://github.com/tetratelabs/wazero/pull/2099 PR, I think the perf improved compared to the current main branch, though it's not like one fell swoop.

$ benchstat old_compiler.txt main.txt new_regalloc.txt
goos: darwin
goarch: amd64
pkg: github.com/ncruces/go-sqlite3/vfs/tests/speedtest1
cpu: VirtualApple @ 2.50GHz
               │ old_compiler.txt │            main.txt             │        new_regalloc.txt         │
               │      sec/op      │    sec/op     vs base           │    sec/op     vs base           │
_speedtest1-10       85.03m ± ∞ ¹   93.12m ± ∞ ¹  ~ (p=1.000 n=1) ²   83.57m ± ∞ ¹  ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05
ncruces commented 8 months ago

Not sure I have the correct version, I synced to 5ca5cda7069c.

I'm still seeing some mixed results. Like, "subquery in result set" is almost twice slower (still ~1.0s to ~1.8s).

mathetake commented 8 months ago

updated version is here: https://github.com/tetratelabs/wazero/pull/2099/commits/374ee4d4ad247809a92fd39a0464c194ec86eedc

I think still subquery in result set seems the bottleneck but at least 10% faster than the current main

mathetake commented 8 months ago

I think we can finally close this issue with https://github.com/tetratelabs/wazero/pull/2109 ... could you do the doublecheck? @ncruces

ncruces commented 8 months ago

On wazevo (#60) branch:

Benchmark_speedtest1-12              100          79233332 ns/op
PASS
-- Speedtest1 for SQLite 3.45.1 2024-01-30 16:01:20 e876e51a0ed5c5b3126f52e53204
 100 - 50000 INSERTs into table with no index......................    0.109s
 110 - 50000 ordered INSERTS with one index/PK.....................    0.148s
 120 - 50000 unordered INSERTS with one index/PK...................    0.179s
 130 - 25 SELECTS, numeric BETWEEN, unindexed......................    0.174s
 140 - 10 SELECTS, LIKE, unindexed.................................    0.305s
 142 - 10 SELECTS w/ORDER BY, unindexed............................    0.420s
 145 - 10 SELECTS w/ORDER BY and LIMIT, unindexed..................    0.272s
 150 - CREATE INDEX five times.....................................    0.324s
 160 - 10000 SELECTS, numeric BETWEEN, indexed.....................    0.130s
 161 - 10000 SELECTS, numeric BETWEEN, PK..........................    0.132s
 170 - 10000 SELECTS, text BETWEEN, indexed........................    0.222s
 180 - 50000 INSERTS with three indexes............................    0.252s
 190 - DELETE and REFILL one table.................................    0.241s
 200 - VACUUM......................................................    0.305s
 210 - ALTER TABLE ADD COLUMN, and query...........................    0.010s
 230 - 10000 UPDATES, numeric BETWEEN, indexed.....................    0.133s
 240 - 50000 UPDATES of individual rows............................    0.132s
 250 - One big UPDATE of the whole 50000-row table.................    0.035s
 260 - Query added column after filling............................    0.009s
 270 - 10000 DELETEs, numeric BETWEEN, indexed.....................    0.219s
 280 - 50000 DELETEs of individual rows............................    0.177s
 290 - Refill two 50000-row tables using REPLACE...................    0.528s
 300 - Refill a 50000-row table using (b&1)==(a&1).................    0.305s
 310 - 10000 four-ways joins.......................................    0.468s
 320 - subquery in result set......................................    1.112s
 400 - 70000 REPLACE ops on an IPK.................................    0.176s
 410 - 70000 SELECTS on an IPK.....................................    0.107s
 500 - 70000 REPLACE on TEXT PK....................................    0.234s
 510 - 70000 SELECTS on a TEXT PK..................................    0.223s
 520 - 70000 SELECT DISTINCT.......................................    0.130s
 980 - PRAGMA integrity_check......................................    0.610s
 990 - ANALYZE.....................................................    0.099s
       TOTAL.......................................................    7.920s
ok      github.com/ncruces/go-sqlite3/vfs/tests/speedtest1      10.880s

On main branch:

Benchmark_speedtest1-12              100         101452988 ns/op
PASS
-- Speedtest1 for SQLite 3.45.1 2024-01-30 16:01:20 e876e51a0ed5c5b3126f52e53204
 100 - 50000 INSERTs into table with no index......................    0.125s
 110 - 50000 ordered INSERTS with one index/PK.....................    0.183s
 120 - 50000 unordered INSERTS with one index/PK...................    0.226s
 130 - 25 SELECTS, numeric BETWEEN, unindexed......................    0.199s
 140 - 10 SELECTS, LIKE, unindexed.................................    0.362s
 142 - 10 SELECTS w/ORDER BY, unindexed............................    0.485s
 145 - 10 SELECTS w/ORDER BY and LIMIT, unindexed..................    0.349s
 150 - CREATE INDEX five times.....................................    0.422s
 160 - 10000 SELECTS, numeric BETWEEN, indexed.....................    0.164s
 161 - 10000 SELECTS, numeric BETWEEN, PK..........................    0.162s
 170 - 10000 SELECTS, text BETWEEN, indexed........................    0.316s
 180 - 50000 INSERTS with three indexes............................    0.360s
 190 - DELETE and REFILL one table.................................    0.401s
 200 - VACUUM......................................................    0.365s
 210 - ALTER TABLE ADD COLUMN, and query...........................    0.010s
 230 - 10000 UPDATES, numeric BETWEEN, indexed.....................    0.184s
 240 - 50000 UPDATES of individual rows............................    0.203s
 250 - One big UPDATE of the whole 50000-row table.................    0.046s
 260 - Query added column after filling............................    0.011s
 270 - 10000 DELETEs, numeric BETWEEN, indexed.....................    0.310s
 280 - 50000 DELETEs of individual rows............................    0.181s
 290 - Refill two 50000-row tables using REPLACE...................    0.750s
 300 - Refill a 50000-row table using (b&1)==(a&1).................    0.378s
 310 - 10000 four-ways joins.......................................    0.625s
 320 - subquery in result set......................................    1.246s
 400 - 70000 REPLACE ops on an IPK.................................    0.297s
 410 - 70000 SELECTS on an IPK.....................................    0.153s
 500 - 70000 REPLACE on TEXT PK....................................    0.320s
 510 - 70000 SELECTS on a TEXT PK..................................    0.332s
 520 - 70000 SELECT DISTINCT.......................................    0.162s
 980 - PRAGMA integrity_check......................................    0.717s
 990 - ANALYZE.....................................................    0.098s
       TOTAL.......................................................   10.142s
ok      github.com/ncruces/go-sqlite3/vfs/tests/speedtest1      10.897s