Closed ncruces closed 8 months ago
$ go tool pprof profile.out
File: speedtest1.test
Type: cpu
Time: Feb 20, 2024 at 3:51pm (WET)
Duration: 10.38s, Total samples = 9.50s (91.53%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 9.41s, 99.05% of 9.50s total
Dropped 31 nodes (cum <= 0.05s)
Showing top 10 nodes out of 21
flat flat% sum% cum cum%
9.01s 94.84% 94.84% 9.01s 94.84% runtime._ExternalCode
0.34s 3.58% 98.42% 0.34s 3.58% runtime.memmove
0.06s 0.63% 99.05% 0.06s 0.63% runtime.pthread_cond_signal
0 0% 99.05% 0.17s 1.79% github.com/ncruces/go-sqlite3/vfs/tests/speedtest1.Benchmark_speedtest1
0 0% 99.05% 0.17s 1.79% github.com/tetratelabs/wazero.(*runtime).InstantiateModule
0 0% 99.05% 0.17s 1.79% github.com/tetratelabs/wazero/internal/engine/wazevo.(*callEngine).Call
0 0% 99.05% 0.17s 1.79% github.com/tetratelabs/wazero/internal/engine/wazevo.(*callEngine).callWithStack
0 0% 99.05% 0.17s 1.79% github.com/tetratelabs/wazero/internal/wasm.(*MemoryInstance).Grow
0 0% 99.05% 9.01s 94.84% runtime._System
0 0% 99.05% 0.06s 0.63% runtime.gopreempt_m
confirmed the same result on arm64. thank you for this
Haven't been able to get a decent wzprof
out of this. For some reason, it seems I can't keep debug info which it requires.
I'm trying to tweak the build.sh
script, but I'm afraid that if I avoid wasm-opt
altogether, it invalidates results, as I'm guessing optimized WASM is harder to optimize.
wazevo has perfmap option to get basic block level perf https://github.com/tetratelabs/wazero/blob/c16735db52fc82b1ee64d4cf169d7a015fab3341/internal/engine/wazevo/wazevoapi/perfmap_enabled.go#L5 so I will use this to find the bottleneck
diff --git a/vfs/tests/speedtest1/speedtest1_test.go b/vfs/tests/speedtest1/speedtest1_test.go
index efecdaf..1d96835 100644
--- a/vfs/tests/speedtest1/speedtest1_test.go
+++ b/vfs/tests/speedtest1/speedtest1_test.go
@@ -17,6 +17,7 @@ import (
_ "embed"
"github.com/tetratelabs/wazero"
+ "github.com/tetratelabs/wazero/experimental/opt"
"github.com/tetratelabs/wazero/imports/wasi_snapshot_preview1"
"github.com/ncruces/go-sqlite3/internal/util"
@@ -38,7 +39,7 @@ func TestMain(m *testing.M) {
initFlags()
ctx := context.Background()
- rt = wazero.NewRuntime(ctx)
+ rt = wazero.NewRuntimeWithConfig(ctx, opt.NewRuntimeConfigOptimizingCompiler())
wasi_snapshot_preview1.MustInstantiate(ctx, rt)
env := vfs.ExportHostFunctions(rt.NewHostModuleBuilder("env"))
_, err := env.Instantiate(ctx)
basically you haven't enabled optimizing compiler in speedtest1 bench and with this I see the drastic perf gain 👯
goos: darwin
goarch: arm64
pkg: github.com/ncruces/go-sqlite3/vfs/tests/speedtest1
│ old.txt │ new.txt │
│ sec/op │ sec/op vs base │
_speedtest1-10 74.74m ± 1% 51.17m ± 0% -31.54% (p=0.000 n=10)
just for my information, I took the perf data and gave me some insights
Seems this is a non issue on arm64
, where wazevo significantly improves performance.
It's still a problem on amd64
where numbers are mixed (certain operations improve, others are are worse).
took the function granularity perf and like .sqlite3VdbeExec
seems a bottle neck
ok I identified the bottleneck in regalloc, which potentially could improve any kind of programs, will work on it soon
with this https://github.com/tetratelabs/wazero/pull/2099 PR, I think the perf improved compared to the current main branch, though it's not like one fell swoop.
$ benchstat old_compiler.txt main.txt new_regalloc.txt
goos: darwin
goarch: amd64
pkg: github.com/ncruces/go-sqlite3/vfs/tests/speedtest1
cpu: VirtualApple @ 2.50GHz
│ old_compiler.txt │ main.txt │ new_regalloc.txt │
│ sec/op │ sec/op vs base │ sec/op vs base │
_speedtest1-10 85.03m ± ∞ ¹ 93.12m ± ∞ ¹ ~ (p=1.000 n=1) ² 83.57m ± ∞ ¹ ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05
Not sure I have the correct version, I synced to 5ca5cda7069c
.
I'm still seeing some mixed results. Like, "subquery in result set" is almost twice slower (still ~1.0s to ~1.8s).
updated version is here: https://github.com/tetratelabs/wazero/pull/2099/commits/374ee4d4ad247809a92fd39a0464c194ec86eedc
I think still subquery in result set
seems the bottleneck but at least 10% faster than the current main
I think we can finally close this issue with https://github.com/tetratelabs/wazero/pull/2109 ... could you do the doublecheck? @ncruces
On wazevo
(#60) branch:
Benchmark_speedtest1-12 100 79233332 ns/op
PASS
-- Speedtest1 for SQLite 3.45.1 2024-01-30 16:01:20 e876e51a0ed5c5b3126f52e53204
100 - 50000 INSERTs into table with no index...................... 0.109s
110 - 50000 ordered INSERTS with one index/PK..................... 0.148s
120 - 50000 unordered INSERTS with one index/PK................... 0.179s
130 - 25 SELECTS, numeric BETWEEN, unindexed...................... 0.174s
140 - 10 SELECTS, LIKE, unindexed................................. 0.305s
142 - 10 SELECTS w/ORDER BY, unindexed............................ 0.420s
145 - 10 SELECTS w/ORDER BY and LIMIT, unindexed.................. 0.272s
150 - CREATE INDEX five times..................................... 0.324s
160 - 10000 SELECTS, numeric BETWEEN, indexed..................... 0.130s
161 - 10000 SELECTS, numeric BETWEEN, PK.......................... 0.132s
170 - 10000 SELECTS, text BETWEEN, indexed........................ 0.222s
180 - 50000 INSERTS with three indexes............................ 0.252s
190 - DELETE and REFILL one table................................. 0.241s
200 - VACUUM...................................................... 0.305s
210 - ALTER TABLE ADD COLUMN, and query........................... 0.010s
230 - 10000 UPDATES, numeric BETWEEN, indexed..................... 0.133s
240 - 50000 UPDATES of individual rows............................ 0.132s
250 - One big UPDATE of the whole 50000-row table................. 0.035s
260 - Query added column after filling............................ 0.009s
270 - 10000 DELETEs, numeric BETWEEN, indexed..................... 0.219s
280 - 50000 DELETEs of individual rows............................ 0.177s
290 - Refill two 50000-row tables using REPLACE................... 0.528s
300 - Refill a 50000-row table using (b&1)==(a&1)................. 0.305s
310 - 10000 four-ways joins....................................... 0.468s
320 - subquery in result set...................................... 1.112s
400 - 70000 REPLACE ops on an IPK................................. 0.176s
410 - 70000 SELECTS on an IPK..................................... 0.107s
500 - 70000 REPLACE on TEXT PK.................................... 0.234s
510 - 70000 SELECTS on a TEXT PK.................................. 0.223s
520 - 70000 SELECT DISTINCT....................................... 0.130s
980 - PRAGMA integrity_check...................................... 0.610s
990 - ANALYZE..................................................... 0.099s
TOTAL....................................................... 7.920s
ok github.com/ncruces/go-sqlite3/vfs/tests/speedtest1 10.880s
On main
branch:
Benchmark_speedtest1-12 100 101452988 ns/op
PASS
-- Speedtest1 for SQLite 3.45.1 2024-01-30 16:01:20 e876e51a0ed5c5b3126f52e53204
100 - 50000 INSERTs into table with no index...................... 0.125s
110 - 50000 ordered INSERTS with one index/PK..................... 0.183s
120 - 50000 unordered INSERTS with one index/PK................... 0.226s
130 - 25 SELECTS, numeric BETWEEN, unindexed...................... 0.199s
140 - 10 SELECTS, LIKE, unindexed................................. 0.362s
142 - 10 SELECTS w/ORDER BY, unindexed............................ 0.485s
145 - 10 SELECTS w/ORDER BY and LIMIT, unindexed.................. 0.349s
150 - CREATE INDEX five times..................................... 0.422s
160 - 10000 SELECTS, numeric BETWEEN, indexed..................... 0.164s
161 - 10000 SELECTS, numeric BETWEEN, PK.......................... 0.162s
170 - 10000 SELECTS, text BETWEEN, indexed........................ 0.316s
180 - 50000 INSERTS with three indexes............................ 0.360s
190 - DELETE and REFILL one table................................. 0.401s
200 - VACUUM...................................................... 0.365s
210 - ALTER TABLE ADD COLUMN, and query........................... 0.010s
230 - 10000 UPDATES, numeric BETWEEN, indexed..................... 0.184s
240 - 50000 UPDATES of individual rows............................ 0.203s
250 - One big UPDATE of the whole 50000-row table................. 0.046s
260 - Query added column after filling............................ 0.011s
270 - 10000 DELETEs, numeric BETWEEN, indexed..................... 0.310s
280 - 50000 DELETEs of individual rows............................ 0.181s
290 - Refill two 50000-row tables using REPLACE................... 0.750s
300 - Refill a 50000-row table using (b&1)==(a&1)................. 0.378s
310 - 10000 four-ways joins....................................... 0.625s
320 - subquery in result set...................................... 1.246s
400 - 70000 REPLACE ops on an IPK................................. 0.297s
410 - 70000 SELECTS on an IPK..................................... 0.153s
500 - 70000 REPLACE on TEXT PK.................................... 0.320s
510 - 70000 SELECTS on a TEXT PK.................................. 0.332s
520 - 70000 SELECT DISTINCT....................................... 0.162s
980 - PRAGMA integrity_check...................................... 0.717s
990 - ANALYZE..................................................... 0.098s
TOTAL....................................................... 10.142s
ok github.com/ncruces/go-sqlite3/vfs/tests/speedtest1 10.897s
Now that #60 is fixed, and all tests pass, investigate performance.
On
wazevo
(#60) branch:On
main
branch: