Open RafaelGSS opened 2 months ago
A lot of these issues appear to have to do with relatively cheap functions. Maybe there is an additional overhead?
> A lot of these issues appear to have to do with relatively cheap functions. Maybe there is an additional overhead?
+1, something must be going wrong with the test harness. For example, one test (`includes-vs-raw-comparison.md`, "Using raw comparison (first item)") shows up as ~10x slower, but the test is a simple string comparison, and there is no way that is ~10x slower in V8 12 vs V8 11 (Maglev or not). I ran a few simple isolated lame benchmarks timing the execution of `[].includes` etc., and performance is effectively identical between v22 and v20.
Theory: before Node v22, V8 applied dead-code elimination and reduced simple test functions to a no-op, which artificially inflated performance numbers. For example, there is a simple test case like this:
```js
suite
  .add('using Array.includes', function () {
    const httpVersion = '1.1'
    const exists = ['2.0', '1.0', '1.1'].includes(httpVersion)
  })
```
This test is simple and doesn't do much, but it does do something. Yet on v21 it has the same ops/sec as an empty function:
```js
suite
  .add('do nothing', function () { })
```
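One quick way to sanity-check for this (a sketch, not part of the repository's harness; `timeOps` is a hypothetical helper) is to time the candidate against an explicit no-op: if the two ops/sec figures are nearly identical, suspect dead-code elimination rather than a genuinely free operation.

```js
// Hypothetical sanity check: if a "real" test runs as fast as an empty
// function, suspect dead-code elimination rather than a free operation.
function timeOps (fn, iterations = 1_000_000) {
  const start = process.hrtime.bigint()
  for (let i = 0; i < iterations; i++) fn()
  const elapsedNs = Number(process.hrtime.bigint() - start)
  return iterations / (elapsedNs / 1e9) // ops/sec
}

const noop = timeOps(function () {})
const includes = timeOps(function () {
  const httpVersion = '1.1'
  return ['2.0', '1.0', '1.1'].includes(httpVersion)
})

// A ratio near 1 suggests the includes body may have been eliminated.
console.log('noop ops/sec:    ', noop.toExponential(2))
console.log('includes ops/sec:', includes.toExponential(2))
console.log('ratio:', (noop / includes).toFixed(2))
```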
Which makes me think that V8 optimizes that test into a no-op. If I modify the test suite slightly, like so:
```js
const suite = createBenchmarkSuite('Array.includes vs raw comparison')
let exists // module-level variable so V8 cannot dead-code-eliminate our test logic

suite
  .add('using Array.includes', function () {
    const httpVersion = '1.1'
    exists = ['2.0', '1.0', '1.1'].includes(httpVersion)
  })
  .add('using Array.includes (first item)', function () {
    const httpVersion = '2.0'
    exists = ['2.0', '1.0', '1.1'].includes(httpVersion)
  })
```
Then ops/sec for v21 "drops" by ~10x and becomes comparable to the ops/sec measured for v22.1, and to Bun.
Perhaps some code-optimization logic changed in V8 12, and those simple tests are no longer reduced to a no-op by V8 but actually have to execute code. If true, it's not that the Node v22.1 numbers are slow; it's that the baseline from v21 is wrong.
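A flag-free variant of the same sink trick (a common microbenchmark idiom, not something from the repository) is to fold every result into a checksum and print it at the end, so the engine must treat the benchmarked work as observable:

```js
// Accumulate results into a checksum that is printed at process exit,
// so V8 cannot prove the benchmarked work is unobservable and drop it.
let checksum = 0

function benchBody () {
  const httpVersion = '1.1'
  checksum += ['2.0', '1.0', '1.1'].includes(httpVersion) ? 1 : 0
}

for (let i = 0; i < 1_000_000; i++) benchBody()

// The print below is the side effect that keeps the loop from being eliminated.
console.log('checksum:', checksum)
```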
> V8 applied dead-code elimination
It makes a lot of sense.
> This test is simple, and doesn't do much, but it does do something. Yet for v21, it has the same ops/sec as an empty function
Yeah, that could be it. Looks to me like a reality check for micro-benchmarks 😄
But even with your suggestion, node 22 seems slower overall in the benchmarks (macOS)
node 20:
## startsWith comparison
|name|ops/sec|samples|
|-|-|-|
|(short string) (true) String#startsWith|155,325,867|99|
|(short string) (true) String#slice and strict comparison|108,541,352|95|
|(long string) (true) String#startsWith|107,012,984|101|
|(long string) (true) String#slice and strict comparison|97,092,528|96|
|(short string) (false) String#startsWith|155,958,923|97|
|(short string) (false) String#slice and strict comparison|106,668,930|99|
|(long string) (false) String#startsWith|152,774,020|96|
|(long string) (false) String#slice and strict comparison|96,296,096|100|
node 22:
## startsWith comparison
|name|ops/sec|samples|
|-|-|-|
|(short string) (true) String#startsWith|109,696,453|97|
|(short string) (true) String#slice and strict comparison|78,338,981|91|
|(long string) (true) String#startsWith|71,833,592|91|
|(long string) (true) String#slice and strict comparison|65,265,958|91|
|(short string) (false) String#startsWith|109,524,264|98|
|(short string) (false) String#slice and strict comparison|77,999,196|91|
|(long string) (false) String#startsWith|105,947,492|90|
|(long string) (false) String#slice and strict comparison|65,677,907|94|
Benchmark (modified from `nodejs-bench-operations`):
```js
const { createBenchmarkSuite } = require('../common')

const suite = createBenchmarkSuite('startsWith comparison')
const shortString = 'foobar'
const longString = 'foobar'.repeat(100)
const comparison = 'foo'
const comparison2 = 'bar'
let result = false

suite
  .add('(short string) (true) String#startsWith', function () {
    result = shortString.startsWith(comparison)
  })
  .add('(short string) (true) String#slice and strict comparison', function () {
    result = shortString.slice(0, comparison.length) === comparison
  })
  .add('(long string) (true) String#startsWith', function () {
    result = longString.startsWith(comparison)
  })
  .add('(long string) (true) String#slice and strict comparison', function () {
    result = longString.slice(0, comparison.length) === comparison
  })
  .add('(short string) (false) String#startsWith', function () {
    result = shortString.startsWith(comparison2)
  })
  .add('(short string) (false) String#slice and strict comparison', function () {
    result = shortString.slice(0, comparison2.length) === comparison2
  })
  .add('(long string) (false) String#startsWith', function () {
    result = longString.startsWith(comparison2)
  })
  .add('(long string) (false) String#slice and strict comparison', function () {
    result = longString.slice(0, comparison2.length) === comparison2
  })
  .run({ async: false })
```
> But even with your suggestion, node 22 seems slower overall in the benchmarks (macOS)
Yes, and for these micro-benchmarks I also saw a ~30% drop in ops/sec from v20 to v22. However, when I did a dumb manual test by inlining the logic like so:
```js
const longString = 'foobar'.repeat(100)
let result = false
const comparison = 'foo'

console.time('run')
for (let i = 0; i < 100_000_000; i++) {
  result = longString.startsWith(comparison)
}
console.timeEnd('run')
```
The results were ~identical between v20 and v22. This bypasses any V8 function-level optimizations and directly checks the performance of the operation itself. Something changed in how v22 runs the test harness, and that obscures the performance of the actual code under test.
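The inline loop above can be wrapped into a small reusable timer that reports ops/sec directly, making runs under different Node versions easier to compare (a sketch, not the repository's harness; it uses fewer iterations than the original 100M so it finishes quickly):

```js
// Same idea as the console.time() loop, but parameterized and reporting
// ops/sec so runs under different Node versions compare directly.
function measure (label, fn, iterations = 10_000_000) {
  const start = process.hrtime.bigint()
  let result
  for (let i = 0; i < iterations; i++) result = fn()
  const seconds = Number(process.hrtime.bigint() - start) / 1e9
  console.log(`${label}: ${(iterations / seconds).toExponential(3)} ops/sec`)
  return result // returning the last result keeps it observable
}

const longString = 'foobar'.repeat(100)
const comparison = 'foo'
measure('startsWith', () => longString.startsWith(comparison))
```

Run the same file with each Node version (e.g. via nvm) and compare the printed numbers.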
Have you checked V8 optimization (maybe with `--trace-opt` / `--trace-deopt` / `--trace-maglev-inlining`)?
I also suspect this is due to how dead-code elimination works in the new compiler. Let me clarify a few things about the `nodejs-bench-operations` repository. The idea of this repository is to compare two versions of Node.js performing the same operation in the same environment (a dedicated Linux machine). If version 21 applies dead-code elimination (no-op) and 22 doesn't, the benchmark should surface that. The percentage is irrelevant in microbenchmarking, as it might represent 0.05% of your execution time.

However, if we include operations that somehow disable dead-code elimination (for instance, %DoNotOptimize), the microbenchmark won't give us any insight, since operation Y is unlikely to behave the same way in production (where it is likely to be optimized). It would be pretty much the same as measuring FLOPS or CPU instructions.
My purpose in raising this issue is to suggest that we focus our efforts on more realistic benchmarks. I have another repository for this, which I will make public soon. Something has changed; now we should understand whether it is indeed a regression (caused by not flagging some operations as "optimizable") or not.

Please also consider the environment used for each benchmark (Ubuntu 22.04 x64, 4 vCPUs, 16 GB RAM - m6a.xlarge). We are discussing approaches for a more consistent and "close to reality" setup in https://github.com/RafaelGSS/nodejs-bench-operations/issues/74.
> Have you checked V8 optimization (maybe with `--trace-opt` / `--trace-deopt` / `--trace-maglev-inlining`)?
Yes, I did. I also tried various combinations of those, plus `--print-bytecode --allow-natives-syntax` to get `%OptimizeFunctionOnNextCall` and monitoring via `%GetOptimizationStatus` working. My understanding of that output (bytecode, etc.) is limited, but I could not find any meaningful differences between v22 and v20, especially when running v22 with `--no-maglev`. Bytecode and optimization flags looked the same across both versions, yet there is no way test code should run as quickly as a no-op function unless it is being eliminated. I couldn't isolate that information in the bytecode; maybe V8 does the elimination pass on the bytecode rather than before it, so the generated bytecode wouldn't show it. But, again, my understanding of V8 internals is rudimentary.
Could be due to Turboshaft, introduced fully in V8 12.0; maybe it doesn't do dead-code elimination the same way:
> Maglev wasn't our only investment in improved compiler technology. We've also introduced Turboshaft, a new internal architecture for our top-tier optimizing compiler Turbofan, making it both easier to extend with new optimizations and faster at compiling. Since Chrome 120, the CPU-agnostic backend phases all use Turboshaft rather than Turbofan, and compile about twice as fast as before. This is saving energy and is paving the way for more exciting performance gains next year and beyond. Keep an eye out for updates!
So, I have investigated this a bit, and it's unlikely to be a regression, but a different benchmarking approach is required.
Before Maglev, the benchmarks were optimized directly to TURBOFAN during the benchmark clock. Assume the following analogy:

Note that the samples were collected including portions of interpreted-code measurements plus Turbofan-code measurements.

Now, with MAGLEV in the middle, a few more operations are executed during the benchmark clock measurement: once the Maglev code is hot & stable, it is marked for optimization to TURBOFAN.
Therefore, I'm adjusting `nodejs-bench-operations` to include:
Hence, for now, disregard these results.
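One plausible adjustment along these lines (my sketch, not necessarily what the repository will ship) is an explicit untimed warm-up phase, so the Ignition → Maglev → Turbofan tiering happens before the clock starts:

```js
// Run the function enough times before timing that V8's tiering
// (Ignition -> Maglev -> Turbofan) has a chance to settle, so the timed
// window measures optimized code rather than compilation in progress.
function benchWithWarmup (fn, { warmup = 100_000, iterations = 1_000_000 } = {}) {
  for (let i = 0; i < warmup; i++) fn() // untimed warm-up: let tiering happen
  const start = process.hrtime.bigint()
  for (let i = 0; i < iterations; i++) fn()
  const seconds = Number(process.hrtime.bigint() - start) / 1e9
  return iterations / seconds // ops/sec of (hopefully) optimized code
}

let sink // side-effect sink to avoid dead-code elimination
const opsPerSec = benchWithWarmup(() => {
  sink = ['2.0', '1.0', '1.1'].includes('1.1')
})
console.log(opsPerSec.toExponential(3), 'ops/sec')
```

The warm-up count is a heuristic; without checking `%GetOptimizationStatus` there is no guarantee Turbofan has finished by the time the clock starts.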
Results obtained with `--trace-opt-verbose`:
```
[marking 0x15f907a1dd31 <JSFunction testDirectInObj (sfi = 0x2948236ff449)> for optimization to MAGLEV, ConcurrencyMode::kConcurrent, reason: hot and stable]
[completed compiling 0x2f489461c549 <JSFunction get (sfi = 0x112422403231)> (target MAGLEV) - took 0.001, 1.106, 0.008 ms]
[compiling method 0x15f907a1dd31 <JSFunction testDirectInObj (sfi = 0x2948236ff449)> (target MAGLEV), mode: ConcurrencyMode::kConcurrent]
[completed compiling 0x15f907a1dd31 <JSFunction testDirectInObj (sfi = 0x2948236ff449)> (target MAGLEV) - took 0.000, 0.041, 0.002 ms]
...
[marking 0x278484a0e049 <JSFunction testDirectInObj (sfi = 0x2948236ff449)> for optimization to TURBOFAN, ConcurrencyMode::kConcurrent, reason: hot and stable]
[compiling method 0x278484a0e049 <JSFunction testDirectInObj (sfi = 0x2948236ff449)> (target TURBOFAN), mode: ConcurrencyMode::kConcurrent]
[not marking function h (MAGLEV) for optimization: already queued]
[completed compiling 0x278484a0e049 <JSFunction testDirectInObj (sfi = 0x2948236ff449)> (target TURBOFAN) - took 0.004, 0.181, 0.009 ms]
[completed optimizing 0x278484a0e049 <JSFunction testDirectInObj (sfi = 0x2948236ff449)> (target TURBOFAN)]
```
The `nodejs-bench-operations` run identified regressions between Node.js 21.7.3 and Node.js 22.1.0.

Source: https://github.com/RafaelGSS/nodejs-bench-operations/actions/runs/8946165774/job/24576348629#step:5:56