nodejs / performance

Node.js team focusing on performance

V8 Regression with Node.js 22 #166

Open RafaelGSS opened 2 months ago

RafaelGSS commented 2 months ago

The nodejs-bench-operations suite identified regressions between Node.js 21.7.3 and Node.js 22.1.0:

|name|%|
|-|-|
|😱 includes-vs-raw-comparison.md#using Array.includes|-90.73%|
|😱 includes-vs-raw-comparison.md#using Array.includes (first item)|-88.51%|
|😱 includes-vs-raw-comparison.md#Using raw comparison|-84.26%|
|😱 includes-vs-raw-comparison.md#Using raw comparison (first item)|-84.15%|
|😱 private-property.md#Raw usage private field|-85.18%|
|😱 private-property.md#Raw usage underscore usage|-84.14%|
|😱 private-property.md#Manipulating private properties using #|-85.36%|
|😱 private-property.md#Manipulating private properties using underscore(_)|-84.58%|
|😱 private-property.md#Manipulating private properties using Symbol|-85.08%|
|😱 private-property.md#Manipulating private properties using PrivateSymbol|-23.35%|
|😱 async-function-vs-function.md#function|-84.71%|
|😱 async-function-vs-function.md#[async] function|-35.12%|
|😱 deleting-properties.md#Using undefined assignment|-84.48%|
|😱 deleting-properties.md#Using undefined property (cached proto: null)|-83.59%|
|😱 compare-using-instanceof.md#[False conditional] Using instanceof only|-84.08%|
|😱 compare-using-instanceof.md#[False conditional] Using constructor name|-84.09%|
|😱 compare-using-instanceof.md#[False conditional] Check if property is valid then instanceof|-83.85%|
|😱 add-property.md#Directly in the object|-84.10%|
|😱 add-property.md#Using dot notation|-83.85%|
|😱 object-creation.md#Object.create(null)|-38.45%|
|😱 object-creation.md#Cached Empty.prototype|-84.24%|
|😱 optional-chain-vs-and-operator.md#Using optional chain (obj.field?.field2) (Valid)|-84.28%|
|😱 optional-chain-vs-and-operator.md#Using optional chain (obj.field?.field2) (undefined)|-83.83%|
|😱 optional-chain-vs-and-operator.md#Using and operator (obj.field && obj.field.field2) (Valid)|-84.36%|
|😱 optional-chain-vs-and-operator.md#Using and operator (obj.field && obj.field.field2) (undefined)|-83.71%|
|😱 keys-vs-getownpropertynames.md#Using Object.keys()|-40.15%|
|😱 keys-vs-getownpropertynames.md#Using Object.getOwnPropertyNames()|-39.52%|
|😱 parse-int.md#Using parseInt(x, 10) - small number (2 len)|-84.40%|
|😱 parse-int.md#Using parseInt(x, 10) - big number (10 len)|-84.63%|
|😱 parse-int.md#Using + - small number (2 len)|-84.25%|
|😱 parse-int.md#Using + - big number (10 len)|-84.03%|
|😱 last-array-item.md#Length = 100 - Array.at|-86.41%|
|😱 last-array-item.md#Length = 10_000 - Array.at|-86.57%|
|😱 last-array-item.md#Length = 1_000_000 - Array.at|-86.74%|
|😱 last-array-item.md#Length = 100 - Array[length - 1]|-84.76%|
|😱 last-array-item.md#Length = 10_000 - Array[length - 1]|-84.87%|
|😱 last-array-item.md#Length = 1_000_000 - Array[length - 1]|-84%|

Source: https://github.com/RafaelGSS/nodejs-bench-operations/actions/runs/8946165774/job/24576348629#step:5:56

lemire commented 2 months ago

A lot of these issues appear to have to do with relatively cheap functions. Maybe there is an additional overhead?

eugene1g commented 2 months ago

> A lot of these issues appear to have to do with relatively cheap functions. Maybe there is an additional overhead?

+1, something must be going wrong with the test harness. For example, one test, includes-vs-raw-comparison.md#Using raw comparison (first item), shows up as ~10x slower, but the test is a simple string comparison and there is no way that is ~10x slower in V8 12 vs V8 11 (Maglev or not). I ran a few simple, isolated benchmarks timing the execution of [].includes etc., and performance is effectively identical between v22 and v20.
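A minimal sketch of the kind of isolated check described here (not the actual script eugene1g ran): time Array.prototype.includes in a bare loop, keeping the result observable in an outer accumulator so the loop body cannot be eliminated as dead code.

```javascript
// Time Array.prototype.includes directly, outside any benchmark harness.
const versions = ['2.0', '1.0', '1.1']
let hits = 0 // outer accumulator keeps the result live

const start = process.hrtime.bigint()
for (let i = 0; i < 1_000_000; i++) {
  if (versions.includes('1.1')) hits++
}
const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6

console.log(`hits=${hits}, elapsed=${elapsedMs.toFixed(2)}ms`)
```

Running this under different Node.js versions gives a harness-free baseline for the operation itself.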

eugene1g commented 2 months ago

Theory: before Node v22, V8 applied dead-code elimination and reduced simple test functions to a no-op, which artificially inflated performance numbers. For example, there is a simple test case like this -

```javascript
suite
  .add('using Array.includes', function () {
    const httpVersion = '1.1'
    const exists = ['2.0', '1.0', '1.1'].includes(httpVersion)
  })
```

This test is simple and doesn't do much, but it does do something. Yet on v21 it has the same ops/sec as an empty function -

```javascript
suite
  .add('do nothing', function () { })
```

This makes me think that V8 optimizes that test into a no-op. If I modify the test suite slightly, like so -

```javascript
const suite = createBenchmarkSuite('Array.includes vs raw comparison')
let exists // outer variable so V8 does not apply dead-code elimination to our test logic
suite
  .add('using Array.includes', function () {
    const httpVersion = '1.1'
    exists = ['2.0', '1.0', '1.1'].includes(httpVersion)
  })
  .add('using Array.includes (first item)', function () {
    const httpVersion = '2.0'
    exists = ['2.0', '1.0', '1.1'].includes(httpVersion)
  })
```

Then ops/sec for v21 "drops" by ~10x, and becomes comparable to ops/sec measured for v22.1 and comparable to Bun.

Perhaps some code-optimization logic changed in V8 12 and those simple tests are no longer reduced to a no-op, but actually have to execute code. If true, it's not that the Node v22.1 numbers are slow; it's that the baseline from v21 was wrong.
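As a hedged, standalone way to sanity-check this theory without the benchmark harness (plain Node.js only, no createBenchmarkSuite): time an empty function against the includes test while writing the result to an outer variable. If the theory holds, the two should no longer report near-identical throughput on pre-v22 versions.

```javascript
let exists = false // outer sink defeats dead-code elimination

function empty () {}
function includesTest () {
  exists = ['2.0', '1.0', '1.1'].includes('1.1')
}

// Crude wall-clock timer; returns milliseconds for `iterations` calls.
function time (fn, iterations) {
  const start = process.hrtime.bigint()
  for (let i = 0; i < iterations; i++) fn()
  return Number(process.hrtime.bigint() - start) / 1e6
}

const emptyMs = time(empty, 5_000_000)
const includesMs = time(includesTest, 5_000_000)
console.log({ emptyMs, includesMs, exists })
```

The absolute numbers are meaningless across machines; the interesting signal is whether the two timings converge (elimination) or diverge (real work).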

lemire commented 2 months ago

> V8 applied dead-code elimination

It makes a lot of sense.

gurgunday commented 2 months ago

> This test is simple, and doesn't do much, but it does do something. Yet for v21, it has the same ops/sec as an empty function

Yeah could be it, looks to me like a reality check for micro benchmarks 😄

gurgunday commented 2 months ago

But even with your suggestion, node 22 seems slower overall in the benchmarks (macOS)

Example

node 20:

## startsWith comparison

|name|ops/sec|samples|
|-|-|-|
|(short string) (true) String#startsWith|155,325,867|99|
|(short string) (true) String#slice and strict comparison|108,541,352|95|
|(long string) (true) String#startsWith|107,012,984|101|
|(long string) (true) String#slice and strict comparison|97,092,528|96|
|(short string) (false) String#startsWith|155,958,923|97|
|(short string) (false) String#slice and strict comparison|106,668,930|99|
|(long string) (false) String#startsWith|152,774,020|96|
|(long string) (false) String#slice and strict comparison|96,296,096|100|

node 22:

## startsWith comparison

|name|ops/sec|samples|
|-|-|-|
|(short string) (true) String#startsWith|109,696,453|97|
|(short string) (true) String#slice and strict comparison|78,338,981|91|
|(long string) (true) String#startsWith|71,833,592|91|
|(long string) (true) String#slice and strict comparison|65,265,958|91|
|(short string) (false) String#startsWith|109,524,264|98|
|(short string) (false) String#slice and strict comparison|77,999,196|91|
|(long string) (false) String#startsWith|105,947,492|90|
|(long string) (false) String#slice and strict comparison|65,677,907|94|

Benchmark (modified from nodejs-bench-operations):

```javascript
const { createBenchmarkSuite } = require('../common')

const suite = createBenchmarkSuite('startsWith comparison')
const shortString = 'foobar'
const longString = 'foobar'.repeat(100)

const comparison = 'foo'
const comparison2 = 'bar'

let result = false

suite
  .add('(short string) (true) String#startsWith', function () {
    result = shortString.startsWith(comparison)
  })
  .add('(short string) (true) String#slice and strict comparison', function () {
    result = shortString.slice(0, comparison.length) === comparison
  })
  .add('(long string) (true) String#startsWith', function () {
    result = longString.startsWith(comparison)
  })
  .add('(long string) (true) String#slice and strict comparison', function () {
    result = longString.slice(0, comparison.length) === comparison
  })
  .add('(short string) (false) String#startsWith', function () {
    result = shortString.startsWith(comparison2)
  })
  .add('(short string) (false) String#slice and strict comparison', function () {
    result = shortString.slice(0, comparison2.length) === comparison2
  })
  .add('(long string) (false) String#startsWith', function () {
    result = longString.startsWith(comparison2)
  })
  .add('(long string) (false) String#slice and strict comparison', function () {
    result = longString.slice(0, comparison2.length) === comparison2
  })
  .run({ async: false })
```
eugene1g commented 2 months ago

> But even with your suggestion, node 22 seems slower overall in the benchmarks (macOS)

Yes, and for these micro-benchmarks, I also saw ~30% drop in ops/sec from v20 to v22. However, when I did a dumb manual test by inlining the logic like so -

```javascript
const longString = 'foobar'.repeat(100)

let result = false
const comparison = 'foo'

console.time('run')
for (let i = 0; i < 100_000_000; i++) {
  result = longString.startsWith(comparison)
}
console.timeEnd('run')
```

The results were ~identical between v20 and v22. This ignores any V8 optimizations for functions and directly checks this operation's performance. Something changed in how v22 runs the test harness, and that obscures the performance of the actual code under test.
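One way to probe this "harness overhead" hypothesis (a sketch, not eugene1g's actual test): run the same startsWith operation through a function indirection, the way a benchmark harness invokes test cases, and compare the wall time against the inlined loop above on the same machine and Node.js version.

```javascript
const longString = 'foobar'.repeat(100)
const comparison = 'foo'
let result = false

// The operation under test, wrapped the way a harness would call it.
const testFn = function () {
  result = longString.startsWith(comparison)
}

console.time('wrapped run')
for (let i = 0; i < 10_000_000; i++) {
  testFn()
}
console.timeEnd('wrapped run')
console.log('result:', result)
```

If the inlined loop is stable across versions but the wrapped variant regresses, the difference sits in function-call handling or tiering, not in startsWith itself.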

RafaelGSS commented 2 months ago

Have you checked V8 optimization (maybe with --trace-opt / --trace-deopt / --trace-maglev-inlining)?

I also suspect this is due to how dead-code elimination works with the new compiler. Let me clarify a few things about the nodejs-bench-operations repository. Its purpose is to compare two versions of Node.js performing the same operation in the same environment (a dedicated Linux machine). If version 21 applies dead-code elimination (reducing a test to a no-op) and version 22 doesn't, the comparison should show that. The absolute percentage is largely irrelevant in microbenchmarking, as the operation might represent 0.05% of your real execution time.

However, if we add extra operations or somehow disable dead-code elimination (for instance, with %DoNotOptimize), the microbenchmark won't give us any insight either, as it's unlikely that operation Y will behave the same way in production (where it is likely to be optimized). It would be pretty much the same as measuring FLOPS or raw CPU instructions.

My purpose in raising this issue is to suggest that we focus our investigation on more realistic benchmarks. I have another repository for this, which I will make public soon. Something has changed; now we should understand whether it is indeed a regression (caused by some operations no longer being flagged as "optimizable") or not.

RafaelGSS commented 2 months ago

Please also consider the environment used for each benchmark (Ubuntu 22.04 x64, 4 vCPUs, 16 GB - m6a.xlarge). We are discussing some approaches for more consistent and "closer to reality" benchmarking in https://github.com/RafaelGSS/nodejs-bench-operations/issues/74.

eugene1g commented 2 months ago

> Have you checked V8 optimization (maybe with --trace-opt / --trace-deopt / --trace-maglev-inlining)?

Yes, I did. I also tried various combinations of those, plus --print-bytecode and --allow-natives-syntax to get %OptimizeFunctionOnNextCall and monitoring via %GetOptimizationStatus working. My understanding of that output (bytecode, etc.) is limited, but I could not find any meaningful differences between v22 and v20, especially when running v22 with --no-maglev. Bytecode and optimization flags looked the same across both versions, yet there is no way test code should run as quickly as a no-op function unless it's being eliminated. I couldn't isolate that information in the bytecode, though - maybe V8 does the elimination pass on the bytecode rather than before it, so the generated bytecode wouldn't show it. But, again, my understanding of V8 internals is rudimentary.

gurgunday commented 2 months ago

Could be due to Turboshaft, fully introduced in V8 12.0; maybe it doesn't do dead-code elimination the same way.

> Turboshaft: new architecture for the top-tier optimizing compiler
>
> Maglev wasn't our only investment in improved compiler technology. We've also introduced Turboshaft, a new internal architecture for our top-tier optimizing compiler Turbofan, making it both easier to extend with new optimizations and faster at compiling. Since Chrome 120, the CPU-agnostic backend phases all use Turboshaft rather than Turbofan, and compile about twice as fast as before. This is saving energy and is paving the way for more exciting performance gains next year and beyond. Keep an eye out for updates!

https://v8.dev/blog/holiday-season-2023

RafaelGSS commented 1 month ago

So, I have investigated this a bit, and it's unlikely that there is a regression; rather, a different benchmarking approach is required.

Before Maglev, the benchmarked code was optimized directly to TURBOFAN during the benchmark clock. Consider the following analogy:

  1. benchmark.clock.start()
  2. benchmark.start() -> runs the bench function several times
  3. during this process, V8 marks the code for optimization and optimizes it in the next cycle
    • Interpreted code -> marked for optimization (target: TURBOFAN) -> optimized
  4. benchmark.clock.end()

Note that the collected samples therefore mixed interpreted-code measurements with Turbofan-code measurements.

Now, with MAGLEV in the middle, a few more operations are executed during the benchmark clock measurement.

  1. benchmark.clock.start()
  2. benchmark.start() -> runs the bench function several times
  3. during this process, V8 marks the code for optimization and optimizes it in the next cycle
    • Interpreted code -> marked for optimization (target: MAGLEV) -> optimized
  4. when V8 identifies the code as hot & stable, it marks the code for optimization targeting TURBOFAN
    • Maglev code -> marked for optimization (target: TURBOFAN) -> optimized
  5. benchmark.clock.end()

Therefore, I'm adjusting the nodejs-bench-operations setup accordingly. For now, disregard these results.


Results obtained with --trace-opt-verbose:

```
[marking 0x15f907a1dd31 <JSFunction testDirectInObj (sfi = 0x2948236ff449)> for optimization to MAGLEV, ConcurrencyMode::kConcurrent, reason: hot and stable]
[completed compiling 0x2f489461c549 <JSFunction get (sfi = 0x112422403231)> (target MAGLEV) - took 0.001, 1.106, 0.008 ms]
[compiling method 0x15f907a1dd31 <JSFunction testDirectInObj (sfi = 0x2948236ff449)> (target MAGLEV), mode: ConcurrencyMode::kConcurrent]
[completed compiling 0x15f907a1dd31 <JSFunction testDirectInObj (sfi = 0x2948236ff449)> (target MAGLEV) - took 0.000, 0.041, 0.002 ms]
...
[marking 0x278484a0e049 <JSFunction testDirectInObj (sfi = 0x2948236ff449)> for optimization to TURBOFAN, ConcurrencyMode::kConcurrent, reason: hot and stable]
[compiling method 0x278484a0e049 <JSFunction testDirectInObj (sfi = 0x2948236ff449)> (target TURBOFAN), mode: ConcurrencyMode::kConcurrent]
[not marking function h (MAGLEV) for optimization: already queued]
[completed compiling 0x278484a0e049 <JSFunction testDirectInObj (sfi = 0x2948236ff449)> (target TURBOFAN) - took 0.004, 0.181, 0.009 ms]
[completed optimizing 0x278484a0e049 <JSFunction testDirectInObj (sfi = 0x2948236ff449)> (target TURBOFAN)]
```