nodejs / node

Node.js JavaScript runtime ✨🐢🚀✨
https://nodejs.org

`process.report.getReport` slower from 20.2.0 to 20.3.0 on Linux #48831

Open H4ad opened 1 year ago

H4ad commented 1 year ago

Script:

```js
const now = performance.now();
process.report.getReport();
console.log(`Took ${performance.now() - now}ms`);
```

On Node 20.2.0:

```
> node get-report.js
Took 26.467950999736786ms
```

On Node 20.3.0:

```
> node get-report.js
Took 609.6318630054593ms
```

I did a bisect, and the commit that introduced this slowdown was https://github.com/nodejs/node/commit/bfcb3d1d9a876f399013d326bd65804f9eda77e4, which upgraded libuv.

But the actual problem was introduced (I think) by https://github.com/libuv/libuv/commit/434eb4b0ac29e853e85562119fda3fa681642f17: instead of assuming the same frequency information for all CPUs, libuv now reads /sys/devices/system/cpu/cpu%u/cpufreq/scaling_cur_freq for every CPU.

I'll also open an issue on libuv, but I want to keep this one open to track the slowdown.

My machine:

cc @nodejs/performance

H4ad commented 1 year ago

As discussed at https://github.com/libuv/libuv/issues/4098, the slowdown is not in libuv itself; for some reason reading the actual frequency data is slow, and even cat can be slow:

```
time sudo cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
```

On my machine it took 600ms (32 CPUs), on another machine it took less than 1ms (12 CPUs), and on a third it took 1.6s (16 CPUs), so this problem is very strange. Calling cat repeatedly in quick succession returns fast, but if you wait about 2s between calls it gets slow again.

I proposed adding a flag to uv_cpu_info to choose how the CPU frequency data is obtained. In theory we could go back to the old behavior of reporting an approximate frequency (and maybe introduce a new API to get the real one), but unless one of the libuv maintainers reopens the issue, it will not be implemented.

In this case, unless we remove the call to uv_cpu_info, Linux machines can hit this slowdown depending on many factors, not just the number of CPUs.

This issue mainly affects native bindings for Node.js because they use getReport to detect whether the system is musl, but I already created PRs to avoid using it in detect-libc and napi-rs, so at least for the most popular packages it will not be an issue.

feisuzhu commented 3 days ago

I just ran npm install xxx on a 384-core machine and it hung forever, which led me to this issue.

uv_cpu_info keeps getting called and I don't know when it will stop.

(screenshot: 2024-11-12-223125_1722x620_scrot)

feisuzhu commented 3 days ago
```
root@6c92459bf81a:/# node a.js
Took 3862.274124ms
root@6c92459bf81a:/#
```
H4ad commented 3 days ago

Oh, that explains the random slowdowns/hangs I saw on my machine, probably introduced by https://github.com/npm/npm-install-checks/commit/ebf9b9f4c08035d8e2c41cd9bc3302cd8bdc9184

/cc @nodejs/npm

H4ad commented 3 days ago

We can use the implementation I created for https://github.com/lovell/detect-libc/pull/19; it is much faster and solves the slowdown caused by getReport.

I will probably open a PR soon (once I have more time to investigate and confirm the issue).

feisuzhu commented 3 days ago

For anyone who runs into this, here's a workaround (this assumes glibc, e.g. Ubuntu; change 'glibc' to 'musl' if you are using Alpine):

```sh
cat <<EOF >> /usr/lib/node_modules/npm/node_modules/npm-install-checks/lib/current-env.js
function libc(osname) {
    return 'glibc';
}
EOF
npm install -g your-favorite-app
```
Tofandel commented 3 days ago

Related #55576 with fix #55602

Are you 100% sure this issue is with uv_cpu_info?

We really need to add options to getReport so we can choose specifically which info we are looking for, to avoid slowdowns.

H4ad commented 3 days ago

@Tofandel the network could also be a factor, but the main issue is probably uv_cpu_info, which we can't avoid without introducing a new flag; I think that would be the best approach, along with updating the npm code.

But I can try building Node.js from your branch later to confirm.

Tofandel commented 3 days ago

Is that 384 cores or logical processors? I'm guessing if you have 384 cores you have double the logical processors, uv_cpu_info runs for each of them synchronously, and the function might not be very fast.

4 seconds would be long but acceptable for a debug-only function; since it isn't used only for debugging, though, instead of adding more and more flags we should add options and, if possible, make it async.

feisuzhu commented 3 days ago

@Tofandel My bad, ~~the SMT is off so it's only 192 real cores~~ it's actually a single-socket node with SMT on:

```
$ lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          52 bits physical, 57 bits virtual
  Byte Order:             Little Endian
CPU(s):                   192
  On-line CPU(s) list:    0-191
Vendor ID:                AuthenticAMD
  Model name:             AMD EPYC 9654 96-Core Processor
    CPU family:           25
    Model:                17
    Thread(s) per core:   2
    Core(s) per socket:   96
    Socket(s):            1
    Stepping:             1
    Frequency boost:      enabled
    CPU max MHz:          3707.8120
    CPU min MHz:          1500.0000
    BogoMIPS:             4800.18
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_go
                          od nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm c
                          mp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_
                          l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx51
                          2ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xs
                          aveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vg
                          if v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flus
                          h_l1d
Virtualization features:
  Virtualization:         AMD-V
Caches (sum of all):
  L1d:                    3 MiB (96 instances)
  L1i:                    3 MiB (96 instances)
  L2:                     96 MiB (96 instances)
  L3:                     384 MiB (12 instances)
NUMA:
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-191
Vulnerabilities:
  Gather data sampling:   Not affected
  Itlb multihit:          Not affected
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Not affected
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Mitigation; safe RET
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
  Srbds:                  Not affected
  Tsx async abort:        Not affected
```
Tofandel commented 3 days ago

It seems it's actually 96 real cores and 192 logical processors (1 socket with 96 cores per socket * 2 threads per core).

Is it maybe an AMD-specific issue? It would be nice to find a common factor so we know where to look.

feisuzhu commented 3 days ago

Hmmm, you are right, I thought it was a 2-way machine.

Tofandel commented 3 days ago

It seems it's just randomly slow for me as well (8 real cores, Intel), so it's not just AMD:

```
time sudo cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
4000009
3999975
4000000
4000000
4000031
3999999
4000040
4000113

real    0m0.010s
user    0m0.007s
sys     0m0.003s

time sudo cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
4083445
4027384
4078306
4092856
4023428
4094010
4097951
4153294

real    0m0.166s
user    0m0.009s
sys     0m0.001s
```

It seems to depend on what the CPU is doing at the moment; if the CPU is mostly idle, reading the frequency becomes slow.