rust-lang / rust

100% CPU usage when running doc tests #128538

Open tingerrr opened 2 months ago

tingerrr commented 2 months ago

The Problem

For some time now, and I don't know exactly since when, running any variant of `cargo test --doc` pushes my CPU to 100% so hard that it crashes my PC if I don't Ctrl+C early enough. Running other tests, namely unit tests, works just fine; I haven't tried running any integration tests.

I expected to see this happen: the doc tests run basically as fast as regular tests.

Instead, this happened: My PC locks up and sometimes crashes.

Meta

Toolchains

I tried this on various versions, and it has persisted across all of them:

Repositories

I have tried this on various repositories like:

Hardware

This happens consistently on two PCs:

PC 1

`Linux twinkbook 6.10.2-arch1-1 #1 SMP PREEMPT_DYNAMIC Sat, 27 Jul 2024 16:49:55 +0000 x86_64 GNU/Linux`

Output of `lscpu`:

```
Architecture:            x86_64
CPU op-mode(s):          32-bit, 64-bit
Address sizes:           39 bits physical, 48 bits virtual
Byte Order:              Little Endian
CPU(s):                  8
On-line CPU(s) list:     0-7
Vendor ID:               GenuineIntel
Model name:              11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
CPU family:              6
Model:                   140
Thread(s) per core:      2
Core(s) per socket:      4
Socket(s):               1
Stepping:                1
CPU(s) scaling MHz:      24%
CPU max MHz:             4700,0000
CPU min MHz:             400,0000
BogoMIPS:                5608,00
Flags:                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l2 cdp_l2 ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves split_lock_detect user_shstk dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid movdiri movdir64b fsrm avx512_vp2intersect md_clear ibt flush_l1d arch_capabilities
Virtualization features:
  Virtualization:        VT-x
Caches (sum of all):
  L1d:                   192 KiB (4 instances)
  L1i:                   128 KiB (4 instances)
  L2:                    5 MiB (4 instances)
  L3:                    12 MiB (1 instance)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-7
Vulnerabilities:
  Gather data sampling:  Mitigation; Microcode
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Reg file data sampling: Not affected
  Retbleed:              Not affected
  Spec rstack overflow:  Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI SW loop, KVM SW loop
  Srbds:                 Not affected
  Tsx async abort:       Not affected
```

PC 2

`Linux hsh 6.10.0-arch1-2 #1 SMP PREEMPT_DYNAMIC Mon, 22 Jul 2024 17:28:23 +0000 x86_64 GNU/Linux`

Output of `lscpu`:

```
Architecture:            x86_64
CPU op-mode(s):          32-bit, 64-bit
Address sizes:           39 bits physical, 48 bits virtual
Byte Order:              Little Endian
CPU(s):                  8
On-line CPU(s) list:     0-7
Vendor ID:               GenuineIntel
Model name:              Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
CPU family:              6
Model:                   158
Thread(s) per core:      2
Core(s) per socket:      4
Socket(s):               1
Stepping:                9
CPU(s) scaling MHz:      18%
CPU max MHz:             4500.0000
CPU min MHz:             800.0000
BogoMIPS:                8403.00
Flags:                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault pti ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp vnmi md_clear flush_l1d arch_capabilities
Virtualization features:
  Virtualization:        VT-x
Caches (sum of all):
  L1d:                   128 KiB (4 instances)
  L1i:                   128 KiB (4 instances)
  L2:                    1 MiB (4 instances)
  L3:                    8 MiB (1 instance)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-7
Vulnerabilities:
  Gather data sampling:  Mitigation; Microcode
  Itlb multihit:         KVM: Mitigation: VMX disabled
  L1tf:                  Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
  Mds:                   Mitigation; Clear CPU buffers; SMT vulnerable
  Meltdown:              Mitigation; PTI
  Mmio stale data:       Mitigation; Clear CPU buffers; SMT vulnerable
  Reg file data sampling: Not affected
  Retbleed:              Mitigation; IBRS
  Spec rstack overflow:  Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; IBRS; IBPB conditional; STIBP conditional; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
  Srbds:                 Mitigation; Microcode
  Tsx async abort:       Mitigation; TSX disabled
```

I'll happily add more info if requested.

Noratrieb commented 2 months ago

That's really weird? Do you have any other weird behavior with other programs?

tingerrr commented 2 months ago

> That's really weird? Do you have any other weird behavior with other programs?

No other programs cause the PCs to randomly hang. When I run normal tests I also hit 100% CPU at times, but the PC stays perfectly usable, with no lock-up of input device IO.

PC 2 had problems with Discord freezing after I installed the KDE desktop-portal backend; that one runs Arch and bspwm. I doubt this is related.

Noratrieb commented 2 months ago

I tried to reproduce this one locally:

`cargo +1.80 test --doc --workspace` on typst-test#a8aea584, fails on the typst-test-lib crate

It completes within a few seconds for me locally, but it does use about 8 cores at 100% (multiple `ld` processes) for a few seconds before executing. Given that your CPU only has 4 cores × 2 threads, that will indeed use all of your CPU (I have 16 cores × 2 threads, so it didn't cause issues for me).

The linker invocation suggests that there's LTO happening (there are `-plugin` flags), which is a bit strange... surely there wouldn't be LTO for doctests?? Your workspace doesn't seem to enable anything strange. FWIW, I see the same behavior on the latest nightly.

@GuillaumeGomez you're familiar with doctests, is it expected that doctests do a lot of heavy parallel linking? And also, could it be that there is LTO going on or is this maybe just a default flag...

While it's a bit strange that this usage is happening here, programs using as much CPU as they can isn't really anything out of the ordinary; it does suggest there's also something wrong with your system if that alone causes such issues for you.

Noratrieb commented 2 months ago

This will be improved by #126245.

the8472 commented 2 months ago

Does lowering the priority help? `nice -n 20 cargo test --doc`? Or does it actually run out of memory?

GuillaumeGomez commented 2 months ago

> @GuillaumeGomez you're familiar with doctests, is it expected that doctests do a lot of heavy parallel linking? And also, could it be that there is LTO going on or is this maybe just a default flag...

Each doctest is compiled separately, so I assume a lot of linking is done as well.
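
To make that concrete, here's a minimal sketch (the crate name `my_crate` is hypothetical): every fenced example in a doc comment becomes a separate test binary, so a crate with dozens of doctests triggers dozens of compile-and-link jobs.

````rust
/// Adds two numbers.
///
/// rustdoc extracts the fenced example below, wraps it in a `fn main`,
/// compiles it as its own tiny crate, and links it against the library:
/// one full compile-and-link cycle per doctest.
///
/// ```
/// assert_eq!(my_crate::add(2, 2), 4); // `my_crate` is a made-up name
/// ```
pub fn add(a: u64, b: u64) -> u64 {
    a + b
}
````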

As mentioned by @Noratrieb, it should be greatly improved once https://github.com/rust-lang/rust/pull/126245 is merged.

However, it remains extremely surprising. Maybe limit the number of parallel jobs?
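
For example (a sketch; note that `-j` caps cargo's own build jobs, while the doctest harness's parallelism is normally governed by libtest's `--test-threads`):

```console
# Cap cargo's build parallelism:
cargo test --doc -j 2

# Serialize the doctest harness itself (assuming it honors the
# usual libtest flag):
cargo test --doc -- --test-threads=1
```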

bjorn3 commented 2 months ago

> The linker invocation suggests that there's LTO happening (there are `-plugin` flags), which is a bit strange... surely there wouldn't be LTO for doctests??

I believe gcc passes `-plugin` for the GCC LTO linker plugin to the linker unconditionally. Unless you explicitly use `-Clinker-plugin-lto`, all LTO is done by rustc in the case of Rust code, not by the linker.
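
In other words, a sketch of the opt-in (nothing here is enabled by default):

```console
# Linker-side LTO only happens if you ask for it explicitly; without
# this flag, any LTO happens inside rustc before the linker runs:
RUSTFLAGS="-Clinker-plugin-lto" cargo build --release
```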

tingerrr commented 2 months ago

Since I'm at home, I'm testing on PC 2. Neither using a low number of jobs like `-j 4` nor using `nice` helps with 1.80.

However, 1.82-nightly works fine; it took 6 seconds, but it works.

Output of `rustc -vV` for the failing toolchain:

```
rustc 1.80.0 (051478957 2024-07-21)
binary: rustc
commit-hash: 051478957371ee0084a7c0913941d2a8c4757bb9
commit-date: 2024-07-21
host: x86_64-unknown-linux-gnu
release: 1.80.0
LLVM version: 18.1.7
```

> While it's a bit strange that this usage is happening here, programs using as much CPU as they can isn't really anything out of the ordinary; it does suggest there's also something wrong with your system if that alone causes such issues for you.

That's what I was thinking too, but it happens on two PCs that have only one thing in common: both distros are Arch or Arch-derived. So if that were the culprit, surely others would have it too.

the8472 commented 2 months ago

Try running `watch -n 0.5 cat /proc/pressure/*` in a separate terminal. That should be the first step in determining whether it's pure CPU load or whether the memory or IO subsystem is involved too.
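
For reference, each of those files looks like this (numbers below are illustrative); `avg10`/`avg60`/`avg300` are the percentage of wall-clock time stalled on that resource over 10/60/300-second windows:

```console
$ cat /proc/pressure/memory
some avg10=0.00 avg60=0.00 avg300=0.00 total=11220825
full avg10=0.00 avg60=0.00 avg300=0.00 total=10947370
```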

tingerrr commented 2 months ago

I don't know what values I should be expecting, but the 10-second average for both `/proc/pressure/cpu` and `/proc/pressure/io` is around 10-15 when running 1.80, whereas with 1.82-nightly it's basically just `/proc/pressure/cpu` that shows any pressure. It seems the culprit is lots of IO?

the8472 commented 2 months ago

15% IO or CPU pressure might make things a bit sluggish, but it's definitely not "crashing the system" territory. And since you're not seeing memory pressure, it's unlikely that it's getting OOM-killed either (though check dmesg/journalctl to see whether the OOM killer left a note).

On Linux, nightly defaults to lld instead of the system's default linker; that could be making a difference.
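
One way to isolate that variable would be to force lld on 1.80 as well, e.g. (assuming clang and lld are installed):

```console
# Link via clang so -fuse-ld=lld is understood, then re-run the repro:
RUSTFLAGS="-Clinker=clang -Clink-arg=-fuse-ld=lld" cargo +1.80 test --doc
```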

Anyway, you should try investigating what's actually killing your system. Overheating? Watchdog? Resource exhaustion?

"crashes my PC" is not a very precise way to describe the symptoms. Power off, hard reboot, hanging, gui freezing, only some processes getting killed? Anything in the logs? Can you still Ctrl+Alt+F2 to a text console?

tingerrr commented 2 months ago

It's mostly the input freezing, to the point that only holding the power button works. One time it simply shut off on its own; other times the Wayland compositor on PC 1 restarted.

I can't easily test this, because it requires a restart for every try. Quite frankly, I don't have that much time to investigate; I'll see what I can find out the next time I get to it.

saethlin commented 1 month ago

> It's mostly the input freezing, to the point that only holding the power button works.

This really sounds like an OOM. That's exactly what my Arch system does for some memory exhaustion cases. In other cases it'll kill random components of the system (like browser processes and pulseaudio) until it eventually kills the responsible program. But sometimes the system just grinds to a halt and I have to reboot it with the power button.

I'll poke at this tomorrow.

saethlin commented 1 month ago

@tingerrr What does `free -h` say on both PCs? (Run it at any time; the system can be idle or running whatever.)

tingerrr commented 1 month ago

PC 1:

```
               total        used        free      shared  buff/cache   available
Mem:            15Gi       4,4Gi       8,2Gi       1,3Gi       4,3Gi        10Gi
Swap:             0B          0B          0B
```

PC 2:

```
               total        used        free      shared  buff/cache   available
Mem:            15Gi       2.1Gi        11Gi        31Mi       2.4Gi        13Gi
Swap:             0B          0B          0B
```

the8472 commented 1 month ago

Oh, no swap. Yeah, OK, then there's much less opportunity for paging before an OOM occurs (only writeback and explicitly disk-backed mappings).

In that case you should check journalctl for any OOM-kill messages after a hang/reboot. It's not guaranteed that those messages make it to disk after a crash, but perhaps something is there. Another option is dropping to a text console via Ctrl+Alt+F2, logging in as root, and checking whether there's anything in dmesg.
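
Something along these lines (`-b -1` selects the previous boot's log):

```console
# Kernel messages from the previous boot, filtered for OOM activity:
journalctl -k -b -1 | grep -i -e oom -e "out of memory"
```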

I'd recommend adding some swap, if possible. You could also add a userspace OOM daemon, though some of them require proper configuration; otherwise they'll either do nothing or end up killing your entire login session whenever there's some pressure.
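
The usual swapfile recipe, if you go that route (size is a matter of taste; on btrfs the file needs extra steps):

```console
# Create and enable an 8 GiB swapfile:
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# Make it permanent across reboots:
echo '/swapfile none swap defaults 0 0' | sudo tee -a /etc/fstab
```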

saethlin commented 1 month ago

I agree with that advice in general, but for this to just be an OOM, the builds in question would need to be using 16 GB or so. My quick experimentation with the defaults indicates they peak at or below 8 GB. I'm going to experiment with some settings later and see if I can drive the peak memory close to 16 GB.
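
For anyone wanting to reproduce that measurement, one rough option is GNU time (caveat: for child processes, `ru_maxrss` reports the single largest waited-for process, not the sum of concurrent rustc/linker processes, so it understates the aggregate peak):

```console
# GNU time, not the shell builtin; -v prints peak resident set size:
/usr/bin/time -v cargo +1.80 test --doc 2>&1 | grep "Maximum resident set size"
```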