Open Peter2121 opened 2 weeks ago
@Peter2121 I only tested the code on FreeBSD 14.0. You are welcome to help fix the invalid detections.
@cyyever host-peter is under 13.3-RELEASE (my PC where I built cpuinfo), srv1 is under 14.0-RELEASE.
I rebuilt cpuinfo locally on this server - no changes.
@Peter2121 What is the output of sysctl kern.sched.topology_spec on the failing hosts?
srv1# sysctl kern.sched.topology_spec
kern.sched.topology_spec: <groups>
<group level="1" cache-level="0">
<cpu count="12" mask="fff,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0">0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11</cpu>
<children>
<group level="2" cache-level="3">
<cpu count="6" mask="3f,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0">0, 1, 2, 3, 4, 5</cpu>
<flags><flag name="NODE">NUMA node</flag></flags>
</group>
<group level="2" cache-level="3">
<cpu count="6" mask="fc0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0">6, 7, 8, 9, 10, 11</cpu>
<flags><flag name="NODE">NUMA node</flag></flags>
</group>
</children>
</group>
</groups>
nashost# sysctl kern.sched.topology_spec
kern.sched.topology_spec: <groups>
<group level="1" cache-level="0">
<cpu count="4" mask="f,0,0,0">0, 1, 2, 3</cpu>
<children>
<group level="2" cache-level="2">
<cpu count="2" mask="3,0,0,0">0, 1</cpu>
<flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
</group>
<group level="2" cache-level="2">
<cpu count="2" mask="c,0,0,0">2, 3</cpu>
<flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
</group>
</children>
</group>
</groups>
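For readers comparing the dumps above, here is a minimal sketch of how a topology_spec document could be summarized. It is written in Python for illustration only and is not cpuinfo's parser (which is C). The function names `parse_group`, `walk`, and `summarize` are hypothetical, and the rule "one core per SMT group, plus one core per CPU outside any SMT group" is an assumption about how to read the flags, not something stated in this thread.

```python
import xml.etree.ElementTree as ET

# Sample copied from the nashost output above (sysctl name prefix removed).
NASHOST_SPEC = """<groups>
<group level="1" cache-level="0">
<cpu count="4" mask="f,0,0,0">0, 1, 2, 3</cpu>
<children>
<group level="2" cache-level="2">
<cpu count="2" mask="3,0,0,0">0, 1</cpu>
<flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
</group>
<group level="2" cache-level="2">
<cpu count="2" mask="c,0,0,0">2, 3</cpu>
<flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
</group>
</children>
</group>
</groups>"""


def parse_group(group):
    """Turn one <group> element into a plain dict, recursing into <children>."""
    cpu_ids = [int(tok) for tok in group.find("cpu").text.split(",")]
    return {
        "level": int(group.get("level")),
        "cache_level": int(group.get("cache-level")),
        "cpus": cpu_ids,
        "flags": [f.get("name") for f in group.findall("flags/flag")],
        "children": [parse_group(g) for g in group.findall("children/group")],
    }


def walk(group):
    """Yield a group and all of its descendants."""
    yield group
    for child in group["children"]:
        yield from walk(child)


def summarize(spec_xml):
    """Count logical CPUs, cores, and NUMA nodes (assumed interpretation)."""
    root = parse_group(ET.fromstring(spec_xml).find("group"))
    smt_groups = [g for g in walk(root) if "SMT" in g["flags"]]
    in_smt = {cpu for g in smt_groups for cpu in g["cpus"]}
    # Assumption: each SMT group is one core; CPUs outside SMT groups are cores.
    cores = len(smt_groups) + len(set(root["cpus"]) - in_smt)
    numa_nodes = sum(1 for g in walk(root) if "NODE" in g["flags"])
    return {"logical": len(root["cpus"]), "cores": cores, "numa_nodes": numa_nodes}


if __name__ == "__main__":
    print(summarize(NASHOST_SPEC))
```

Under that assumption, the nashost dump reads as 4 logical CPUs on 2 cores with no NUMA nodes. The same function applied to the srv1 dump would treat all 12 CPUs as cores split across two NODE groups.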
@Peter2121 Could you help check the fix in #249?
The patched version works correctly on srv1.
It does not work on nashost:
illegal hardware instruction (core dumped)
It does not work on desktop anymore:
Error in cpuinfo: failed to parse topology_spec: <groups>
<group level="1" cache-level="3">
<cpu count="8" mask="ff,0,0,0">0, 1, 2, 3, 4, 5, 6, 7</cpu>
<children>
<group level="2" cache-level="2">
<cpu count="2" mask="3,0,0,0">0, 1</cpu>
<flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
</group>
<group level="2" cache-level="2">
<cpu count="2" mask="c,0,0,0">2, 3</cpu>
<flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
</group>
<group level="2" cache-level="2">
<cpu count="2" mask="30,0,0,0">4, 5</cpu>
<flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
</group>
<group level="2" cache-level="2">
<cpu count="2" mask="c0,0,0,0">6, 7</cpu>
<flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
</group>
</children>
</group>
</groups>
Error in cpuinfo: failed to detect topology
failed to initialize CPU information
@Peter2121 Can you git pull and re-check? If you are familiar with valgrind, can you print the valgrind outputs for invocations leading to errors such as "illegal hardware instruction (core dumped)"?
sudo pkg install valgrind
valgrind ./cpu-info
@cyyever I don't see any new commits in master here, so if I revert #249, I am up to date with the initial version. Please explain which version I need to test.
Test the PR mentioned in this discussion.
Anyway, for the patched version:
nashost# valgrind ./cpu-info
==464== Memcheck, a memory error detector
==464== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==464== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info
==464== Command: ./cpu-info
==464==
vex amd64->IR: unhandled instruction bytes: 0xC5 0xF8 0x57 0xC0 0xC5 0xFC 0x29 0x84 0x24 0x20
vex amd64->IR: REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR: VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=NONE
vex amd64->IR: PFX.66=0 PFX.F2=0 PFX.F3=0
==464== valgrind: Unrecognised instruction at address 0x205b76.
==464== at 0x205B76: cpuinfo_x86_freebsd_init (src/x86/freebsd/init.c:71)
==464== by 0x4873452: pthread_once (in /lib/libthr.so.3)
==464== by 0x2058D6: cpuinfo_initialize (src/init.c:28)
==464== by 0x204772: main (tools/cpu-info.c:291)
==464== Your program just tried to execute an instruction that Valgrind
==464== did not recognise. There are two possible reasons for this.
==464== 1. Your program has a bug and erroneously jumped to a non-code
==464== location. If you are running Memcheck and you just saw a
==464== warning about a bad jump, it's probably your program's fault.
==464== 2. The instruction is legitimate but Valgrind doesn't handle it,
==464== i.e. it's Valgrind's fault. If you think this is the case or
==464== you are not sure, please let us know and we'll try to fix it.
==464== Either way, Valgrind will now raise a SIGILL signal which will
==464== probably kill your program.
==464==
==464== Process terminating with default action of signal 4 (SIGILL): dumping core
==464== Illegal opcode at address 0x205B76
==464== at 0x205B76: cpuinfo_x86_freebsd_init (src/x86/freebsd/init.c:71)
==464== by 0x4873452: pthread_once (in /lib/libthr.so.3)
==464== by 0x2058D6: cpuinfo_initialize (src/init.c:28)
==464== by 0x204772: main (tools/cpu-info.c:291)
==464==
==464== HEAP SUMMARY:
==464== in use at exit: 2,288 bytes in 6 blocks
==464== total heap usage: 7 allocs, 1 frees, 2,802 bytes allocated
==464==
==464== LEAK SUMMARY:
==464== definitely lost: 128 bytes in 2 blocks
==464== indirectly lost: 0 bytes in 0 blocks
==464== possibly lost: 0 bytes in 0 blocks
==464== still reachable: 2,160 bytes in 4 blocks
==464== suppressed: 0 bytes in 0 blocks
==464== Rerun with --leak-check=full to see details of leaked memory
==464==
==464== For lists of detected and suppressed errors, rerun with: -s
==464== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
zsh: illegal hardware instruction valgrind ./cpu-info
Ah, I see that your PR was updated! :) I'll repatch and recheck shortly...
Thank you! Please help me check whether it still crashes.
After new patch from the PR:
OK on srv1
OK on desktop
Still crashes on nashost:
nashost# valgrind ./cpu-info
==19756== Memcheck, a memory error detector
==19756== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==19756== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info
==19756== Command: ./cpu-info
==19756==
vex amd64->IR: unhandled instruction bytes: 0xC5 0xF8 0x57 0xC0 0xC5 0xFC 0x29 0x84 0x24 0x20
vex amd64->IR: REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR: VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=NONE
vex amd64->IR: PFX.66=0 PFX.F2=0 PFX.F3=0
==19756== valgrind: Unrecognised instruction at address 0x2064f3.
==19756== at 0x2064F3: cpuinfo_x86_freebsd_init (src/x86/freebsd/init.c:71)
==19756== by 0x4873452: pthread_once (in /lib/libthr.so.3)
==19756== by 0x2062E2: cpuinfo_initialize (src/init.c:28)
==19756== by 0x205292: main (tools/cpu-info.c:291)
==19756== Your program just tried to execute an instruction that Valgrind
==19756== did not recognise. There are two possible reasons for this.
==19756== 1. Your program has a bug and erroneously jumped to a non-code
==19756== location. If you are running Memcheck and you just saw a
==19756== warning about a bad jump, it's probably your program's fault.
==19756== 2. The instruction is legitimate but Valgrind doesn't handle it,
==19756== i.e. it's Valgrind's fault. If you think this is the case or
==19756== you are not sure, please let us know and we'll try to fix it.
==19756== Either way, Valgrind will now raise a SIGILL signal which will
==19756== probably kill your program.
==19756==
==19756== Process terminating with default action of signal 4 (SIGILL): dumping core
==19756== Illegal opcode at address 0x2064F3
==19756== at 0x2064F3: cpuinfo_x86_freebsd_init (src/x86/freebsd/init.c:71)
==19756== by 0x4873452: pthread_once (in /lib/libthr.so.3)
==19756== by 0x2062E2: cpuinfo_initialize (src/init.c:28)
==19756== by 0x205292: main (tools/cpu-info.c:291)
==19756==
==19756== HEAP SUMMARY:
==19756== in use at exit: 2,288 bytes in 6 blocks
==19756== total heap usage: 7 allocs, 1 frees, 2,802 bytes allocated
==19756==
==19756== LEAK SUMMARY:
==19756== definitely lost: 128 bytes in 2 blocks
==19756== indirectly lost: 0 bytes in 0 blocks
==19756== possibly lost: 0 bytes in 0 blocks
==19756== still reachable: 2,160 bytes in 4 blocks
==19756== suppressed: 0 bytes in 0 blocks
==19756== Rerun with --leak-check=full to see details of leaked memory
==19756==
==19756== For lists of detected and suppressed errors, rerun with: -s
==19756== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
zsh: illegal hardware instruction valgrind ./cpu-info
IMHO, this is not very important as the server is 10+ years old. But it would be nice to understand the reason for the crash ;) I'll try to test on other physical CPUs if I find something with FreeBSD installed.
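One possible reading of the "unhandled instruction bytes" in the Valgrind output above: the sequence starts with 0xC5, which is the two-byte VEX prefix used by AVX instructions. The sketch below decodes only the prefix fields and the opcode byte for the two instructions in that dump. `decode_vex2` and the tiny `OPCODES` table are hypothetical helpers written for this illustration, not part of cpuinfo or Valgrind.

```python
# First ten bytes reported by Valgrind as "unhandled instruction bytes".
BYTES = bytes([0xC5, 0xF8, 0x57, 0xC0, 0xC5, 0xFC, 0x29, 0x84, 0x24, 0x20])

# Only the two opcodes seen in this trace (0F opcode map, no SIMD prefix).
OPCODES = {0x57: "vxorps", 0x29: "vmovaps (store)"}


def decode_vex2(data, offset=0):
    """Decode a 2-byte VEX prefix (0xC5 xx) and the opcode byte after it."""
    if data[offset] != 0xC5:
        raise ValueError("not a 2-byte VEX prefix")
    p = data[offset + 1]
    opcode = data[offset + 2]
    return {
        "vector_bits": 256 if p & 0x04 else 128,  # VEX.L selects xmm vs ymm
        "vvvv_reg": (~(p >> 3)) & 0x0F,           # field is stored inverted
        "pp": p & 0x03,                           # 0 means no 66/F3/F2 prefix
        "opcode": opcode,
        "mnemonic": OPCODES.get(opcode, "unknown"),
    }


if __name__ == "__main__":
    # The first instruction is 4 bytes long (prefix, opcode, ModRM),
    # so the second one starts at offset 4.
    print(decode_vex2(BYTES, 0))
    print(decode_vex2(BYTES, 4))
```

The first instruction decodes as a 128-bit vxorps and the second as a 256-bit vmovaps store, both AVX. That would be consistent with a binary compiled with AVX enabled running on a CPU without AVX, which fits a 10+ year old server, but the thread shows neither the build flags nor the CPU model of nashost, so this remains a hypothesis to verify.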
cpuinfo built from source (HEAD) on FreeBSD 13.3. Tested on 3 different machines.
So, PyTorch cannot be used correctly on the servers (only one CPU core is used). On the desktop it uses only one core as well :(