pytorch / cpuinfo

CPU INFOrmation library (x86/x86-64/ARM/ARM64, Linux/Windows/Android/macOS/iOS)
BSD 2-Clause "Simplified" License

FreeBSD: Xeon CPUs are not detected properly #248

Open Peter2121 opened 2 weeks ago

Peter2121 commented 2 weeks ago

cpuinfo built from source (HEAD) on FreeBSD 13.3. Tested on three different machines.

  1. Desktop with i7-2700 - OK:
host-peter% ./build/local/cpuid-dump
CPUID 00000000: 0000000D-756E6547-6C65746E-49656E69 [GenuineIntel]
CPUID 00000001: 000206A7-03100800-1F9AE3BF-BFEBFBFF
CPUID 00000002: 76035A01-00F0B2FF-00000000-00CA0000
CPUID 00000003: 00000000-00000000-00000000-00000000
CPUID 00000004: 1C004121-01C0003F-0000003F-00000000 [SL 00]
CPUID 00000004: 1C004122-01C0003F-0000003F-00000000 [SL 01]
CPUID 00000004: 1C004143-01C0003F-000001FF-00000000 [SL 02]
CPUID 00000004: 1C03C163-03C0003F-00001FFF-00000006 [SL 03]
CPUID 00000005: 00000040-00000040-00000003-00001120
CPUID 00000006: 00000077-00000002-00000009-00000000
CPUID 00000007: 00000000-00000000-00000000-00000000 [SL 00]
CPUID 00000008: 00000000-00000000-00000000-00000000
CPUID 00000009: 00000000-00000000-00000000-00000000
CPUID 0000000A: 07300403-00000000-00000000-00000603
CPUID 0000000B: 00000001-00000002-00000100-00000003 [SL 00]
CPUID 0000000B: 00000004-00000008-00000201-00000003 [SL 01]
CPUID 0000000C: 00000000-00000000-00000000-00000000
CPUID 0000000D: 00000007-00000340-00000340-00000000
CPUID 80000000: 80000008-00000000-00000000-00000000
CPUID 80000001: 00000000-00000000-00000001-28100800
CPUID 80000002: 20202020-49202020-6C65746E-20295228 [       Intel(R) ]
CPUID 80000003: 65726F43-294D5428-2D376920-30303732 [Core(TM) i7-2700]
CPUID 80000004: 5043204B-20402055-30352E33-007A4847 [K CPU @ 3.50GHz]
CPUID 80000005: 00000000-00000000-00000000-00000000
CPUID 80000006: 00000000-00000000-01006040-00000000
CPUID 80000007: 00000000-00000000-00000000-00000100
CPUID 80000008: 00003024-00000000-00000000-00000000
host-peter% ./build/local/cpu-info
Packages:
    0: Intel Core i7-2700K
Microarchitectures:
    4x Sandy Bridge
Cores:
    0: 2 processors (0-1), Intel Sandy Bridge
    1: 2 processors (2-3), Intel Sandy Bridge
    2: 2 processors (4-5), Intel Sandy Bridge
    3: 2 processors (6-7), Intel Sandy Bridge
Logical processors:
    0: APIC ID 0x00000000
    1: APIC ID 0x00000001
    2: APIC ID 0x00000002
    3: APIC ID 0x00000003
    4: APIC ID 0x00000004
    5: APIC ID 0x00000005
    6: APIC ID 0x00000006
    7: APIC ID 0x00000007
  2. Very old DELL server with Xeon - FAILED:
nashost# ./build/local/cpuid-dump
CPUID 00000000: 00000005-756E6547-6C65746E-49656E69 [GenuineIntel]
CPUID 00000001: 00000F41-00020800-0000641D-BFEBFBFF
CPUID 00000002: 605B5001-00000000-00000000-007C7040
CPUID 00000003: 00000000-00000000-00000000-00000000
CPUID 00000004: 00004121-01C0003F-0000001F-00000000 [SL 00]
CPUID 00000004: 00004143-01C0103F-000003FF-00000000 [SL 01]
CPUID 00000005: 00000040-00000040-00000000-00000000
CPUID 80000000: 80000008-00000000-00000000-00000000
CPUID 80000001: 00000000-00000000-00000000-20100800
CPUID 80000002: 20202020-20202020-20202020-20202020 [                ]
CPUID 80000003: 6E492020-286C6574-58202952-286E6F65 [  Intel(R) Xeon(]
CPUID 80000004: 20294D54-20555043-30322E33-007A4847 [TM) CPU 3.20GHz]
CPUID 80000005: 00000000-00000000-00000000-00000000
CPUID 80000006: 00000000-00000000-04006040-00000000
CPUID 80000007: 00000000-00000000-00000000-00000000
CPUID 80000008: 00003024-00000000-00000000-00000000
nashost# ./build/local/cpu-info
Error in cpuinfo: failed to detect topology
failed to initialize CPU information
  3. DELL server with Xeon - FAILED:
srv1# ./build/local/cpuid-dump
CPUID 00000000: 0000000D-756E6547-6C65746E-49656E69 [GenuineIntel]
CPUID 00000001: 000206D7-0A200800-1FBEE3FF-BFEBFBFF
CPUID 00000002: 76035A01-00F0B0FF-00000000-00CA0000
CPUID 00000003: 00000000-00000000-00000000-00000000
CPUID 00000004: 3C004121-01C0003F-0000003F-00000000 [SL 00]
CPUID 00000004: 3C004122-01C0003F-0000003F-00000000 [SL 01]
CPUID 00000004: 3C004143-01C0003F-000001FF-00000000 [SL 02]
CPUID 00000004: 3C07C163-04C0003F-00002FFF-00000006 [SL 03]
CPUID 00000005: 00000040-00000040-00000003-00021120
CPUID 00000006: 00000077-00000002-00000001-00000000
CPUID 00000007: 00000000-00000000-00000000-00000000 [SL 00]
CPUID 00000008: 00000000-00000000-00000000-00000000
CPUID 00000009: 00000000-00000000-00000000-00000000
CPUID 0000000A: 07300803-00000000-00000000-00000603
CPUID 0000000B: 00000001-00000001-00000100-0000000A [SL 00]
CPUID 0000000B: 00000005-00000006-00000201-0000000A [SL 01]
CPUID 0000000C: 00000000-00000000-00000000-00000000
CPUID 0000000D: 00000007-00000340-00000340-00000000
CPUID 80000000: 80000008-00000000-00000000-00000000
CPUID 80000001: 00000000-00000000-00000001-2C100800
CPUID 80000002: 20202020-49202020-6C65746E-20295228 [       Intel(R) ]
CPUID 80000003: 6E6F6558-20295228-20555043-342D3545 [Xeon(R) CPU E5-4]
CPUID 80000004: 20373136-20402030-30392E32-007A4847 [617 0 @ 2.90GHz]
CPUID 80000005: 00000000-00000000-00000000-00000000
CPUID 80000006: 00000000-00000000-01006040-00000000
CPUID 80000007: 00000000-00000000-00000000-00000100
CPUID 80000008: 0000302E-00000000-00000000-00000000
srv1# ./build/local/cpu-info
Error in cpuinfo: failed to detect topology
failed to initialize CPU information

As a result, PyTorch cannot be used properly on the servers (only one CPU core is used). On the desktop it also uses only one core :(
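For readers comparing the three dumps, the first register of `CPUID 00000001` (EAX) encodes the processor signature. The sketch below decodes it using the standard x86 display family/model rules; the function name `decode_signature` is made up for this illustration and is not part of cpuinfo.

```python
# Illustrative decoder for CPUID leaf 1 EAX (processor signature).
# The EAX values are copied from the cpuid-dump outputs above.

def decode_signature(eax: int) -> dict:
    """Split CPUID.1:EAX into display family, display model and stepping."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF

    # The extended family is only added when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model is only used for base family 0x6 or 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return {"family": family, "model": model, "stepping": stepping}


LEAF1_EAX = {
    "host-peter": 0x000206A7,  # i7-2700K
    "nashost": 0x00000F41,     # old Xeon 3.20GHz
    "srv1": 0x000206D7,        # Xeon E5-4617
}

if __name__ == "__main__":
    for host, eax in LEAF1_EAX.items():
        sig = decode_signature(eax)
        print(f"{host}: family {sig['family']:#x}, "
              f"model {sig['model']:#x}, stepping {sig['stepping']}")
```

With these inputs, host-peter and srv1 decode to family 6 (models 0x2A and 0x2D, both Sandy Bridge generation), while nashost decodes to family 0xF, which is the much older NetBurst generation. That difference may be relevant when comparing how the three hosts behave.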

cyyever commented 2 weeks ago

@Peter2121 I only tested the code on FreeBSD 14.0. You are welcome to help fix the incorrect detection.

Peter2121 commented 2 weeks ago

@cyyever host-peter runs 13.3-RELEASE (my PC, where I built cpuinfo); srv1 runs 14.0-RELEASE. I rebuilt cpuinfo locally on that server, and the result is the same.

cyyever commented 2 weeks ago

@Peter2121 What is the output of

sysctl kern.sched.topology_spec

on the failing hosts?

Peter2121 commented 2 weeks ago

srv1# sysctl kern.sched.topology_spec
kern.sched.topology_spec: <groups>
 <group level="1" cache-level="0">
  <cpu count="12" mask="fff,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0">0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11</cpu>
  <children>
   <group level="2" cache-level="3">
    <cpu count="6" mask="3f,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0">0, 1, 2, 3, 4, 5</cpu>
    <flags><flag name="NODE">NUMA node</flag></flags>
   </group>
   <group level="2" cache-level="3">
    <cpu count="6" mask="fc0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0">6, 7, 8, 9, 10, 11</cpu>
    <flags><flag name="NODE">NUMA node</flag></flags>
   </group>
  </children>
 </group>
</groups>
nashost# sysctl kern.sched.topology_spec
kern.sched.topology_spec: <groups>
 <group level="1" cache-level="0">
  <cpu count="4" mask="f,0,0,0">0, 1, 2, 3</cpu>
  <children>
   <group level="2" cache-level="2">
    <cpu count="2" mask="3,0,0,0">0, 1</cpu>
    <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
   </group>
   <group level="2" cache-level="2">
    <cpu count="2" mask="c,0,0,0">2, 3</cpu>
    <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
   </group>
  </children>
 </group>
</groups>
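To make the two `topology_spec` outputs above easier to compare, here is a minimal sketch that parses the XML and counts cores and logical processors. It is not cpuinfo's parser. It rests on one assumption: a leaf group (one without `<children>`) carrying the `SMT` flag is treated as a single core, and a leaf group without that flag is treated as one core per listed CPU. The names `leaf_groups` and `summarize` are hypothetical.

```python
# Sketch: count cores and logical CPUs from kern.sched.topology_spec XML.
# Assumption: a leaf group flagged "SMT" is one core; otherwise each CPU
# in the leaf group is its own core.
import xml.etree.ElementTree as ET

NASHOST_SPEC = """<groups>
 <group level="1" cache-level="0">
  <cpu count="4" mask="f,0,0,0">0, 1, 2, 3</cpu>
  <children>
   <group level="2" cache-level="2">
    <cpu count="2" mask="3,0,0,0">0, 1</cpu>
    <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
   </group>
   <group level="2" cache-level="2">
    <cpu count="2" mask="c,0,0,0">2, 3</cpu>
    <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
   </group>
  </children>
 </group>
</groups>"""

SRV1_SPEC = """<groups>
 <group level="1" cache-level="0">
  <cpu count="12" mask="fff,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0">0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11</cpu>
  <children>
   <group level="2" cache-level="3">
    <cpu count="6" mask="3f,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0">0, 1, 2, 3, 4, 5</cpu>
    <flags><flag name="NODE">NUMA node</flag></flags>
   </group>
   <group level="2" cache-level="3">
    <cpu count="6" mask="fc0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0">6, 7, 8, 9, 10, 11</cpu>
    <flags><flag name="NODE">NUMA node</flag></flags>
   </group>
  </children>
 </group>
</groups>"""


def leaf_groups(spec: str):
    """Return (cpu_ids, flag_names) for every group without <children>."""
    root = ET.fromstring(spec)
    leaves = []
    for group in root.iter("group"):
        if group.find("children") is None:
            cpus = [int(c) for c in group.find("cpu").text.split(",")]
            flags = {f.get("name") for f in group.findall("flags/flag")}
            leaves.append((cpus, flags))
    return leaves


def summarize(spec: str):
    """Return (cores, logical_processors) under the assumption above."""
    cores = logical = 0
    for cpus, flags in leaf_groups(spec):
        logical += len(cpus)
        cores += 1 if "SMT" in flags else len(cpus)
    return cores, logical


if __name__ == "__main__":
    print("nashost:", summarize(NASHOST_SPEC))
    print("srv1:", summarize(SRV1_SPEC))
```

Note that the two hosts produce differently shaped trees: on nashost the leaf groups are SMT pairs, while on srv1 they are NUMA nodes with no SMT level at all. A parser that expects only one of these shapes would fail on the other.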
cyyever commented 1 week ago

@Peter2121 Could you help check the fix in #249?

Peter2121 commented 1 week ago

The patched version works correctly on srv1.

It does not work on nashost: illegal hardware instruction (core dumped)

It no longer works on the desktop:

Error in cpuinfo: failed to parse topology_spec: <groups>
 <group level="1" cache-level="3">
  <cpu count="8" mask="ff,0,0,0">0, 1, 2, 3, 4, 5, 6, 7</cpu>
  <children>
   <group level="2" cache-level="2">
    <cpu count="2" mask="3,0,0,0">0, 1</cpu>
    <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
   </group>
   <group level="2" cache-level="2">
    <cpu count="2" mask="c,0,0,0">2, 3</cpu>
    <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
   </group>
   <group level="2" cache-level="2">
    <cpu count="2" mask="30,0,0,0">4, 5</cpu>
    <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
   </group>
   <group level="2" cache-level="2">
    <cpu count="2" mask="c0,0,0,0">6, 7</cpu>
    <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
   </group>
  </children>
 </group>
</groups>

Error in cpuinfo: failed to detect topology
failed to initialize CPU information

cyyever commented 1 week ago

@Peter2121 Can you git pull and re-check? If you are familiar with valgrind, can you post the valgrind output for the invocations that fail with errors such as "illegal hardware instruction (core dumped)"?

sudo pkg install valgrind
valgrind ./cpu-info

Peter2121 commented 5 days ago

@cyyever I don't see any new commits in master here, so if I revert #249 I am back at the initial version. Please explain which version I need to test.

cyyever commented 5 days ago

Test the PR mentioned in this discussion.

Peter2121 commented 5 days ago

Anyway, for the patched version:

nashost# valgrind ./cpu-info
==464== Memcheck, a memory error detector
==464== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==464== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info
==464== Command: ./cpu-info
==464==
vex amd64->IR: unhandled instruction bytes: 0xC5 0xF8 0x57 0xC0 0xC5 0xFC 0x29 0x84 0x24 0x20
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=NONE
vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=0
==464== valgrind: Unrecognised instruction at address 0x205b76.
==464==    at 0x205B76: cpuinfo_x86_freebsd_init (src/x86/freebsd/init.c:71)
==464==    by 0x4873452: pthread_once (in /lib/libthr.so.3)
==464==    by 0x2058D6: cpuinfo_initialize (src/init.c:28)
==464==    by 0x204772: main (tools/cpu-info.c:291)
==464== Your program just tried to execute an instruction that Valgrind
==464== did not recognise.  There are two possible reasons for this.
==464== 1. Your program has a bug and erroneously jumped to a non-code
==464==    location.  If you are running Memcheck and you just saw a
==464==    warning about a bad jump, it's probably your program's fault.
==464== 2. The instruction is legitimate but Valgrind doesn't handle it,
==464==    i.e. it's Valgrind's fault.  If you think this is the case or
==464==    you are not sure, please let us know and we'll try to fix it.
==464== Either way, Valgrind will now raise a SIGILL signal which will
==464== probably kill your program.
==464==
==464== Process terminating with default action of signal 4 (SIGILL): dumping core
==464==  Illegal opcode at address 0x205B76
==464==    at 0x205B76: cpuinfo_x86_freebsd_init (src/x86/freebsd/init.c:71)
==464==    by 0x4873452: pthread_once (in /lib/libthr.so.3)
==464==    by 0x2058D6: cpuinfo_initialize (src/init.c:28)
==464==    by 0x204772: main (tools/cpu-info.c:291)
==464==
==464== HEAP SUMMARY:
==464==     in use at exit: 2,288 bytes in 6 blocks
==464==   total heap usage: 7 allocs, 1 frees, 2,802 bytes allocated
==464==
==464== LEAK SUMMARY:
==464==    definitely lost: 128 bytes in 2 blocks
==464==    indirectly lost: 0 bytes in 0 blocks
==464==      possibly lost: 0 bytes in 0 blocks
==464==    still reachable: 2,160 bytes in 4 blocks
==464==         suppressed: 0 bytes in 0 blocks
==464== Rerun with --leak-check=full to see details of leaked memory
==464==
==464== For lists of detected and suppressed errors, rerun with: -s
==464== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
zsh: illegal hardware instruction  valgrind ./cpu-info

Peter2121 commented 5 days ago

Ah, I see that your PR was updated! :) I'll repatch and recheck shortly...

cyyever commented 5 days ago

Thank you! Please check whether it still crashes.

Peter2121 commented 5 days ago

After the new patch from the PR:

OK on srv1
OK on desktop
Still crashes on nashost:

nashost# valgrind ./cpu-info
==19756== Memcheck, a memory error detector
==19756== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==19756== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info
==19756== Command: ./cpu-info
==19756==
vex amd64->IR: unhandled instruction bytes: 0xC5 0xF8 0x57 0xC0 0xC5 0xFC 0x29 0x84 0x24 0x20
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=NONE
vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=0
==19756== valgrind: Unrecognised instruction at address 0x2064f3.
==19756==    at 0x2064F3: cpuinfo_x86_freebsd_init (src/x86/freebsd/init.c:71)
==19756==    by 0x4873452: pthread_once (in /lib/libthr.so.3)
==19756==    by 0x2062E2: cpuinfo_initialize (src/init.c:28)
==19756==    by 0x205292: main (tools/cpu-info.c:291)
==19756== Your program just tried to execute an instruction that Valgrind
==19756== did not recognise.  There are two possible reasons for this.
==19756== 1. Your program has a bug and erroneously jumped to a non-code
==19756==    location.  If you are running Memcheck and you just saw a
==19756==    warning about a bad jump, it's probably your program's fault.
==19756== 2. The instruction is legitimate but Valgrind doesn't handle it,
==19756==    i.e. it's Valgrind's fault.  If you think this is the case or
==19756==    you are not sure, please let us know and we'll try to fix it.
==19756== Either way, Valgrind will now raise a SIGILL signal which will
==19756== probably kill your program.
==19756==
==19756== Process terminating with default action of signal 4 (SIGILL): dumping core
==19756==  Illegal opcode at address 0x2064F3
==19756==    at 0x2064F3: cpuinfo_x86_freebsd_init (src/x86/freebsd/init.c:71)
==19756==    by 0x4873452: pthread_once (in /lib/libthr.so.3)
==19756==    by 0x2062E2: cpuinfo_initialize (src/init.c:28)
==19756==    by 0x205292: main (tools/cpu-info.c:291)
==19756==
==19756== HEAP SUMMARY:
==19756==     in use at exit: 2,288 bytes in 6 blocks
==19756==   total heap usage: 7 allocs, 1 frees, 2,802 bytes allocated
==19756==
==19756== LEAK SUMMARY:
==19756==    definitely lost: 128 bytes in 2 blocks
==19756==    indirectly lost: 0 bytes in 0 blocks
==19756==      possibly lost: 0 bytes in 0 blocks
==19756==    still reachable: 2,160 bytes in 4 blocks
==19756==         suppressed: 0 bytes in 0 blocks
==19756== Rerun with --leak-check=full to see details of leaked memory
==19756==
==19756== For lists of detected and suppressed errors, rerun with: -s
==19756== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
zsh: illegal hardware instruction  valgrind ./cpu-info

IMHO, this is not very important as the server is 10+ years old. But it would be nice to understand the reason for the crash ;) I'll try to test on other physical CPUs if I find something with FreeBSD installed.
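One possible explanation, not confirmed in this thread, can be checked from the data already posted. The faulting bytes reported by valgrind start with `0xC5`, which in 64-bit mode is the two-byte VEX prefix used by AVX instructions, and the third register of `CPUID 00000001` (ECX) reports AVX support in bit 28. The sketch below tests that bit for the three hosts, using the ECX values from the dumps in the first comment. The helper names are made up for this illustration.

```python
# Sketch: check the AVX feature bit (CPUID.1:ECX bit 28) for each host and
# test whether the faulting bytes from valgrind begin with a VEX prefix.
# ECX values are the third field of "CPUID 00000001" in the dumps above.

AVX_BIT = 1 << 28  # CPUID.1:ECX bit 28 = AVX

LEAF1_ECX = {
    "host-peter": 0x1F9AE3BF,
    "nashost": 0x0000641D,
    "srv1": 0x1FBEE3FF,
}

# "unhandled instruction bytes" as printed by valgrind on nashost.
FAULTING_BYTES = bytes([0xC5, 0xF8, 0x57, 0xC0, 0xC5, 0xFC, 0x29, 0x84, 0x24, 0x20])


def has_avx(ecx: int) -> bool:
    """True if the AVX feature bit is set in CPUID.1:ECX."""
    return bool(ecx & AVX_BIT)


def starts_with_vex(code: bytes) -> bool:
    """True if the first byte is a VEX prefix (0xC4 or 0xC5) in 64-bit code."""
    return len(code) > 0 and code[0] in (0xC4, 0xC5)


if __name__ == "__main__":
    for host, ecx in LEAF1_ECX.items():
        print(f"{host}: AVX {'yes' if has_avx(ecx) else 'no'}")
    print("faulting code is VEX-encoded:", starts_with_vex(FAULTING_BYTES))
```

By this check, host-peter and srv1 report AVX while nashost does not, and the faulting instruction is VEX-encoded. If the binary run on nashost was compiled with AVX code generation enabled (for example, built on a machine that supports AVX with host-specific optimization flags), that would produce exactly this kind of SIGILL. Rebuilding on nashost itself, or with a baseline x86-64 target, would be one way to test the hypothesis.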