open-mpi / hwloc

Hardware locality (hwloc)
https://www.open-mpi.org/projects/hwloc

Total memory reported by hwloc_topology_* API call is lower than actual total memory on Redhat 8.3 #445

Closed runqch closed 3 years ago

runqch commented 3 years ago

What version of hwloc are you using?

1.11.8

Which operating system and hardware are you running on?

Red Hat Enterprise Linux release 8.3 (Ootpa)

Details of the problem

Total memory reported by the hwloc_topology_* API is lower than the actual total memory on Red Hat 8.3.
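For reference, this is roughly how we read the value (a minimal sketch against the hwloc 1.x API we are using; in hwloc 2.x the counter moved to obj->total_memory). Built with: gcc total.c -lhwloc

/* Minimal sketch: query the machine-wide total memory through hwloc 1.x. */
#include <stdio.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topology;

    hwloc_topology_init(&topology);
    hwloc_topology_load(topology);

    /* The root (Machine) object carries the accumulated memory of all
     * NUMA nodes in memory.total_memory, expressed in bytes. */
    hwloc_obj_t root = hwloc_get_root_obj(topology);
    printf("total_memory = %llu bytes (%.1f GB)\n",
           (unsigned long long) root->memory.total_memory,
           root->memory.total_memory / (1024.0 * 1024.0 * 1024.0));

    hwloc_topology_destroy(topology);
    return 0;
}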

All nodes have 2x24-core Skylake CPUs with 768GB of total memory; /proc/meminfo and top report the following:

[root@e10u13 ~]# cat /proc/meminfo
MemTotal:       790747444 kB            <=== total memory
MemFree:        786205440 kB
MemAvailable:   783940964 kB
Buffers:            4964 kB
Cached:           575144 kB
SwapCached:          412 kB

[root@e10u13 ~]# top
top - 11:40:52 up 7 days, 22:12,  1 user,  load average: 0.17, 0.21, 0.18
Tasks: 1007 total,   1 running, 1006 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem : 772214.3 total, 767779.5 free,   3294.6 used,   1140.3 buff/cache
MiB Swap:   4096.0 total,   4045.5 free,     50.5 used. 765567.1 avail Mem 

But from the hwloc_topology_* API (or from hwloc-ls), we only get 715GB, which is lower than the actual memory:

[root@e10u13 ~]# hwloc-ls
Machine (715GB total)            <=== total memory reported by hwloc is lower than actual, why ??
  NUMANode L#0 (P#0 359GB)
    Package L#0 + L3 L#0 (33MB)
      L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#48)
      L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#2)
        PU L#3 (P#50)
        ......
      L2 L#23 (1024KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23
        PU L#46 (P#46)
        PU L#47 (P#94)
    HostBridge L#0
      PCI 8086:a1d2
      PCI 8086:a182
      PCIBridge
        PCIBridge
          PCI 102b:0536
    HostBridge L#3
      PCIBridge
        PCI 1000:005f
          Block(Disk) L#0 "sda"
          Block(Disk) L#1 "sdb"
      PCIBridge
        PCI 15b3:1015
          Net L#2 "enp25s0f0"
          OpenFabrics L#3 "mlx5_0"
        PCI 15b3:1015
          Net L#4 "enp25s0f1"
          OpenFabrics L#5 "mlx5_1"
  NUMANode L#1 (P#1 356GB) + Package L#1 + L3 L#1 (33MB)
    L2 L#24 (1024KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24
      PU L#48 (P#1)
      PU L#49 (P#49)
      ......
    L2 L#47 (1024KB) + L1d L#47 (32KB) + L1i L#47 (32KB) + Core L#47
      PU L#94 (P#47)
      PU L#95 (P#95)
  Misc(MemoryModule)
  ...
  Misc(MemoryModule)

Additional information

[root@e10u13 ~]# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 96
On-line CPU(s) list: 0-95
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
Stepping: 4
CPU MHz: 2255.003
CPU max MHz: 3700.0000
CPU min MHz: 1200.0000
BogoMIPS: 5400.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 33792K
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87,89,91,93,95
bgoglin commented 3 years ago

The way it works is: hwloc reads NUMA node size from sysfs:

$ grep MemTotal /sys/devices/system/node/node*/meminfo
Node 0 MemTotal:       15753556 kB

Then it converts the value to bytes by multiplying by 1024, and accumulates all NUMA node memory into the machine's "total memory" in bytes. When lstopo prints it, it divides by 1024^3 to get GB. So the first thing to do here is to compare the sysfs NUMA node sizes with 359GB and 356GB.
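A hypothetical stand-alone sketch of that accumulation (not hwloc's actual source, just the same arithmetic applied to the sysfs files) looks like this:

/* Sum the "Node X MemTotal" lines from sysfs, convert kB -> bytes (*1024),
 * and print GB (/1024^3) the way lstopo does. Hypothetical sketch, not hwloc code. */
#include <glob.h>
#include <stdio.h>

int main(void)
{
    glob_t g;
    unsigned long long total_bytes = 0;
    size_t i;

    /* Same files hwloc parses for NUMA node sizes. */
    if (glob("/sys/devices/system/node/node*/meminfo", 0, NULL, &g) != 0)
        return 1;

    for (i = 0; i < g.gl_pathc; i++) {
        FILE *f = fopen(g.gl_pathv[i], "r");
        char line[256];
        unsigned long long kb;
        unsigned node;

        if (!f)
            continue;
        while (fgets(line, sizeof(line), f)) {
            if (sscanf(line, "Node %u MemTotal: %llu kB", &node, &kb) == 2) {
                unsigned long long bytes = kb * 1024ULL;      /* kB -> bytes */
                printf("node%u: %.2f GB\n", node,
                       bytes / (1024.0 * 1024.0 * 1024.0));   /* GB as lstopo prints it */
                total_bytes += bytes;
                break;
            }
        }
        fclose(f);
    }
    globfree(&g);

    printf("total: %.2f GB\n", total_bytes / (1024.0 * 1024.0 * 1024.0));
    return 0;
}

Comparing its per-node output against the 359GB/356GB printed by hwloc-ls is exactly the check suggested above.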

StevenHwanghh commented 3 years ago

The output of hwloc-ls is almost consistent with the result of MemTotal.

#grep MemTotal /sys/devices/system/node/node*/meminfo
/sys/devices/system/node/node0/meminfo:Node 0 MemTotal:       375262204 kB
/sys/devices/system/node/node1/meminfo:Node 1 MemTotal:       375638744 kB

          hwloc-ls                  node*/meminfo
Node 0:    359G                       357G
Node 1:    356G                       357G

There is a gap of about 50GB (768 - 715) between the physical memory and the reported total.

bgoglin commented 3 years ago

It seems that the question is actually why /proc/meminfo says MemTotal=790747444 instead of 375262204 + 375638744 = 750900948. I don't see any such difference on my machines. I know all these values can change slightly after a reboot, but not by 50GB.
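A quick hypothetical check to reproduce that comparison on the affected node (plain /proc and sysfs reads, nothing hwloc-specific):

/* Compare the machine-wide MemTotal from /proc/meminfo with the sum of the
 * per-NUMA-node MemTotal values from sysfs, all in kB. Hypothetical sketch. */
#include <glob.h>
#include <stdio.h>

int main(void)
{
    unsigned long long machine_kb = 0, node_sum_kb = 0, kb;
    char line[256];
    glob_t g;
    FILE *f;
    size_t i;

    /* Machine-wide total. */
    f = fopen("/proc/meminfo", "r");
    if (f) {
        while (fgets(line, sizeof(line), f)) {
            if (sscanf(line, "MemTotal: %llu kB", &machine_kb) == 1)
                break;
        }
        fclose(f);
    }

    /* Sum of per-NUMA-node totals. */
    if (glob("/sys/devices/system/node/node*/meminfo", 0, NULL, &g) == 0) {
        for (i = 0; i < g.gl_pathc; i++) {
            f = fopen(g.gl_pathv[i], "r");
            if (!f)
                continue;
            while (fgets(line, sizeof(line), f)) {
                if (sscanf(line, "Node %*u MemTotal: %llu kB", &kb) == 1) {
                    node_sum_kb += kb;
                    break;
                }
            }
            fclose(f);
        }
        globfree(&g);
    }

    printf("/proc/meminfo MemTotal: %llu kB\n", machine_kb);
    printf("sum of node MemTotal:   %llu kB\n", node_sum_kb);
    printf("difference:             %lld kB\n",
           (long long) (machine_kb - node_sum_kb));
    return 0;
}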

The other small differences (357 vs 359G) are likely related to dividing by 1024 instead of 1000, or to "kB" meaning kiB or kB depending on where we're looking.

runqch commented 3 years ago

This issue occurs on Red Hat 8.3 machines. On Red Hat 7.*, there is no such problem. Do you have any clue what could lead to such a big gap? Has hwloc 1.11.8 ever been verified on Red Hat 8.3 machines? @bgoglin

bgoglin commented 3 years ago

From what I see in some discussions about the Linux kernel, some pages are "reserved" by the kernel for "init" data. Those are removed from the NUMA node available memory (because that's where memory accounting really occurs) but not from the total machine memory (because this amount isn't very useful in the kernel code, from what I understand). Maybe things changed in the kernel between RHEL7 and RHEL8, but I don't think it matters much anyway. The entire machine memory isn't available in any case, since the kernel allocates its own things. These MemTotal fields are only a very vague indicator of what applications may allocate if they are alone on the machine.

hwloc 1.11.8 does nothing special about this. This code has been the same trivial code explained above from 1.0 (10 years ago) up to the latest 2.5; it's not going to report anything different now unless the kernel changes.

I am closing this issue since there's no bug in hwloc, only strange kernel behavior, but we can continue discussing if you wish. https://toroid.org/linux-physical-memory seems to discuss related things. It looks like information about memory disappearing like this shows up in the early kernel boot log. The entire /proc/meminfo and the entire /sys/devices/system/node/node*/memory may help too.

runqch commented 3 years ago

Thanks for providing so much help on this issue. Really appreciated.