jaypipes / ghw

Go HardWare discovery/inspection library
Apache License 2.0

add topology option #354

Open xrmzju opened 8 months ago

xrmzju commented 8 months ago
  1. Add topology options DisableNodeCaches, DisableNodeAreas, and DisableNodeDistances, which skip the corresponding data when collecting topology.
  2. Propagate the error to ctx when collecting topologyNodes so that the caller knows about it.
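
For readers following along, here is a minimal sketch of how such opt-out flags could gate collection. It is not the code from this PR and does not use ghw's real types: `TopologyOptions`, `Node`, and `collectNode` are hypothetical stand-ins, and the collected values are dummy data.

```go
package main

import "fmt"

// TopologyOptions is a hypothetical options struct. The field names mirror
// the options proposed in this PR; the struct itself is an assumption.
type TopologyOptions struct {
	DisableNodeCaches    bool
	DisableNodeAreas     bool
	DisableNodeDistances bool
}

// Node is a simplified stand-in for a NUMA node, not ghw's actual type.
type Node struct {
	ID        int
	Cores     []int
	Caches    []string
	Areas     []string
	Distances []int
}

// collectNode always fills in the CPU cores and fills in the remaining
// fields only when they have not been disabled. All values are dummy data.
func collectNode(id int, opts TopologyOptions) Node {
	n := Node{ID: id, Cores: []int{id * 2, id*2 + 1}}
	if !opts.DisableNodeCaches {
		n.Caches = []string{"L1d", "L2", "L3"}
	}
	if !opts.DisableNodeAreas {
		n.Areas = []string{"area0"}
	}
	if !opts.DisableNodeDistances {
		n.Distances = []int{10, 20}
	}
	return n
}

func main() {
	// Skip only the memory areas, the part that fails in the reported case.
	n := collectNode(0, TopologyOptions{DisableNodeAreas: true})
	fmt.Printf("cores=%v caches=%v areas=%v distances=%v\n",
		n.Cores, n.Caches, n.Areas, n.Distances)
}
```

With this shape the zero value of the options struct collects everything, so existing callers would keep their current behavior.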

jaypipes commented 8 months ago

@xrmzju Hi! Thanks for your submission! Can you tell me a little more about why you want to disable collection of NUMA topology information?

xrmzju commented 8 months ago

> some scenario the collection of topology data fails

Yes, your guess is correct. Collecting the memory areas failed during one topology collection, but no error was raised. As a result, my program continued to run with incorrect topology information. In my scenario, I only need the CPU NUMA topology information; I do not need NodeCaches, NodeAreas, or NodeDistances. So I made the modification above.
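
To illustrate the silent-failure problem, here is a minimal sketch of a collector that keeps the partial result but also returns the errors it hit, so the caller can decide whether the data is usable. This is not ghw code: `Node`, `readAreas`, and `collectNodes` are hypothetical, the failure for node 1 is simulated, and `errors.Join` assumes Go 1.20 or newer.

```go
package main

import (
	"errors"
	"fmt"
)

// Node is a simplified stand-in for a NUMA node, not ghw's actual type.
type Node struct {
	ID    int
	Cores []int
	Areas []string
}

// readAreas is a hypothetical stand-in for reading memory-area data.
// It fails for node 1 to simulate the failure described above.
func readAreas(id int) ([]string, error) {
	if id == 1 {
		return nil, fmt.Errorf("node %d: cannot read memory areas", id)
	}
	return []string{"area0"}, nil
}

// collectNodes gathers what it can for every node and returns all
// per-node errors joined into one, instead of dropping them.
func collectNodes(count int) ([]Node, error) {
	var errs []error
	nodes := make([]Node, 0, count)
	for id := 0; id < count; id++ {
		n := Node{ID: id, Cores: []int{id}}
		areas, err := readAreas(id)
		if err != nil {
			errs = append(errs, err)
		}
		n.Areas = areas
		nodes = append(nodes, n)
	}
	// errors.Join returns nil when no errors were recorded.
	return nodes, errors.Join(errs...)
}

func main() {
	nodes, err := collectNodes(2)
	if err != nil {
		fmt.Println("topology is incomplete:", err)
	}
	fmt.Println("nodes collected:", len(nodes))
}
```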

ffromani commented 8 months ago

Thanks for clarifying. Could you please share a description of the hardware on which the collection fails? E.g., was it a regular NUMA x86 machine? Perhaps you were using (relatively) new technology like CXL? Or was it Arm?

In general I'm reluctant to add such fine-grained control over the collection of information: it adds too many knobs and makes the code less regular. So I'd like to learn more about the use case.

xrmzju commented 8 months ago

[Image: screenshot of the failure message]

Here is the failure message. The operating system we are using was built by our internal team, so it may have some bugs or unusual behavior. However, in my current scenario I do not need any memory area information, so I have been looking for a way to skip collecting it. I am happy to add coarser-grained control over what is collected instead, perhaps an option like CPUTopologyOnly that disables collecting NodeCaches and NodeAreas?
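
As a rough sketch of the coarser alternative, a single functional option could switch off several parts of the collection at once. Everything here is hypothetical, including `config`, `Option`, `newConfig`, and the exact behavior of `CPUTopologyOnly`. Following the description above, it disables caches and areas and leaves distances enabled, which is an assumption.

```go
package main

import "fmt"

// config records which parts of the topology to collect.
// It is a hypothetical structure, not part of ghw.
type config struct {
	caches    bool
	areas     bool
	distances bool
}

// Option mutates the collection config.
type Option func(*config)

// CPUTopologyOnly is the hypothetical coarse option suggested above:
// it turns off cache and memory-area collection in one step.
func CPUTopologyOnly() Option {
	return func(c *config) {
		c.caches = false
		c.areas = false
	}
}

// newConfig starts with everything enabled and applies the options in order.
func newConfig(opts ...Option) config {
	c := config{caches: true, areas: true, distances: true}
	for _, opt := range opts {
		opt(&c)
	}
	return c
}

func main() {
	fmt.Printf("default:  %+v\n", newConfig())
	fmt.Printf("cpu only: %+v\n", newConfig(CPUTopologyOnly()))
}
```

A single preset like this keeps the number of knobs small, at the cost of not being able to skip only one kind of data.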

ffromani commented 8 months ago

@xrmzju thanks for sharing. I'll try to think about a more generic solution. I'll get back ASAP.