sudkannan / likwid

Automatically exported from code.google.com/p/likwid
GNU General Public License v3.0

likwid-pin & cgroups don't work #176

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Create a cgroup cpuset with a limited number of cores (1 for example). In my 
case, a cpuset created by PBS.
2. Try to use likwid-pin

What is the expected output? What do you see instead?
ERROR: Core 0 on socket 0 not existing!

What version of the product are you using?
3.1.3

Please provide any additional information below.

If you try anything with likwid-pin, you always get 'ERROR: Core 0 on socket 0 
not existing!'. Only running likwid-pin without arguments gives the help.

Original issue reported on code.google.com by wpoel...@gmail.com on 3 Feb 2015 at 10:00

GoogleCodeExporter commented 9 years ago
I checked this issue for the current development version and found multiple 
problems:

The hwloc backend only supplies data such as the CPU model number when CPU 0 is 
contained in the cgroup's cpuset.
The cpuid backend works but reads the wrong number of HW threads, so the N 
affinity group in likwid-pin prints the active CPUs followed by unusable IDs 
for the remaining CPUs.

My suggestion is to get the number of CPUs from /proc/self/status and to fall 
back to cpuid if hwloc fails to retrieve data such as the CPU model number.
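
A minimal sketch of the /proc/self/status idea, written in Python for brevity rather than in LIKWID's own C code. It assumes the Linux `Cpus_allowed_list` field is the one to read; the function names are hypothetical and not part of LIKWID.

```python
def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-3,8,10-11' into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def allowed_cpus_from_status(status_text):
    """Return the CPUs named in the Cpus_allowed_list line, or [] if absent."""
    for line in status_text.splitlines():
        if line.startswith("Cpus_allowed_list:"):
            return parse_cpu_list(line.split(":", 1)[1])
    return []


def read_allowed_cpus(path="/proc/self/status"):
    """Read the status file of the calling process (Linux only)."""
    with open(path) as handle:
        return allowed_cpus_from_status(handle.read())
```

Inside a one-core cpuset, `len(read_allowed_cpus())` should then give 1 instead of the machine-wide HW thread count.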

I did not check version 3.1.3, but since it uses only the cpuid backend, 
comparable to the current development one, there will be similar errors. 
I attached the patch from the HPC UGent github repo.

Original comment by Thomas.R...@googlemail.com on 9 Feb 2015 at 3:08

GoogleCodeExporter commented 9 years ago
/proc/self/status would probably work, but can you not simply read the thread 
affinity mask? That would work in all POSIX cases?
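
For illustration, reading the affinity mask could look like the sketch below (Python, not LIKWID code). It assumes a Linux system: `sched_getaffinity` is, as far as I know, a Linux-specific call rather than a POSIX one, so the hypothetical helper returns None where it is unavailable.

```python
import os


def current_affinity(pid=0):
    """Return the sorted CPU IDs the given process may run on.

    pid=0 means the calling process. Returns None on platforms
    where os.sched_getaffinity does not exist.
    """
    if hasattr(os, "sched_getaffinity"):
        return sorted(os.sched_getaffinity(pid))
    return None
```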

Original comment by wpoel...@gmail.com on 9 Feb 2015 at 3:17

GoogleCodeExporter commented 9 years ago
There are multiple places where we can get the affinity mask; that is not the 
problem. Supporting cpusets needs bigger changes, because LIKWID currently makes 
some assumptions that do not hold inside a cpuset. A small example is the 
topology code, where we want to collect the topology info of the whole machine, 
not only of the part covered by the CPUs in the cpuset. In other 
cases we want the actual CPUs of the execution environment.

Original comment by Thomas.R...@googlemail.com on 10 Feb 2015 at 12:07

GoogleCodeExporter commented 9 years ago
This issue was closed by revision r482.

Original comment by Thomas.R...@googlemail.com on 13 Feb 2015 at 3:26

GoogleCodeExporter commented 9 years ago
I implemented better cgroup handling for LIKWID. The most problematic issue 
was that neither hwloc nor cpuid can read the system topology of the whole 
machine from inside a cpuset. Therefore I wrote a new interface that gets all 
information from procfs/sysfs. For the affinity system, only the CPUs that are 
part of the current cpuset are added to the domains, so some affinity domains 
may now contain no CPUs. Since LIKWID now only knows the CPUs in the cpuset, 
no changes to the pinning library are needed.
The topology output code does not mark the CPUs that are present in the cgroup, 
but this can easily be done by appending a '*' or printing them in color.
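
A rough sketch of the filtering and the '*' marking described above, in Python rather than LIKWID's C. The domain names, the dictionary layout and both function names are assumptions for illustration only, not the actual implementation from r482.

```python
def restrict_domains(domains, allowed):
    """Keep only the CPUs of each affinity domain that are in the cpuset.

    domains maps a domain name to the CPU IDs of the whole machine;
    a domain with no CPU in the cpuset ends up with an empty list.
    """
    allowed = set(allowed)
    return {name: [c for c in cpus if c in allowed]
            for name, cpus in domains.items()}


def format_topology(domains, allowed, marker="*"):
    """Print-ready topology lines with cpuset members marked by marker."""
    allowed = set(allowed)
    lines = []
    for name, cpus in domains.items():
        cells = " ".join(f"{c}{marker}" if c in allowed else str(c)
                         for c in cpus)
        lines.append(f"{name}: {cells}")
    return "\n".join(lines)
```

With two hypothetical socket domains `{"S0": [0, 1, 2, 3], "S1": [4, 5, 6, 7]}` and a cpuset of CPUs 1 and 2, `restrict_domains` leaves `S1` empty, which is the "domain without CPUs" case mentioned above.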

Original comment by Thomas.R...@googlemail.com on 13 Feb 2015 at 3:33