open-mpi / hwloc

Hardware locality (hwloc)
https://www.open-mpi.org/projects/hwloc

reduce the number of open() syscalls getting ENOENT from nonexistent caches in sysfs #434

Open bgoglin opened 3 years ago

bgoglin commented 3 years ago

We currently try to open /sys/devices/system/cpu/cpuX/cache/indexY/shared_cpu_map for every PU and Y between 0 and 9. That's usually 6 useless syscalls per PU since most CPUs have 4 caches per PU. That's almost 1ms per PU.
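To make the cost concrete, here is a minimal sketch of that probing loop. It is not the actual code from topology-linux.c: the helper name `probe_cache_indexes` and its parameters are hypothetical, and the CPU directory is passed in so it can point at a copy of sysfs rather than the live one.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper, not hwloc's real code: probe
 * <cpudir>/cache/index<Y>/shared_cpu_map for Y in [0, max_index).
 * Returns the number of caches found and stores the number of
 * failed open() calls in *failed. */
static int probe_cache_indexes(const char *cpudir, int max_index, int *failed)
{
    char path[512];
    int found = 0;

    *failed = 0;
    for (int y = 0; y < max_index; y++) {
        snprintf(path, sizeof(path), "%s/cache/index%d/shared_cpu_map", cpudir, y);
        int fd = open(path, O_RDONLY);
        if (fd < 0) {
            /* Typically ENOENT: one wasted syscall. We keep looping
             * because a missing index does not prove later ones are missing. */
            (*failed)++;
            continue;
        }
        found++;
        close(fd);
    }
    return found;
}
```

With 4 caches present and `max_index` set to 10, this performs 6 failing opens per PU; lowering `max_index` to 5 leaves a single failing open.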

Linux numbers caches from 0 to N-1 internally, but some of them might get skipped when added to sysfs for some reason (see cache_add_dev() in drivers/base/cacheinfo.c). That means we cannot simply break out of the loop at the first missing index, even though index4 is usually the first one missing.

Doing stat on the parent directory might be a good way to find out the total number of indexY subdirectories. That would mean one syscall to avoid 6 syscalls. However btrfs (for fsroot regression tests) has some issues with nlink being wrong (see comments in topology-linux.c).
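A rough sketch of the stat() idea, assuming every subdirectory of the cache directory is an indexY directory and that the filesystem maintains the conventional directory link count (2 plus one per subdirectory). The helper name `count_subdirs_from_nlink` is hypothetical; it returns -1 when the link count cannot be trusted, so the caller would have to fall back to probing.

```c
#include <sys/stat.h>

/* Hypothetical helper: estimate the number of subdirectories of `dir`
 * from its hard-link count. A directory conventionally has 2 links
 * ("." and its entry in the parent) plus one per subdirectory ("..").
 * Returns -1 if stat() fails, if `dir` is not a directory, or if
 * st_nlink is below 2, which is how an unreliable count (as mentioned
 * for btrfs) would show up here. */
static int count_subdirs_from_nlink(const char *dir)
{
    struct stat st;

    if (stat(dir, &st) < 0)
        return -1;
    if (!S_ISDIR(st.st_mode) || st.st_nlink < 2)
        return -1;
    return (int)(st.st_nlink - 2);
}
```

This gives an upper bound on the loop, not the list of indexes: if the kernel skipped an index, the count alone does not say which one.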

Reducing to 5 instead of 9 is likely a good start for now. Most current CPUs have 4 caches in sysfs. There are some L4 out there but I have never seen those in sysfs since they are rather outside of the CPUs. Itanium had 5 caches (L2i and L2d) but it's dead. So 5 works fine and gives us one free slot in case newer CPUs bring an additional level.

sthibaul commented 3 years ago

Perhaps using opendir() to get the actual list could be more efficient, even though it adds an (n+1)th call? Even with a large directory, that ends up with only one getdents64() system call.
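For illustration, a minimal sketch of the directory-listing approach. The helper name `list_cache_indexes` and its parameters are hypothetical, not hwloc API; it only collects which indexY entries exist so that the caller opens shared_cpu_map for those and nothing else.

```c
#include <dirent.h>
#include <stdio.h>

/* Hypothetical helper: list <cpudir>/cache and store the Y of every
 * "index<Y>" entry into indexes[] (at most `max` values, in directory
 * order, which is not necessarily sorted).
 * Returns the number of indexes stored, or -1 if opendir() fails. */
static int list_cache_indexes(const char *cpudir, int *indexes, int max)
{
    char path[512];
    struct dirent *ent;
    int n = 0;

    snprintf(path, sizeof(path), "%s/cache", cpudir);
    DIR *dir = opendir(path);
    if (!dir)
        return -1;

    while (n < max && (ent = readdir(dir)) != NULL) {
        int y;
        char trailing;
        /* Accept only names of the exact form "index<number>":
         * sscanf returns 1 when the number matched and nothing follows. */
        if (sscanf(ent->d_name, "index%d%c", &y, &trailing) == 1 && y >= 0)
            indexes[n++] = y;
    }
    closedir(dir);
    return n;
}
```

Unlike a fixed loop bound, this handles gaps in the numbering naturally, at the price of the opendir/readdir/closedir calls themselves.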

xWuWux commented 11 months ago

The easiest solution would be to reduce the number of iterations, and using the opendir() function for efficient directory listing is a promising approach. It would reduce unnecessary syscalls and improve the performance of Open MPI's cache information retrieval process on Linux.

bgoglin commented 11 months ago

I did a quick test. We actually get more syscalls using opendir. Instead of having one useless openat() for each of the 6 non-existing caches (those failing openat() calls are likely very cheap), opendir+readdir+closedir uses 7 syscalls (openat + newfstatat + 2 fcntl + 2 getdents + close). That's for each core.

If you want to play with it, the code is in PR #629. There will be a tarball at https://ci.inria.fr/hwloc/job/basic/job/PR-629/ soon.