Closed duboism closed 2 years ago
Hi
Not sure it will be easy for us to change this behavior at this time. It would require a lot of tests on different CPU and cluster types.
Alternatively, at runtime, you can use the following options:
--no-affinity : disable CPU affinity file
--thread-affinity : file containing thread mask, one row per thread
Hello,
I forgot to mention that --no-affinity and --thread-affinity didn't solve the problem (if I understand the code correctly, it's because the code still uses the incorrect number of processors). So we had to recompile pftools.
As for the modification, I think it's simply changing one line:
https://github.com/sib-swiss/pftools3/blob/0ab4a3753d7f2146a78ff590cc730ac37a30d9ac/src/C/utils/system.c#L911
Using _SC_NPROCESSORS_ONLN should do the trick. I think that it should not change anything for most systems (where _SC_NPROCESSORS_CONF = _SC_NPROCESSORS_ONLN) and performance should be the same.
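To make the difference between the two values concrete, here is a minimal sketch that queries both through sysconf. The helper names configured_cpu_count and online_cpu_count are hypothetical and are not taken from the pftools sources; only the two sysconf constants come from this thread.

```c
#include <unistd.h>

/* Hypothetical helper (not from pftools): number of processors configured
   on the machine, which may include processors that are currently offline. */
long configured_cpu_count(void)
{
    return sysconf(_SC_NPROCESSORS_CONF);
}

/* Hypothetical helper (not from pftools): number of processors that are
   currently online. */
long online_cpu_count(void)
{
    return sysconf(_SC_NPROCESSORS_ONLN);
}
```

On a machine where both helpers return the same value, switching from one constant to the other should make no difference; on a cluster like the one described in this issue, the second value would be the smaller one.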
I will talk about that with the boss when we are both back from holidays.
We will let you know
At the time of writing I was not aware of _SC_NPROCESSORS_ONLN. I need to go back and check how I use nOverallCores within the affinity loops. This is to the best I can remember, so be nice. You are the first to limit the number of available processors and it could well screw the affinity stuff as I query the bios while running on each core. I suspect that enforcing the use of an offline core might cause a deadly issue. Would you know the type of kernel message you get: SIGFAULT, SIGKILL, ... Thank you. Thierry Schuepbach
Again at the time of writing, I used to bundle a SystemInfo within my code, have you run it and what does it give?
I looked deeper and indeed, when affinity is enabled at compile time, I go over all cores according to nOverallProcs and query the BIOS. According to this, the software should not crash. However, it will be an issue, as the affinity mask will be replaced by "go everywhere allowed", which is not ideal. If you could provide me with the output of SystemInfo and tell me more about your infrastructure, we could potentially find a fix. In the meantime, I advise you to continue disabling AFFINITY at compile time. Best regards, Thierry Schuepbach
This is to the best I can remember, so be nice.
Don't worry, I will be nice ;). The issue is a bit esoteric and seems related to the peculiar configuration at the computing center I use.
Would you know the type of kernel message you get: SIGFAULT, SIGKILL, ...
AFAIU, there is no signal; sched_setaffinity returns EINVAL (see here) and pfsearchV3 exits with return value 1.
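The failure mode can be reproduced in isolation. The sketch below is Linux-specific and assumes glibc; the helpers try_pin_to_cpu and first_allowed_cpu are hypothetical names, not functions from pftools. It shows that sched_setaffinity reports EINVAL through errno, with no signal involved, when the requested mask contains no CPU the process may run on.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>

/* Hypothetical helper: lowest-numbered CPU the calling thread is allowed
   to run on, or -1 if the affinity mask cannot be read. */
int first_allowed_cpu(void)
{
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &allowed))
            return cpu;
    return -1;
}

/* Hypothetical helper: try to pin the calling thread to a single CPU.
   Returns 0 on success, or the errno value set by sched_setaffinity. */
int try_pin_to_cpu(int cpu)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (cpu >= 0 && cpu < CPU_SETSIZE)
        CPU_SET(cpu, &mask);
    /* A mask that is empty, or that names only CPUs unavailable to this
       process, is rejected with EINVAL. */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
        return errno;
    return 0;
}
```

Calling try_pin_to_cpu with an index outside the valid range leaves the mask empty, so the call is expected to return EINVAL, while pinning to the value returned by first_allowed_cpu should normally succeed.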
Again at the time of writing, I used to bundle a SystemInfo within my code, have you run it and what does it give?
I don't understand what SystemInfo is in your message. I have found nothing like this in the sources, nor using Google (except for Windows, which is not applicable).
Just chiming in to say that we see the same behavior on our cluster nodes. The workaround we suggest is to request the whole node.
Hi @duboism and @ynanyam
I have updated pftools v3 to version 3.2.7, incorporating your suggestion about _SC_NPROCESSORS_ONLN.
Does it fix your cluster issue?
Hello,
Sorry for the late reply.
I did the following: compiled v3.2.7 (without any option) and launched pfsearchV3 (without any option). It works in the sense that the help is displayed.
Same procedure with v3.2.6, but pfsearchV3 crashes with the previous error.
So AFAIAC, the problem is solved. Did you have time to benchmark this? Sorry I couldn't help you more on this.
I have not fully benchmarked it, but my tests agree with yours.
Hi, I am currently using pftools 3.2.8 and I actually have the same issue: Error setting affinity!
With small datasets it works fine (so just running pfsearchV3 without an input file works fine), but for large datasets, I get the affinity issue.
So you can try compiling pftoolsV3 without affinity.
It should be fixed in the develop branch now, see #25
mkdir build && cd build/
cmake .. -DUSE_AFFINITY=OFF
"old" machines and pftoolsV3 code for affinity are sometimes not very compatible
Looks like master doesn't compile
In file included from /home/marcadella/test/pftools3/src/C/utils/output.c:14:0:
/home/marcadella/test/pftools3/src/C/utils/../include/pfSequence.h:50:18: error: field ‘LastModification’ has incomplete type
struct timespec LastModification;
^~~~~~~~~~~~~~~~
src/C/utils/CMakeFiles/OUTPUT_FORMAT.dir/build.make:62: recipe for target 'src/C/utils/CMakeFiles/OUTPUT_FORMAT.dir/output.c.o' failed
make[2]: *** [src/C/utils/CMakeFiles/OUTPUT_FORMAT.dir/output.c.o] Error 1
CMakeFiles/Makefile2:1819: recipe for target 'src/C/utils/CMakeFiles/OUTPUT_FORMAT.dir/all' failed
make[1]: *** [src/C/utils/CMakeFiles/OUTPUT_FORMAT.dir/all] Error 2
Makefile:162: recipe for target 'all' failed
make: *** [all] Error 2
This is "expected", the patch is in the develop branch
Yep, it works with the fix from #25
Hello,
We use pftools as bundled in ebi-pf-team/interproscan. A few weeks ago we started to have issues when using it in a large computing center: pfsearchV3 stopped immediately with the message "Error setting affinity!". The system team discovered that pfsearchV3 tries to use more cores than are available. This is caused by the fact that pftools uses _SC_NPROCESSORS_CONF to detect the number of cores, but on this cluster the number of available cores (_SC_NPROCESSORS_ONLN) is lower than _SC_NPROCESSORS_CONF (I don't know yet if this is normal). We circumvented the problem by recompiling pftools with -DUSE_AFFINITY=OFF.
However, I was wondering if pftools should use _SC_NPROCESSORS_ONLN to detect the number of available cores at runtime.