sib-swiss / pftools3

A suite of tools to build and search generalized profiles
GNU General Public License v2.0

Detecting the number of available cores #20

Closed duboism closed 2 years ago

duboism commented 3 years ago

Hello,

We use pftools as bundled in ebi-pf-team/interproscan.

A few weeks ago we started to have issues when using it in a large computing center: pfsearchV3 stopped immediately with the message Error setting affinity!. The system team discovered that pfsearchV3 tries to use more cores than are available. This happens because pftools uses _SC_NPROCESSORS_CONF to detect the number of cores, but on this cluster the number of available cores (_SC_NPROCESSORS_ONLN) is lower than _SC_NPROCESSORS_CONF (I don't know yet if this is normal). We circumvented the problem by recompiling pftools with -DUSE_AFFINITY=OFF.

However, I was wondering if pftools should use _SC_NPROCESSORS_ONLN to detect the number of available cores at runtime.

smoretti commented 3 years ago

Hi

I am not sure it will be easy for us to change this behavior at this time. It will require a lot of tests on different CPU and cluster types.

Alternatively, at runtime, you can use the following options:

--no-affinity                           : disable CPU affinity file
--thread-affinity                       : file containing thread mask, one row per thread

duboism commented 3 years ago

Hello,

I forgot to mention that --no-affinity and --thread-affinity didn't solve the problem (if I understand the code correctly it's because the code still uses the incorrect number of processors). So we had to recompile pftools.

As for the modification, I think it is simply a matter of changing one line: https://github.com/sib-swiss/pftools3/blob/0ab4a3753d7f2146a78ff590cc730ac37a30d9ac/src/C/utils/system.c#L911 Using _SC_NPROCESSORS_ONLN should do the trick. I think it should not change anything for most systems (where _SC_NPROCESSORS_CONF = _SC_NPROCESSORS_ONLN), and performance should be the same.

smoretti commented 3 years ago

I will talk about that with the boss when we are both back from holidays.

We will let you know.

tschuepb commented 3 years ago

At the time of writing I was not aware of _SC_NPROCESSORS_ONLN. I need to go back and check how I use nOverallCores within the affinity loops. This is to the best I can remember, so be nice. You are the first to limit the number of available processors and it could well screw the affinity stuff as I query the bios while running on each core. I suspect that enforcing the use of an offline core might cause a deadly issue. Would you know the type of kernel message you get: SIGFAULT, SIGKILL, ... Thank you. Thierry Schuepbach

tschuepb commented 3 years ago

Again at the time of writing, I used to bundle a SystemInfo within my code, have you run it and what does it give?

tschuepb commented 3 years ago

I looked deeper and indeed, when affinity is enabled at compile time, I go over all cores according to nOverallProcs and query the BIOS. According to this, the software should not crash. However, it will be an issue, as the affinity mask will be replaced by "go everywhere allowed", which is not ideal. If you could provide me with the output of SystemInfo and tell me more about your infrastructure, we could potentially find a fix. In the meantime, I advise you to continue disabling AFFINITY at compile time. Best regards, Thierry Schuepbach

duboism commented 3 years ago

> This is to the best I can remember, so be nice.

Don't worry, I will be nice ;). The issue is a bit esoteric and seems related to the peculiar configuration at the computing center I use.

> Would you know the type of kernel message you get: SIGFAULT, SIGKILL, ...

AFAIU, there is no signal; sched_setaffinity returns EINVAL (see here) and pfsearchV3 exits with return value 1.

duboism commented 3 years ago

> Again at the time of writing, I used to bundle a SystemInfo within my code, have you run it and what does it give?

I don't understand what SystemInfo refers to in your message. I found nothing like it in the sources or on Google (except for Windows, which is not applicable).

ynanyam commented 3 years ago

Just chiming in to say that we see the same behavior on our cluster nodes. The workaround we suggest is to request the whole node.

smoretti commented 3 years ago

Hi @duboism and @ynanyam

I have updated pftools v3 to version 3.2.7 incorporating your suggestion about _SC_NPROCESSORS_ONLN

Does it fix your cluster issue?

duboism commented 3 years ago

Hello,

Sorry for the late reply.

I did the following: compiled v3.2.7 (without any option) and launched pfsearchV3 (without any option). It works in the sense that the help is displayed.

With the same procedure on v3.2.6, pfsearchV3 crashes with the previous error.

So AFAIAC, the problem is solved. Did you have time to benchmark this? Sorry I couldn't help you more on this.

smoretti commented 3 years ago

I have not fully benchmarked it, but my tests agree with yours.

marcadella commented 2 years ago

Hi, I am currently using pftools 3.2.8 and I actually have the same issue: Error setting affinity! With small datasets it works fine (so just running pfsearchV3 without input file works fine), but for large datasets, I get the affinity issue.

smoretti commented 2 years ago

You can try compiling pftoolsV3 without affinity, as shown below. It should be fixed in the develop branch now; see #25.

mkdir build && cd build/
cmake .. -DUSE_AFFINITY=OFF

"Old" machines and the pftoolsV3 affinity code are sometimes not very compatible.

marcadella commented 2 years ago

It looks like master doesn't compile:

In file included from /home/marcadella/test/pftools3/src/C/utils/output.c:14:0:
/home/marcadella/test/pftools3/src/C/utils/../include/pfSequence.h:50:18: error: field ‘LastModification’ has incomplete type
  struct timespec LastModification;
                  ^~~~~~~~~~~~~~~~
src/C/utils/CMakeFiles/OUTPUT_FORMAT.dir/build.make:62: recipe for target 'src/C/utils/CMakeFiles/OUTPUT_FORMAT.dir/output.c.o' failed
make[2]: *** [src/C/utils/CMakeFiles/OUTPUT_FORMAT.dir/output.c.o] Error 1
CMakeFiles/Makefile2:1819: recipe for target 'src/C/utils/CMakeFiles/OUTPUT_FORMAT.dir/all' failed
make[1]: *** [src/C/utils/CMakeFiles/OUTPUT_FORMAT.dir/all] Error 2
Makefile:162: recipe for target 'all' failed
make: *** [all] Error 2

smoretti commented 2 years ago

This is "expected": the patch is in the develop branch.

marcadella commented 2 years ago

Yep, it works with the fix from #25