shibatch / sleef

SIMD Library for Evaluating Elementary Functions, vectorized libm and DFT
https://sleef.org
Boost Software License 1.0
628 stars 126 forks source link

S/390: Use getauxval for detecting VXE2 to fix #560 #561

Closed Andreas-Krebbel closed 3 weeks ago

Andreas-Krebbel commented 1 month ago

Change the detection mechanism for the VXE2 feature from signal handling to getauxval. This is a bit simpler and also helps in multi-threaded uses of the library as in PyTorch: https://github.com/pytorch/pytorch/issues/128503

In particular it prevents the sigjmp file scope variable from being accessed from multiple threads. This used to cause crashes with the way PyTorch uses Sleef in combination with OpenMP. In that case cpuSupportsExt is executed simultaneously by multiple threads what led to parallel writers to sigjmp. This used to be no problem running on IBM z15 and newer, where there is actually VXE2 available. However, running on IBM z14 the signal handler gets executed what actually makes use of the corrupted sigjmp data structure to get back into cpuSupportsExt.

blapie commented 1 month ago

This looks like a brilliant alternative. However, to be able to accept this we would need to have the confirmation that it is applicable to all other supported target architectures (x86, aarch64/32, ppc64, risc-v, ...) and all supported platforms (here I'm more concerned the availability of sys/auxv.h on Windows).

@shibatch Is there anything fundamental preventing us from using this approach for feature detection? This looks ideal since it does not require compiling code snippets, am I missing something?

shibatch commented 1 month ago

This would be fine, but the new code depends on the ELF binary loader and glibc, right? From a portability point of view, adding a mutex with pthreads to the original code may be another option.

On Wed, Jul 10, 2024 at 12:02 AM Pierre Blanchard @.***> wrote:

This looks like a brilliant alternative. However, to be able to accept this we would need to have the confirmation that it is applicable to all other supported target architectures (x86, aarch64/32, ppc64, risc-v, ...) and all supported platforms (here I'm more concerned the availability of sys/auxv.h on Windows).

@shibatch https://github.com/shibatch Is there anything fundamental preventing us from using this approach for feature detection? This looks ideal since it does not require compiling code snippets, am I missing something?

— Reply to this email directly, view it on GitHub https://github.com/shibatch/sleef/pull/561#issuecomment-2217961122, or unsubscribe https://github.com/notifications/unsubscribe-auth/AD2A7F2RUA2ZPY4ZGQUIKTTZLP3QHAVCNFSM6AAAAABKLSEZ46VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMJXHE3DCMJSGI . You are receiving this because you were mentioned.Message ID: @.***>

Andreas-Krebbel commented 1 month ago

This would be fine, but the new code depends on the ELF binary loader and glibc, right?

Yes. The HWCAP bits are placed on the stack by the linux kernel before giving control to the binary. getauxval is a glibc function available since 2.16. musl libc seems to have it as well, but newlib doesn't. While this shouldn't be a problem for s390x, other targets might have to gate this with macro checks.

From a portability point of view, adding a mutex with pthreads to the original code may be another option.

I'm not sure whether this can be done easily. Right now sigjmp is written to in the normal path of execution, but it is used from a signal handler which might get executed on a different thread even. Perhaps an atomic variable set in the signal handler might be a solution here?! If the variable hasn't been updated by the signal handler cpuSupportsExt could update it to a positive result. Nevertheless, I think it would be good to avoid signal handlers in the lib as much as possible. This could also interact badly with signal handlers installed in applications using the library.

x86 uses the cpuid instruction to check for the availability of features. Targets which do not have unprivileged instructions like that, usually use HWCAPs to get this information carried over from the kernel. So I think this is a pretty common mechanism.

shibatch commented 1 month ago

If, as @blapie says, we are going to make similar changes to the code for all architectures, then we will have to think about how to do it for windows and macos.

For me, I think it would be easier to add mutex to the existing code. I don't have a particular preference as long as you modify it so that it works properly.

I also don't think that using signal and mutex will cause much of a problem. Here is a PoC code with mutex.

#include <stdio.h>
#include <signal.h>
#include <setjmp.h>
#include <pthread.h>

#define ENABLE_MUTEX

pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

static sigjmp_buf sigjmp;
#define SETJMP(x) sigsetjmp(x, 1)
#define LONGJMP siglongjmp

static void sighandler(int signum) {
  LONGJMP(sigjmp, 1);
}

static int cpuSupportsExt(void (*tryExt)()) {
  static int cache = -1;

#ifdef ENABLE_MUTEX
  pthread_mutex_lock(&mtx);
#else
  if (cache != -1) return cache;
#endif

  void (*org);
  org = signal(SIGSEGV, sighandler);  //org = signal(SIGILL, sighandler);

  if (SETJMP(sigjmp) == 0) {
    (*tryExt)();
    cache = 1;
  } else {
    cache = 0;
  }

  signal(SIGILL, org);

#ifdef ENABLE_MUTEX
  pthread_mutex_unlock(&mtx);
#endif

  return cache;
}

void segf() { *(int *)NULL = 0; }

int main(int argc, char **argv) {
  int r = 0;

#pragma omp parallel for
  for(int i=0;i<0x100000;i++) {
    r = cpuSupportsExt(segf) || r;
  }
  printf("%d\n", r);
}
Andreas-Krebbel commented 1 month ago

I think in general it would be good to avoid using signal handlers as much as possible here and only use it as last resort. There are several situations where this could break:

  1. The application or other libraries might use similar mechanisms to detect availability of instructions. A SIGILL triggered on a different thread (not using the mutex) might result in a double release of the mutex.
  2. Other threads might install SIGILL handlers on their own, intercepting the signal triggered by Sleef.
  3. "signal" itself is not thread-safe. On Linux "sigaction" should be used instead, but that's again only for Linux. With your example "signal" is done under the lock, but still there might be other threads invoking signal without knowing about the lock.

Adding the mutex makes the situation a bit better, but I think it would be good to revisit for every supported architecture whether an alternate (non-signal) solution could be implemented.

I'll try to help as much as I can, but I think the s390x specific PR to switch to getauxval could be merged independently. For s390x we will never need to fall back to the signal handler approach.

shibatch commented 1 month ago

As for me, I am not particularly against the idea, so I think it is fine as it is. The method using SIGILL should have been positioned as a last resort from the beginning. In any case, if the method is to be used on other architectures, it is quite possible that a method that does not depend on glibc functionality will be needed somewhere. For that reason, I think it is better to keep the SIGILL-based method. And that means we need to add mutex to the method. But you can remove the SIGILL-based method for s390x.

blapie commented 1 month ago

Hello,

Let's try to merge this if it's indeed solving a problem for you, but we need to extend to all other architectures quickly, as I'm worried about the cost of maintaining different detection mechanism for different architectures. Luckily it does look like your approach applies to most architectures anyway.

I'm a little less concerned about the fact that we have to use separate approaches for different OS-es (with or without glibc), there are fewer of them and it makes more sense to use different mechanisms in that case.

Andreas-Krebbel commented 1 month ago

Let's try to merge this if it's indeed solving a problem for you, but we need to extend to all other architectures quickly, as I'm worried about the cost of maintaining different detection mechanism for different architectures.

Thanks. I've also tried to integrate the mutex approach and it seems to work as well, although as discussed above, probably is not as robust as checking HWCAPs. But I think also the mutex would need some special handling on Windows. On Windows the static initializer of the mutex does not seem to work. There are alternatives though, but I don't know what is best here.

pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
blapie commented 1 month ago

Ok, thanks for trying that! Could you please upload a separate PR with the mutex in place for macos at least and guidance for windows maybe?

Other architectures than x86 and aarch64 imply that we are under Linux so the current PR is probably fine as is. For x86 and aarch64 though, we should probably stick to mutex or we will need to find a way to switch detection mechanism.

Andreas-Krebbel commented 1 month ago

Ok, thanks for trying that! Could you please upload a separate PR with the mutex in place for macos at least and guidance for windows maybe?

Other architectures than x86 and aarch64 imply that we are under Linux so the current PR is probably fine as is. For x86 and aarch64 though, we should probably stick to mutex or we will need to find a way to switch detection mechanism.

I've opened a draft PR for that now. The static mutex initializer probably only works on Linux right now. Didn't test Windows and MacOS, but some quick search indicates that this needs to be done differently there. Don't know about MingGW.

https://github.com/shibatch/sleef/pull/563

blapie commented 3 weeks ago

Unfortunately, cannot really test this on an actual machine, so will have to trust this is actually fixing the issue. Would be good to quickly extend this approach to at least ppc64 (power9/vsx3 detection).