Open pradghos opened 2 months ago
@pradghos we have upgraded the Sleef library after v2.3. Could you please try to reproduce on the latest daily build?
python -m pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu
@pradghos I don't know whether the daily build covers the s390x OS. Can't you also build the release/2.4 branch from source?
cc: @AlekseiNikiforovIBM
Apparently it is crashing in the routine trying to figure out whether VXE2 is available. Being on a z14 machine (not having VXE2) a SIGILL is supposed to happen. However, Sleef installs a signal handler to catch that. The SIGILL really should not surface to the application in that case. I've extracted the code and tried it on a z15 machine to detect a z16 feature and it worked fine. Compiling the following example with "gcc t.c -o t -march=z14 -mzvector" and running the program should result in an exit code of 0 on your box:
#include <stdint.h>
#include <stdio.h>
#include <assert.h>
#include <signal.h>
#include <setjmp.h>
#include <vecintrin.h>

__vector float sleef_cpuidtmp0;
__vector int sleef_cpuidtmp1;
static jmp_buf sigjmp;

/* Built for z15 so the compiler emits a VXE2 instruction here. */
__attribute__ ((__target__ ("arch=z15"))) void sleef_tryVXE2() {
    sleef_cpuidtmp0 = vec_float(sleef_cpuidtmp1);
}

static void sighandler(int signum) {
    longjmp(sigjmp, 1);
}

static int cpuSupportsVXE2() {
    static int cache = -1;
    if (cache != -1) return cache;

    /* Previous SIGILL handler, declared with the type signal() returns. */
    void (*org)(int);
    org = signal(SIGILL, sighandler);

    if (setjmp(sigjmp) == 0) {
        sleef_tryVXE2();
        cache = 1;   /* instruction executed: VXE2 is available */
    } else {
        cache = 0;   /* SIGILL was caught: VXE2 is not available */
    }

    signal(SIGILL, org);
    return cache;
}

int main ()
{
    return cpuSupportsVXE2();
}
Does this work as expected for you?
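For comparison, here is a minimal sketch of the same kind of probe written with sigaction and sigsetjmp/siglongjmp. POSIX leaves it unspecified whether plain setjmp/longjmp save and restore the signal mask, so jumping out of a handler with longjmp can leave SIGILL blocked on some systems. sigsetjmp with a non-zero second argument makes the restore explicit. This is not Sleef's code: `probe_instruction`, `fake_unsupported` and `fake_supported` are hypothetical names, and `raise(SIGILL)` only stands in for an unsupported instruction so the sketch runs on any POSIX machine.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

static sigjmp_buf probe_env;

static void probe_handler(int signum) {
    (void)signum;
    /* Restores the signal mask that sigsetjmp(probe_env, 1) saved. */
    siglongjmp(probe_env, 1);
}

/* Hypothetical helper: returns 1 if try_insn() completes, 0 if it raises SIGILL. */
static int probe_instruction(void (*try_insn)(void)) {
    struct sigaction sa, old;
    int ok;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = probe_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGILL, &sa, &old) != 0) return 0;

    if (sigsetjmp(probe_env, 1) == 0) {
        try_insn();
        ok = 1;
    } else {
        ok = 0;
    }

    sigaction(SIGILL, &old, NULL);  /* put the previous handler back */
    return ok;
}

/* Stand-ins for a real instruction probe (assumption, for illustration only). */
static void fake_unsupported(void) { raise(SIGILL); }
static void fake_supported(void) { }
```

Calling `probe_instruction(fake_unsupported)` several times in a row should keep returning 0, because the mask is restored after each jump.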
@Andreas-Krebbel FYI: https://github.com/pytorch/pytorch/pull/123936 merged last month, you can try the latest daily build.
@Andreas-Krebbel Thanks! It works fine on my system as expected, with an exit code of 0.
Yeah, using getauxval is definitely the better way. Installing signal handlers in a library might interfere with signal handlers used by the application. However, I'm curious to understand why it doesn't work in that particular case.
The merged code change adds proper facility detection to the aten code. Wouldn't we still need to do the same for Sleef then?!
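To make the getauxval approach concrete, here is a minimal sketch of hwcap-based detection that needs no signal handler. The helper name `cpu_supports_vxe2_hwcap` is made up for illustration and is not Sleef or PyTorch API. The fallback value for `HWCAP_S390_VXRS_EXT2` is an assumption taken from glibc's s390 hwcap header and is only used if the system headers do not define it. On other architectures the helper simply returns 0.

```c
#include <sys/auxv.h>

/* Hypothetical helper: asks the kernel-provided hwcap bits whether VXE2 is present. */
static int cpu_supports_vxe2_hwcap(void) {
#if defined(__s390x__)
#ifndef HWCAP_S390_VXRS_EXT2
#define HWCAP_S390_VXRS_EXT2 32768  /* assumption: value from glibc's s390 bits/hwcap.h */
#endif
    return (getauxval(AT_HWCAP) & HWCAP_S390_VXRS_EXT2) != 0;
#else
    return 0;  /* not an s390x build: the facility bit has no meaning here */
#endif
}
```

Since this only reads the auxiliary vector, it cannot clash with SIGILL handlers installed by the application.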
Thanks for checking. I'm wondering why Sleef doing the same thing fails then :(
sleef_tryVXE2
It seems Sleef does not handle the dispatch of disp_expf4_u10 correctly. PyTorch 2.2 uses a Sleef version that is two years old.
I upgraded the Sleef version after PyTorch 2.4. I still suggest you try the latest daily build: https://download.pytorch.org/whl/nightly/cpu
I'll work on a PR for the issue in Sleef. See the Sleef issue for more details.
Btw. the backtrace from the first comment is a red herring (and I fell for it too at first). GDB by default intercepts SIGILLs and that's what you see here in your backtrace. But in that case this is the normal operation of the feature detection in Sleef. The SIGILL is expected to happen here. The actual problem is triggered later. In order to see this you have to tell GDB not to intercept SIGILLs. This can be done with:
handle SIGILL nostop
🐛 Describe the bug
Hi,
I have been trying to load the pre-trained BERT model bert-base-cased downloaded from HuggingFace and optimize it with torch.compile() on linux-s390x. It crashed.
Version: PyTorch 2.2.0
Reproduce: Download the pre-trained bert-base-cased from HuggingFace (https://huggingface.co/google-bert/bert-base-cased/tree/main) and run the scripts below.
Versions
PyTorch 2.2.0
Python 3.10.14
transformers 4.24.0
Platform: Linux-s390x