
Crash at sleef_tryVXE2() while trying to run torch.compile() on a BERT model #128503

Open pradghos opened 2 months ago

pradghos commented 2 months ago

🐛 Describe the bug

Hi,

I have been trying to load the pre-trained BERT model bert-base-cased, downloaded from Hugging Face, and optimize it with torch.compile() on Linux s390x. It crashes with the following backtrace:

0x000003ffb91c4418 in sleef_tryVXE2 () from /root/anaconda3/envs/pytorch_test/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
(gdb) bt
#0  0x000003ffb91c4418 in sleef_tryVXE2 () from /root/anaconda3/envs/pytorch_test/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#1  0x000003ffb91bedcc in cpuSupportsVXE2 () from /root/anaconda3/envs/pytorch_test/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#2  0x000003ffb91c02c4 in disp_expf4_u10 () from /root/anaconda3/envs/pytorch_test/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#3  0x000003ffb5dac610 in at::native::(anonymous namespace)::softmax_lastdim_kernel_impl(at::Tensor const&, at::Tensor const&) ()
   from /root/anaconda3/envs/pytorch_test/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#4  0x000003ffb4175890 in at::native::structured_softmax_cpu_out::impl(at::Tensor const&, long, bool, at::Tensor const&) ()
   from /root/anaconda3/envs/pytorch_test/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#5  0x000003ffb4e7ed92 in at::(anonymous namespace)::wrapper_CPU__softmax(at::Tensor const&, long, bool) ()
   from /root/anaconda3/envs/pytorch_test/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#6  0x000003ffb4e7ee28 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, long, bool), &at::(anonymous namespace)::wrapper_CPU__softmax>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, long, bool> >, at::Tensor (at::Tensor const&, long, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, long, bool) ()
   from /root/anaconda3/envs/pytorch_test/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#7  0x000003ffb4b76af2 in at::_ops::_softmax::redispatch(c10::DispatchKeySet, at::Tensor const&, long, bool) ()
   from /root/anaconda3/envs/pytorch_test/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#8  0x000003ffb7949916 in torch::autograd::VariableType::(anonymous namespace)::_softmax(c10::DispatchKeySet, at::Tensor const&, long, bool) ()
  from /root/anaconda3/envs/pytorch_test/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#9  0x000003ffb7949f8c in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, long, bool), &torch::autograd::VariableType::(anonymous namespace)::_softmax>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, long, bool> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&, long, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, long, bool) () from /root/anaconda3/envs/pytorch_test/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
...
..
arType>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, long, std::optional<c10::ScalarType>) ()
   from /root/anaconda3/envs/pytorch_test/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#13 0x000003ffb4bf28b0 in at::_ops::softmax_int::call(at::Tensor const&, long, std::optional<c10::ScalarType>) ()
   from /root/anaconda3/envs/pytorch_test/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#14 0x000003ffbb39d020 in torch::autograd::THPVariable_softmax(_object*, _object*, _object*) ()
   from /root/anaconda3/envs/pytorch_test/lib/python3.10/site-packages/torch/lib/libtorch_python.so
#15 0x00000000010a7162 in method_vectorcall_VARARGS_KEYWORDS (func=<error reading variable: value has been optimized out>, args=args@entry=0x3ffb0ae43b0,
    nargsf=<optimized out>, kwnames=kwnames@entry=0x0) at /usr/local/src/conda/python-3.10.14/Objects/descrobject.c:344

Version: PyTorch 2.2.0

Reproduce: download the pre-trained bert-base-cased model from Hugging Face: https://huggingface.co/google-bert/bert-base-cased/tree/main

Then run the script below:


from transformers import BertTokenizer, BertModel
import torch
tokenizer = BertTokenizer.from_pretrained('/root/bert-base-cased')
model = BertModel.from_pretrained("/root/bert-base-cased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)

# optimize with torch.compile

compile_model = torch.compile(model)
output = compile_model(**encoded_input)  # it will crash here

Versions

PyTorch 2.2.0, Python 3.10.14, transformers 4.24.0

Platform: Linux s390x

xuhancn commented 2 months ago

@pradghos we upgraded the SLEEF library after v2.3. Could you please try to reproduce this on the latest nightly build?

python -m pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu
xuhancn commented 2 months ago

@pradghos I'm not sure whether the nightly build covers s390x; if it doesn't, you can also build the release/2.4 branch from source.

malfet commented 2 months ago

cc: @AlekseiNikiforovIBM

Andreas-Krebbel commented 2 months ago

Apparently it is crashing in the routine that tries to figure out whether VXE2 is available. On a z14 machine (which does not have VXE2), a SIGILL is supposed to happen; however, Sleef installs a signal handler to catch it, so the SIGILL really should not surface to the application. I've extracted the code and tried it on a z15 machine to detect a z16 feature, and it worked fine. Compiling the following example with "gcc t.c -o t -march=z14 -mzvector" and running the program should result in an exit code of 0 on your box:

#include <stdint.h>
#include <stdio.h>
#include <assert.h>

#include <signal.h>
#include <setjmp.h>
#include <vecintrin.h>

__vector float sleef_cpuidtmp0;
__vector int sleef_cpuidtmp1;
static jmp_buf sigjmp;

/* Executes a VXE2 (z15) instruction; on older machines this raises SIGILL. */
__attribute__ ((__target__ ("arch=z15"))) void sleef_tryVXE2() {
  sleef_cpuidtmp0 = vec_float(sleef_cpuidtmp1);
}

static void sighandler(int signum) {
  longjmp(sigjmp, 1);
}

static int cpuSupportsVXE2() {
  static int cache = -1;
  if (cache != -1) return cache;

  /* Temporarily install our SIGILL handler, remembering the original. */
  void (*org)(int);
  org = signal(SIGILL, sighandler);

  if (setjmp(sigjmp) == 0) {
    sleef_tryVXE2();  /* returns normally only if VXE2 is available */
    cache = 1;
  } else {
    cache = 0;        /* we got here via longjmp from the SIGILL handler */
  }

  signal(SIGILL, org);
  return cache;
}

int main ()
{
  return cpuSupportsVXE2();
}
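
For reference, one way to compile and check the exit code from a shell (the 0 is the expected result on a machine without VXE2, i.e. the SIGILL was caught cleanly):

$ gcc t.c -o t -march=z14 -mzvector
$ ./t; echo $?
0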

Does this work as expected for you?

xuhancn commented 2 months ago

> Apparently it is crashing in the routine that tries to figure out whether VXE2 is available. […] Does this work as expected for you?

@Andreas-Krebbel FYI: https://github.com/pytorch/pytorch/pull/123936 merged last month, you can try the latest daily build.

pradghos commented 2 months ago

> Apparently it is crashing in the routine that tries to figure out whether VXE2 is available. […] Does this work as expected for you?

@Andreas-Krebbel Thanks! It works fine on my system as expected, with an exit code of 0.

Andreas-Krebbel commented 2 months ago

> @Andreas-Krebbel FYI: #123936 merged last month, you can try the latest daily build.

Yeah, using getauxval is definitely the better way. Installing a signal handler in a library might interfere with signal handlers used by the application. However, I'm curious to understand why it doesn't work in this particular case.

The merged code change adds proper facility detection to the ATen code. Wouldn't we still need to do the same for Sleef then?!
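
For illustration, a minimal sketch of what getauxval-based detection could look like on s390x (this sketches the approach, not the exact code from #123936; the fallback #define value is an assumption for older glibc headers):

#include <stdio.h>
#include <sys/auxv.h>   /* getauxval, AT_HWCAP */

/* glibc exposes this via <sys/auxv.h>/<bits/hwcap.h> on s390x; the
   fallback below is an assumed value for older headers. */
#ifndef HWCAP_S390_VXRS_EXT2
#define HWCAP_S390_VXRS_EXT2 32768
#endif

/* Ask the kernel via the auxiliary vector instead of trapping SIGILL. */
static int cpuSupportsVXE2(void) {
  return (getauxval(AT_HWCAP) & HWCAP_S390_VXRS_EXT2) != 0;
}

int main(void) {
  printf("VXE2 supported: %d\n", cpuSupportsVXE2());
  return 0;
}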

Andreas-Krebbel commented 2 months ago

> @Andreas-Krebbel Thanks! It works fine on my system as expected, with an exit code of 0.

Thanks for checking. I'm wondering why Sleef fails when it does the same thing, then :(

xuhancn commented 2 months ago

> sleef_tryVXE2

It seems Sleef does not handle the dispatch of disp_expf4_u10 correctly. PyTorch 2.2 uses a Sleef version from two years ago; I upgraded Sleef after PyTorch 2.4. I still suggest you try the latest nightly build: https://download.pytorch.org/whl/nightly/cpu

Andreas-Krebbel commented 2 months ago

I'll work on a PR for the issue in Sleef. See the Sleef issue for more details.

By the way, the backtrace in the first comment is a red herring (and I fell for it too at first). GDB intercepts SIGILL by default, and that is what you see in the backtrace. In this case, though, it is the normal operation of Sleef's feature detection: the SIGILL is expected to happen there. The actual problem is triggered later. To see it, you have to tell GDB not to intercept SIGILL, which can be done with:

handle SIGILL nostop
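
For example (hypothetical session; repro.py stands in for the reproducer script above, and the extra noprint pass keywords also suppress GDB's message and forward the signal to the program):

$ gdb --args python repro.py
(gdb) handle SIGILL nostop noprint pass
(gdb) run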