pytorch / benchmark

TorchBench is a collection of open source benchmarks used to evaluate PyTorch performance.
BSD 3-Clause "New" or "Revised" License

CI Failures on A100 starting from 2023-12-01 #2081

Closed: xuzhao9 closed this issue 9 months ago

xuzhao9 commented 10 months ago

Multiple CUDA tests segfault, hit CUDA errors, or fail in eager runs: https://github.com/pytorch/benchmark/actions/runs/7175871410/job/19539876195

The problem appeared between the 2023-11-30 and 2023-12-01 nightly releases. Kicked off a bisection workflow to find the cause: https://github.com/pytorch/benchmark/actions/runs/7183986257

xuzhao9 commented 10 months ago

Steps to reproduce:

docker pull ghcr.io/pytorch/torchbench:latest
docker run -it --gpus all ghcr.io/pytorch/torchbench:latest /bin/bash
cd /workspace/benchmark; python run.py yolov3 -d cuda -t eval

cc @malfet

xuzhao9 commented 10 months ago

Stack backtrace:

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007f8b720afccc in ?? () from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.11
(gdb) bt
#0  0x00007f8b720afccc in ?? () from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.11
#1  0x00007f8b720b1cf5 in ?? () from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.11
#2  0x00007f8b720b9924 in cublasLtMatmulAlgoGetHeuristic () from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.11
#3  0x00007f8aa184ba34 in ?? () from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8
#4  0x00007f8aa184c317 in cudnn::cnn::WinogradNonfusedEngine<true>::initSupported() () from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8
#5  0x00007f8aa14482f1 in cudnn::cnn::EngineInterface::isSupported() () from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8
#6  0x00007f8aa1526d99 in cudnn::cnn::GeneralizedConvolutionEngine<cudnn::cnn::WinogradNonfusedEngine<true> >::initSupported() ()
   from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8
#7  0x00007f8aa14482f1 in cudnn::cnn::EngineInterface::isSupported() () from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8
#8  0x00007f8aa134d098 in cudnn::backend::ExecutionPlan::finalize_internal() () from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8
#9  0x00007f8aa1335772 in ?? () from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8
#10 0x00007f8aa1335e49 in cudnn::backend::finalizeDescriptor(void*) () from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8
#11 0x00007f8aa1338673 in cudnnBackendFinalize () from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8
#12 0x00007f8bc7dc31df in cudnn_frontend::ExecutionPlanBuilder_v8::build() () from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
#13 0x00007f8bc7dc4f63 in cudnn_frontend::EngineConfigGenerator::cudnnGetPlan(cudnnContext*, cudnn_frontend::OperationGraph_v8&) () from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
#14 0x00007f8bc7dab735 in at::native::generate_and_filter_plans(cudnnContext*, cudnn_frontend::OperationGraph_v8&, cudnn_frontend::EngineConfigGenerator&, at::Tensor const&, std::vector<cudnn_frontend::ExecutionPlan_v8, std::allocator<cudnn_frontend::ExecutionPlan_v8> >&, c10::DataPtr&) () from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
#15 0x00007f8bc7db42db in at::native::run_single_conv(cudnnBackendDescriptorType_t, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool) ()
   from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
#16 0x00007f8bc7db4cfb in at::native::raw_cudnn_convolution_forward_out(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool) ()
   from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
#17 0x00007f8bc7d98a87 in at::native::cudnn_convolution_forward(char const*, at::TensorArg const&, at::TensorArg const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool) ()
   from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
#18 0x00007f8bc7d990a6 in at::native::cudnn_convolution(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool) ()
   from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
#19 0x00007f8bc9e1fd7e in at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA__cudnn_convolution(at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::SymInt, bool, bool, bool) () from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
#20 0x00007f8bc9e36de1 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::SymInt, bool, bool, bool), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA__cudnn_convolution>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::SymInt, bool, bool, bool> >, at::Tensor (at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::SymInt, bool, bool, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::SymInt, bool, bool, bool) ()
   from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
#21 0x00007f8bff7431fb in at::_ops::cudnn_convolution::call(at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::SymInt, bool, bool, bool) ()
   from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
#22 0x00007f8bfe9ba58b in at::native::_convolution(at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long, bool, bool, bool, bool) ()
   from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
#23 0x00007f8bffb06bff in at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeExplicitAutograd___convolution(at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, bool, c10::ArrayRef<c10::SymInt>, c10::SymInt, bool, bool, bool, bool) () from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
#24 0x00007f8bffb0d55c in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, bool, c10::ArrayRef<c10::SymInt>, c10::SymInt, bool, bool, bool, bool), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeExplicitAutograd___convolution>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, bool, c10::ArrayRef<c10::SymInt>, c10::SymInt, bool, bool, bool, bool> >, at::Tensor (at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, bool, c10::ArrayRef<c10::SymInt>, c10::SymInt, bool, bool, bool, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, bool, c10::ArrayRef<c10::SymInt>, c10::SymInt, bool, bool, bool, bool) ()
   from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
#25 0x00007f8bff262f34 in at::_ops::_convolution::call(at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, bool, c10::ArrayRef<c10::SymInt>, c10::SymInt, bool, bool, bool, bool) () from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
#26 0x00007f8bfe9ad5b8 in at::native::convolution(at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long) ()
   from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
#27 0x00007f8bffb0649c in at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeExplicitAutograd__convolution(at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, bool, c10::ArrayRef<c10::SymInt>, c10::SymInt) () from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
#28 0x00007f8bffb0d3c8 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, bool, c10::ArrayRef<c10::SymInt>, c10::SymInt), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeExplicitAutograd__convolution>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, bool, c10::ArrayRef<c10::SymInt>, c10::SymInt> >, at::Tensor (at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, bool, c10::ArrayRef<c10::SymInt>, c10::SymInt)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, bool, c10::ArrayRef<c10::SymInt>, c10::SymInt) () from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
malfet commented 10 months ago

I suspect it's https://github.com/pytorch/pytorch/pull/114620 (TorchBench is still tested against CUDA 11.8, isn't it?)
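One quick way to confirm which CUDA toolkit a nightly wheel targets is to look at the local-version tag in `torch.__version__`. A minimal sketch; the `+cuXYZ` suffix format is assumed from the PyTorch nightly wheel naming, and `cuda_tag` is a hypothetical helper:

```python
def cuda_tag(torch_version: str):
    """Return the local-version tag (e.g. 'cu118' or 'cu121') from a
    torch wheel version string such as '2.2.0.dev20231201+cu118'.
    Returns None when no '+' suffix is present (e.g. a release wheel)."""
    _, sep, local = torch_version.partition("+")
    return local if sep else None

# In a live environment one would pass torch.__version__ instead:
print(cuda_tag("2.2.0.dev20231201+cu118"))  # cu118
print(cuda_tag("2.1.0"))                    # None
```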

malfet commented 10 months ago

@xuzhao9 I'm not sure what is wrong with the setup, but for me the command fails with Arguments: (NVMLError_LibRmVersionMismatch(18),)

xuzhao9 commented 10 months ago

@malfet Yes, the docker image is 11.8 by default, but it also comes with 12.1. I actually tried both 11.8 and 12.1, and both of them fail. The stack backtraces are similar.

Could you please also try the following?

cd /workspace/benchmark; python run.py yolov3 -d cuda -t train
malfet commented 10 months ago

OK, I cannot run the thing inside the docker container, but the test runs fine using cuda-11.8 on A100 with the same build outside of the container:

$ python run.py yolov3 -d cuda -t train
Running train method from yolov3 on cuda in eager mode with input batch size 16 and precision fp32.
/home/nshulga/git/pytorch/benchmark/torchbenchmark/models/yolov3/yolo_utils/utils.py:349: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  lcls, lbox, lobj = ft([0]), ft([0]), ft([0])
/home/nshulga/miniconda3/envs/py311-cu118/lib/python3.11/site-packages/torch/cuda/memory.py:440: FutureWarning: torch.cuda.memory_cached has been renamed to torch.cuda.memory_reserved
  warnings.warn(
GPU Time per batch:  111.107 milliseconds
CPU Wall Time per batch: 111.138 milliseconds
Time to first batch:         5509.6099 ms
GPU 0 Peak Memory:             10.8764 GB
CPU Peak Memory:                5.3818 GB
malfet commented 10 months ago

OK, a few updates: this seems to be because torchbench installs tensorrt, which somehow depends on CUDA 12, though I still cannot reproduce the crash outside of docker.
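A quick way to audit which CUDA-flavored packages have ended up in the environment is to scan installed distributions for the relevant name prefixes. A stdlib-only sketch; the prefixes below are illustrative, not an exhaustive list of NVIDIA package names:

```python
from importlib import metadata

def cuda_related_packages():
    """Return sorted (name, version) pairs for installed packages whose
    names suggest they bundle CUDA libraries. These are the packages
    that can drop a second libcublasLt/libcudnn into site-packages."""
    found = []
    for dist in metadata.distributions():
        name = (dist.metadata["Name"] or "").lower()
        if name.startswith(("nvidia-", "tensorrt", "torch-tensorrt")):
            found.append((name, dist.version))
    return sorted(found)

for name, version in cuda_related_packages():
    print(f"{name}=={version}")
```

In the failing environment this should show whether both `-cu11` and `-cu12` variants of the NVIDIA runtime packages are installed side by side.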

@xuzhao9 next time it crashes for you, do you mind running:

lldb -- python run.py yolov3 -d cuda -t train

and when it crashes, run image list to print which libraries are loaded?

xuzhao9 commented 10 months ago

The torch-tensorrt owner submitted https://github.com/pytorch/benchmark/pull/2092, which should fix this problem.

xuzhao9 commented 10 months ago

@malfet Here is the result (internal only): P913987841

xuzhao9 commented 10 months ago

Here is how to reproduce the segmentation fault:

  1. Install both CUDA 11.8 and cuDNN 8.7 into a directory, say $HOME/cuda-11.8
  2. Install torch nightly cu118
  3. Install nvidia-*-cu12 packages
  4. Run LD_LIBRARY_PATH=$HOME/cuda-11.8 python run.py yolov3 -d cuda -t train
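The failure mode in the steps above is a library-search collision: the directory listed in LD_LIBRARY_PATH and the pip-installed nvidia wheel directories can each provide a shared object with the same soname, and whichever the loader finds first wins (the real loader also consults RPATH/RUNPATH entries baked into each binary, which is how torch wheels locate the site-packages nvidia libs). A simplified stdlib-only model of that "first directory wins" behavior, with hypothetical file names:

```python
import os
import tempfile

def first_match(soname, search_dirs):
    """Simplified model of dynamic-library resolution: return the path
    of the first `soname` found while walking `search_dirs` in order,
    or None if no directory contains it."""
    for d in search_dirs:
        candidate = os.path.join(d, soname)
        if os.path.exists(candidate):
            return candidate
    return None

# Two hypothetical install locations providing the same soname.
with tempfile.TemporaryDirectory() as cuda118, tempfile.TemporaryDirectory() as pip_nvidia:
    for d in (cuda118, pip_nvidia):
        open(os.path.join(d, "libcublasLt.so.11"), "w").close()
    # With the manual CUDA install first on the path, its copy wins;
    # a binary resolved through RPATH can still pick up the other copy,
    # mixing two toolkit versions inside one process.
    print(first_match("libcublasLt.so.11", [cuda118, pip_nvidia]))
```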

cc @malfet

xuzhao9 commented 9 months ago

Worked around by not installing the tensorrt libraries.