Closed: xuzhao9 closed this issue 9 months ago.
Steps to reproduce:
docker pull ghcr.io/pytorch/torchbench:latest
docker run -it --gpus all ghcr.io/pytorch/torchbench:latest /bin/bash
cd /workspace/benchmark; python run.py yolov3 -d cuda -t eval
cc @malfet
Stack backtrace:
Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007f8b720afccc in ?? () from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.11
(gdb) bt
#0 0x00007f8b720afccc in ?? () from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.11
#1 0x00007f8b720b1cf5 in ?? () from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.11
#2 0x00007f8b720b9924 in cublasLtMatmulAlgoGetHeuristic () from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.11
#3 0x00007f8aa184ba34 in ?? () from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8
#4 0x00007f8aa184c317 in cudnn::cnn::WinogradNonfusedEngine<true>::initSupported() () from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8
#5 0x00007f8aa14482f1 in cudnn::cnn::EngineInterface::isSupported() () from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8
#6 0x00007f8aa1526d99 in cudnn::cnn::GeneralizedConvolutionEngine<cudnn::cnn::WinogradNonfusedEngine<true> >::initSupported() ()
from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8
#7 0x00007f8aa14482f1 in cudnn::cnn::EngineInterface::isSupported() () from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8
#8 0x00007f8aa134d098 in cudnn::backend::ExecutionPlan::finalize_internal() () from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8
#9 0x00007f8aa1335772 in ?? () from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8
#10 0x00007f8aa1335e49 in cudnn::backend::finalizeDescriptor(void*) () from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8
#11 0x00007f8aa1338673 in cudnnBackendFinalize () from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8
#12 0x00007f8bc7dc31df in cudnn_frontend::ExecutionPlanBuilder_v8::build() () from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
#13 0x00007f8bc7dc4f63 in cudnn_frontend::EngineConfigGenerator::cudnnGetPlan(cudnnContext*, cudnn_frontend::OperationGraph_v8&) () from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
#14 0x00007f8bc7dab735 in at::native::generate_and_filter_plans(cudnnContext*, cudnn_frontend::OperationGraph_v8&, cudnn_frontend::EngineConfigGenerator&, at::Tensor const&, std::vector<cudnn_frontend::ExecutionPlan_v8, std::allocator<cudnn_frontend::ExecutionPlan_v8> >&, c10::DataPtr&) () from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
#15 0x00007f8bc7db42db in at::native::run_single_conv(cudnnBackendDescriptorType_t, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool) ()
from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
#16 0x00007f8bc7db4cfb in at::native::raw_cudnn_convolution_forward_out(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool) ()
from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
#17 0x00007f8bc7d98a87 in at::native::cudnn_convolution_forward(char const*, at::TensorArg const&, at::TensorArg const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool) ()
from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
#18 0x00007f8bc7d990a6 in at::native::cudnn_convolution(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool) ()
from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
#19 0x00007f8bc9e1fd7e in at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA__cudnn_convolution(at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::SymInt, bool, bool, bool) () from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
#20 0x00007f8bc9e36de1 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::SymInt, bool, bool, bool), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA__cudnn_convolution>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::SymInt, bool, bool, bool> >, at::Tensor (at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::SymInt, bool, bool, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::SymInt, bool, bool, bool) ()
from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
#21 0x00007f8bff7431fb in at::_ops::cudnn_convolution::call(at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::SymInt, bool, bool, bool) ()
from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
#22 0x00007f8bfe9ba58b in at::native::_convolution(at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long, bool, bool, bool, bool) ()
from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
#23 0x00007f8bffb06bff in at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeExplicitAutograd___convolution(at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, bool, c10::ArrayRef<c10::SymInt>, c10::SymInt, bool, bool, bool, bool) () from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
#24 0x00007f8bffb0d55c in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, bool, c10::ArrayRef<c10::SymInt>, c10::SymInt, bool, bool, bool, bool), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeExplicitAutograd___convolution>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, bool, c10::ArrayRef<c10::SymInt>, c10::SymInt, bool, bool, bool, bool> >, at::Tensor (at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, bool, c10::ArrayRef<c10::SymInt>, c10::SymInt, bool, bool, bool, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, bool, c10::ArrayRef<c10::SymInt>, c10::SymInt, bool, bool, bool, bool) ()
from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
#25 0x00007f8bff262f34 in at::_ops::_convolution::call(at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, bool, c10::ArrayRef<c10::SymInt>, c10::SymInt, bool, bool, bool, bool) () from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
#26 0x00007f8bfe9ad5b8 in at::native::convolution(at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long) ()
from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
#27 0x00007f8bffb0649c in at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeExplicitAutograd__convolution(at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, bool, c10::ArrayRef<c10::SymInt>, c10::SymInt) () from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
#28 0x00007f8bffb0d3c8 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, bool, c10::ArrayRef<c10::SymInt>, c10::SymInt), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeExplicitAutograd__convolution>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, bool, c10::ArrayRef<c10::SymInt>, c10::SymInt> >, at::Tensor (at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, bool, c10::ArrayRef<c10::SymInt>, c10::SymInt)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, bool, c10::ArrayRef<c10::SymInt>, c10::SymInt) () from /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
I suspect it's https://github.com/pytorch/pytorch/pull/114620 (TorchBench is still tested against 11.8, isn't it?)
@xuzhao9 not sure what is wrong with the setup, but for me the command fails with Arguments: (NVMLError_LibRmVersionMismatch(18),)
@malfet Yes, the docker image is 11.8 by default, but it also comes with 12.1. I actually tried both 11.8 and 12.1, and both of them fail. The stack backtraces are similar.
Could you please also try the following?
cd /workspace/benchmark; python run.py yolov3 -d cuda -t train
OK, I cannot run it inside the Docker container, but the test runs fine with cuda-11.8 on an A100 using the same build outside of the container:
$ python run.py yolov3 -d cuda -t train
Running train method from yolov3 on cuda in eager mode with input batch size 16 and precision fp32.
/home/nshulga/git/pytorch/benchmark/torchbenchmark/models/yolov3/yolo_utils/utils.py:349: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
lcls, lbox, lobj = ft([0]), ft([0]), ft([0])
/home/nshulga/miniconda3/envs/py311-cu118/lib/python3.11/site-packages/torch/cuda/memory.py:440: FutureWarning: torch.cuda.memory_cached has been renamed to torch.cuda.memory_reserved
warnings.warn(
GPU Time per batch: 111.107 milliseconds
CPU Wall Time per batch: 111.138 milliseconds
Time to first batch: 5509.6099 ms
GPU 0 Peak Memory: 10.8764 GB
CPU Peak Memory: 5.3818 GB
OK, a few updates: this seems to be caused by torchbench installing tensorrt, which somehow depends on cuda12, though I still cannot reproduce the crash outside of Docker.
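To check whether CUDA 11 and CUDA 12 wheels ended up installed side by side in the same environment, here is a quick stdlib-only sketch (the nvidia-*/tensorrt package-name prefixes are assumptions based on the usual pip wheel names, not something verified in this thread):

```python
# Hypothetical check: list installed nvidia-*/tensorrt pip packages so a mix
# of CUDA 11 and CUDA 12 wheels (e.g. tensorrt pulling in cu12 wheels into a
# cu11 environment) becomes visible at a glance.
from importlib import metadata

def nvidia_packages():
    pkgs = {}
    for dist in metadata.distributions():
        name = (dist.metadata["Name"] or "").lower()
        if name.startswith(("nvidia-", "tensorrt")):
            pkgs[name] = dist.version
    return pkgs

for name, ver in sorted(nvidia_packages().items()):
    print(f"{name}=={ver}")
```

Seeing both, say, nvidia-cublas-cu11 and nvidia-cublas-cu12 in the output would be consistent with the mixed-library hypothesis above.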
@xuzhao9 next time it crashes for you, do you mind running:
lldb -- python run.py yolov3 -d cuda -t train
and when it crashes, run image list to print which libraries are loaded?
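If lldb is not handy, on Linux one can approximate lldb's image list from inside the process by parsing /proc/self/maps; this is a hypothetical helper (not something run in this thread) for spotting which copy of libcublasLt/libcudnn actually got mapped:

```python
# Rough equivalent of lldb's `image list` on Linux: collect the shared
# objects currently mapped into this process from /proc/self/maps.
import re

def loaded_libraries(maps_path="/proc/self/maps"):
    libs = set()
    with open(maps_path) as f:
        for line in f:
            # map lines end with the backing file path, e.g. .../libc.so.6
            m = re.search(r"(\S+\.so[^\s]*)$", line.strip())
            if m:
                libs.add(m.group(1))
    return sorted(libs)

# Print only the CUDA math libraries, the ones implicated in the backtrace.
for path in loaded_libraries():
    if "cublas" in path or "cudnn" in path:
        print(path)
```

Calling this right before the crashing convolution would show whether the cu11 or cu12 copy of libcublasLt was resolved.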
The torch-tensorrt owner submitted https://github.com/pytorch/benchmark/pull/2092, which should fix this problem.
@malfet Here is the result (internal only): P913987841
Here is how to reproduce the segmentation fault:
LD_LIBRARY_PATH=$HOME/cuda-11.8 python run.py yolov3 -d cuda -t train
cc @malfet
Worked around by not installing the tensorrt libraries.
Multiple CUDA tests segfault, hit CUDA errors, or fail in eager mode: https://github.com/pytorch/benchmark/actions/runs/7175871410/job/19539876195
The problem appeared between the 2023-11-30 and 2023-12-01 nightly releases. Kicked off a bisection workflow to find the cause: https://github.com/pytorch/benchmark/actions/runs/7183986257
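The bisection workflow above is essentially a binary search over nightly builds; a minimal sketch of the idea (is_broken is a hypothetical callback that would install the nightly for a given date and run the repro):

```python
# Binary search for the first nightly build where the test starts failing.
# Assumes dates are sorted, dates[0] is known good, and dates[-1] is known bad.
def first_bad(dates, is_broken):
    lo, hi = 0, len(dates) - 1
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if is_broken(dates[mid]):
            hi = mid  # failure reproduces: first bad build is at or before mid
        else:
            lo = mid  # still good: first bad build is after mid
    return dates[hi]
```

With one repro run per probe, narrowing a month of nightlies to a single day takes about five iterations.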