pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

matmul with CSR matrix in inference mode throws an exception #98004

Open KrisDemuynck opened 1 year ago

KrisDemuynck commented 1 year ago

🐛 Describe the bug

# BUG: matmul with CSR matrix in inference mode throws an exception
import torch
x = torch.randn((4, 8, 5), dtype=torch.float)
M = torch.Tensor.to_sparse_csr(torch.eye(8))
# non-inference mode works
y = torch.matmul(M, x)
# inference mode throws RuntimeError: Cannot set version_counter for inference tensor
with torch.inference_mode():
    y = torch.matmul(M, x)

Versions

Collecting environment information...
PyTorch version: 1.13.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: 10.0.0-4ubuntu1
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.8.10 (default, Nov 14 2022, 12:59:47) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.4.0-137-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 12.0.76
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA TITAN Xp
Nvidia driver version: 525.60.13
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.7.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 158
Model name: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
Stepping: 10
CPU MHz: 4300.000
CPU max MHz: 4700.0000
CPU min MHz: 800.0000
BogoMIPS: 7399.70
Virtualization: VT-x
L1d cache: 192 KiB
L1i cache: 192 KiB
L2 cache: 1.5 MiB
L3 cache: 12 MiB
NUMA node0 CPU(s): 0-11
Vulnerability Itlb multihit: KVM: Vulnerable
Vulnerability L1tf: Mitigation; PTE Inversion
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Retbleed: Mitigation; IBRS
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Mitigation; Microcode
Vulnerability Tsx async abort: Mitigation; Clear CPU buffers; SMT vulnerable
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] numpy==1.21.1
[pip3] numpydoc==0.7.0
[pip3] torch==1.13.1
[pip3] torchaudio==0.13.1
[pip3] torchvision==0.14.1
[conda] Could not collect

cc @ezyang @albanD @zou3519 @gqchen @pearu @nikitaved @soulitzer @Lezcano @Varal7 @alexsamardzic @cpuhrsch @amjames @bhosmer

pat749 commented 1 year ago

A workaround for this issue is to avoid using torch.inference_mode() when working with sparse CSR matrices, or to convert them to dense tensors before using them in the operation.

import torch
x = torch.randn((4, 8, 5), dtype=torch.float)
M = torch.Tensor.to_sparse_csr(torch.eye(8))

# convert to dense tensor before using matmul
M_dense = M.to_dense()
with torch.inference_mode():
    y = torch.matmul(M_dense, x)

With this change, no error is thrown.

This is also a known bug in PyTorch when using torch.inference_mode() with sparse CSR matrices. Inference mode is meant for faster and more memory-efficient inference by disabling some internal checks and features. However, for some operations with sparse tensors, such as torch.matmul, PyTorch needs to update internal version counters for correctness, and this is not allowed in inference mode. As a result, you may encounter a RuntimeError with the message "Cannot set version_counter for inference tensor" when running such operations.

You can read more in the PyTorch documentation.
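
To make the version-counter explanation above a bit more concrete, here is a minimal sketch showing how a tensor created inside torch.inference_mode() differs from a regular one. The helper name describe is made up for this illustration, and _version is a private attribute, so treat the second flag as an assumption about current PyTorch behaviour rather than a stable API.

```python
import torch

def describe(t):
    """Return (is_inference, tracks_version) for tensor t (hypothetical helper)."""
    try:
        t._version  # private attribute; expected to raise RuntimeError on inference tensors
        tracks_version = True
    except RuntimeError:
        tracks_version = False
    return t.is_inference(), tracks_version

normal = torch.ones(3)            # created outside inference mode
with torch.inference_mode():
    inferred = torch.ones(3)      # created inside: an inference tensor

normal_flags = describe(normal)
inferred_flags = describe(inferred)
print("normal:  ", normal_flags)
print("inferred:", inferred_flags)
```

Any code path that tries to attach or bump a version counter on a tensor like inferred is the kind of operation that inference mode rejects.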

KrisDemuynck commented 1 year ago

Thanks for the quick and informative reply!

Given that my matrix is huge (4045x4045), very sparse (0.6% filled) and needs to be applied billions of times, disabling inference mode is the best option. It is a solution I was already testing and which seems to work well; however, it still felt like a "hack", hence the "bug" report.

As for the "This is a known bug" ...

Note: Adding your clear explanation to the pytorch sparse documentation would have been very useful for me.

pat749 commented 1 year ago

@KrisDemuynck I will open a pull request to add this. Thank you.

malfet commented 1 year ago

Hmm, I can repro it with a 2.0 build, digging a bit further... (just for my own education)

albanD commented 1 year ago

Note that for me, on latest master, the first matmul fails with RuntimeError: Sparse CSR tensors do not have strides.

cpuhrsch commented 1 year ago

@pearu - can you take a look? Looks like we might need a smoke test for this.
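
A smoke test along these lines could look roughly like the sketch below. It is only an illustration: the function name probe_csr_matmul is hypothetical, and the script deliberately reports whatever the installed build does instead of assuming which of the two modes passes.

```python
import torch

def probe_csr_matmul(inference):
    """Run the reported matmul; return "ok", "wrong result", or the error message."""
    x = torch.randn(4, 8, 5)
    M = torch.eye(8).to_sparse_csr()
    try:
        if inference:
            with torch.inference_mode():
                y = torch.matmul(M, x)
        else:
            y = torch.matmul(M, x)
        # M is the identity, so a successful result must reproduce x.
        return "ok" if torch.allclose(y, x) else "wrong result"
    except RuntimeError as err:
        return str(err)

results = {mode: probe_csr_matmul(mode) for mode in (False, True)}
for mode, outcome in results.items():
    print(f"inference_mode={mode}: {outcome}")
```

In a real test suite the two outcomes would be asserted to equal "ok"; here they are only printed, since the thread shows the result depends on the PyTorch version and build.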

malfet commented 1 year ago

> Note that for me, on latest master, the first matmul fails with RuntimeError: Sparse CSR tensors do not have strides.

Yeah, I'm seeing the same, but I assumed this was because I built without MKL support, though on M1 everything works fine. It comes from the call to expand:

* thread #1, name = 'python', stop reason = breakpoint 1.1
  * frame #0: 0x00007fffc765e7b0 libstdc++.so.6`__cxa_throw
    frame #1: 0x00007fffafca8b9b libc10.so`c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 235
    frame #2: 0x00007fffba398063 libtorch_cpu.so`at::SparseCsrTensorImpl::strides_custom() const + 1107
    frame #3: 0x00007fffba3b981a libtorch_cpu.so`at::TensorBase::strides() const + 42
    frame #4: 0x00007fffbaa98e27 libtorch_cpu.so`at::native::expand(at::Tensor const&, c10::ArrayRef<long>, bool) + 103
    frame #5: 0x00007fffbb7d66ad libtorch_cpu.so`at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeExplicitAutograd__expand(at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 77
    frame #6: 0x00007fffbb7d66fc libtorch_cpu.so`c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool), &(at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeExplicitAutograd__expand(at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool))>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool> >, at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 44
    frame #7: 0x00007fffbb406f59 libtorch_cpu.so`at::_ops::expand::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 297
    frame #8: 0x00007fffbd45ea77 libtorch_cpu.so`torch::ADInplaceOrView::(anonymous namespace)::expand(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 135
    frame #9: 0x00007fffbd45eeef libtorch_cpu.so`c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool), &(torch::ADInplaceOrView::(anonymous namespace)::expand(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool))>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 47
    frame #10: 0x00007fffbb406f59 libtorch_cpu.so`at::_ops::expand::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 297
    frame #11: 0x00007fffbcd537f5 libtorch_cpu.so`torch::autograd::VariableType::(anonymous namespace)::expand(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 581
    frame #12: 0x00007fffbcd53d2f libtorch_cpu.so`c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool), &(torch::autograd::VariableType::(anonymous namespace)::expand(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool))>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 47
    frame #13: 0x00007fffbb427dc8 libtorch_cpu.so`at::_ops::expand::call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 392
    frame #14: 0x00007fffba81dddd libtorch_cpu.so`at::native::_matmul_impl(at::Tensor&, at::Tensor const&, at::Tensor const&) + 1773
    frame #15: 0x00007fffba81ed58 libtorch_cpu.so`at::native::matmul(at::Tensor const&, at::Tensor const&) + 88
    frame #16: 0x00007fffbb9a42a3 libtorch_cpu.so`c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&), &(at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeImplicitAutograd__matmul(at::Tensor const&, at::Tensor const&))>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&> >, at::Tensor (at::Tensor const&, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) + 35
    frame #17: 0x00007fffbb5446fb libtorch_cpu.so`at::_ops::matmul::call(at::Tensor const&, at::Tensor const&) + 219

malfet commented 1 year ago

> @pearu - can you take a look? Looks like we might need a smoke test for this.

@cpuhrsch come on, let me have some fun with it :)

pearu commented 1 year ago

> Note that for me, on latest master, the first matmul fails with RuntimeError: Sparse CSR tensors do not have strides.

This typically means that the backward machinery has kicked in; calling detach often helps.

malfet commented 1 year ago

@pearu do you have any recent examples of such behavior? "Cannot set version_counter for inference tensor" is indeed the backward path being unhappy about a grad tensor becoming a no-grad one.

pearu commented 1 year ago

@malfet not really an answer but here's hopefully a useful hint:

>>> x = torch.randn((4,8,5),dtype=torch.float)
>>> M = torch.Tensor.to_sparse_csr(torch.eye(8)).requires_grad_(False)
>>> y = torch.matmul(M,x);
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Sparse CSR tensors do not have strides
>>> x = torch.randn((4,8,5),dtype=torch.float)
>>> M = torch.Tensor.to_sparse_csr(torch.eye(8)).requires_grad_(True)
>>> y = torch.matmul(M,x);

malfet commented 1 year ago

The change in behavior is caused by https://github.com/pytorch/pytorch/commit/6871665a973b33aedd0376294645e217978d1495 / https://github.com/pytorch/pytorch/pull/97355

And my favorite one-line reproducer:

python -c "import torch;torch.matmul(torch.eye(8).to_sparse_csr(),torch.rand(4,8,5))"

I guess the fix is to say that sparse tensors are not expandable...
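
In the meantime, one way to avoid going through expand on the sparse operand is to fold the batch dimension of the dense tensor into extra columns and do a single 2-D product. The sketch below is an illustration only: the function name csr_matmul_batched is made up, it assumes torch.mm accepts a CSR left operand on the installed build, and it has not been checked under torch.inference_mode().

```python
import torch

def csr_matmul_batched(M, x):
    """Compute y[b] = M @ x[b] for a 2-D sparse M (n, k) and a dense x (B, k, m)."""
    B, k, m = x.shape
    n = M.shape[0]
    # (B, k, m) -> (k, B*m): every batch entry becomes a block of extra columns.
    x2d = x.permute(1, 0, 2).reshape(k, B * m)
    # Single 2-D sparse @ dense product; M itself is never broadcast.
    y2d = torch.mm(M, x2d)
    # (n, B*m) -> (B, n, m)
    return y2d.reshape(n, B, m).permute(1, 0, 2)

x = torch.randn(4, 8, 5)
M = torch.eye(8).to_sparse_csr()
y = csr_matmul_batched(M, x)
print(y.shape, torch.allclose(y, x))
```

For a large, very sparse matrix this keeps M in CSR form and replaces the batched call with one product, at the cost of a permuted copy of x.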