pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

Using AOTInductor in C++ crashes if the model uses torch.linalg.eigh with CUDA #138601

Open HanatoK opened 2 weeks ago

HanatoK commented 2 weeks ago

🐛 Describe the bug

torch.linalg.eigh crashes when the model is compiled into an AOTInductor shared library and run from the C++ side. The example Python code is as follows:

import os
import torch

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(10, 16)
        self.relu = torch.nn.ReLU()
        self.fc2 = torch.nn.Linear(16, 1)
        self.sigmoid = torch.nn.Sigmoid()

    def forward(self, x):
        y = x.T @ x
        v, w = torch.linalg.eigh(y)
        y = self.fc1(w)
        y = self.relu(y)
        y = self.fc2(y)
        y = 0.5 * (self.sigmoid(y) + 1.0)
        return y

with torch.no_grad():
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = Model().to(device=device)
    example_inputs = (torch.randn((8, 10), device=device, requires_grad=False),)
    batch_dim = torch.export.Dim("batch", min=1, max=1024)
    so_path = torch._export.aot_compile(
        model,
        example_inputs,
        # Specify the first dimension of the input x as dynamic
        dynamic_shapes={"x": {0: batch_dim}},
        # Specify the generated shared library path
        options={"aot_inductor.output_path": os.path.join(os.getcwd(), "model2.so")},
    )
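For what it's worth, the op itself behaves correctly outside of AOTInductor. A minimal eager-mode sanity check (my addition, not part of the original repro; it mirrors the `y = x.T @ x` / `eigh` step of the model above and falls back to CPU when CUDA is unavailable):

```python
import torch

# Eager-mode check: torch.linalg.eigh on the same kind of input the model sees.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(8, 10, device=device)
y = x.T @ x                   # (10, 10), symmetric positive semi-definite
v, w = torch.linalg.eigh(y)   # v: eigenvalues (10,), w: eigenvectors (10, 10)

# Reconstruct y from its eigendecomposition to confirm the result is sane.
recon = w @ torch.diag(v) @ w.T
print(torch.allclose(recon, y, atol=1e-3))
```

This runs without error on both CPU and CUDA in eager mode, so the crash appears specific to the AOTInductor C++ path rather than to `eigh` itself.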

The C++ code to load the model is

#include <iostream>
#include <vector>

#include <torch/torch.h>
#include <torch/csrc/inductor/aoti_runner/model_container_runner_cuda.h>

int main() {
    c10::InferenceMode mode;

    torch::inductor::AOTIModelContainerRunnerCuda runner("./model2.so");
    std::vector<torch::Tensor> inputs = {torch::randn({8, 10}, at::kCUDA)};
    std::vector<torch::Tensor> outputs = runner.run(inputs);
    std::cout << "Result from the first inference:"<< std::endl;
    std::cout << outputs[0] << std::endl;

    return 0;
}

GDB backtrace:

#0  0x00007fffebfcc20d in aoti_torch_proxy_executor_call_function () from /usr/local/lib/libtorch_cpu.so
#1  0x00007ffff7df097d in torch::aot_inductor::AOTInductorModel::run_impl(AtenTensorOpaque**, AtenTensorOpaque**, CUstream_st*, AOTIProxyExecutorOpaque*) () from ./model2.so
#2  0x00007ffff7dfeaa3 in torch::aot_inductor::AOTInductorModelContainer::run(AtenTensorOpaque**, AtenTensorOpaque**, CUstream_st*, AOTIProxyExecutorOpaque*) () from ./model2.so
#3  0x00007ffff7df209d in AOTInductorModelContainerRun () from ./model2.so
#4  0x00007fffebfbe611 in torch::inductor::AOTIModelContainerRunner::run(std::vector<at::Tensor, std::allocator<at::Tensor> >&, AOTInductorStreamOpaque*) () from /usr/local/lib/libtorch_cpu.so
#5  0x00007fff700486dd in torch::inductor::AOTIModelContainerRunnerCuda::run(std::vector<at::Tensor, std::allocator<at::Tensor> >&) () from /usr/local/lib/libtorch_cuda.so
#6  0x00000000004049b1 in main ()

Versions

Collecting environment information...
PyTorch version: 2.6.0a0+gitc6609ec
Is debug build: False
CUDA used to build PyTorch: 12.5
ROCM used to build PyTorch: N/A

OS: openSUSE Tumbleweed (x86_64)
GCC version: (SUSE Linux) 14.2.1 20241007 [revision 4af44f2cf7d281f3e4f3957efce10e8b2ccb2ad3]
Clang version: 18.1.8
CMake version: version 3.30.4
Libc version: glibc-2.40

Python version: 3.11.10 (main, Sep 09 2024, 17:03:08) [GCC] (64-bit runtime)
Python platform: Linux-6.11.3-1-default-x86_64-with-glibc2.40
Is CUDA available: True
CUDA runtime version: 12.5.82
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3060 Laptop GPU
Nvidia driver version: 550.120
cuDNN version: Probably one of the following:
/usr/local/cuda-12.5/targets/x86_64-linux/lib/libcudnn.so.9.3.0
/usr/local/cuda-12.5/targets/x86_64-linux/lib/libcudnn_adv.so.9.3.0
/usr/local/cuda-12.5/targets/x86_64-linux/lib/libcudnn_cnn.so.9.3.0
/usr/local/cuda-12.5/targets/x86_64-linux/lib/libcudnn_engines_precompiled.so.9.3.0
/usr/local/cuda-12.5/targets/x86_64-linux/lib/libcudnn_engines_runtime_compiled.so.9.3.0
/usr/local/cuda-12.5/targets/x86_64-linux/lib/libcudnn_graph.so.9.3.0
/usr/local/cuda-12.5/targets/x86_64-linux/lib/libcudnn_heuristic.so.9.3.0
/usr/local/cuda-12.5/targets/x86_64-linux/lib/libcudnn_ops.so.9.3.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 7 5800H with Radeon Graphics
CPU family: 25
Model: 80
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
Stepping: 0
Frequency boost: enabled
CPU(s) scaling MHz: 67%
CPU max MHz: 4463.0000
CPU min MHz: 400.0000
BogoMIPS: 6390.91
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap
Virtualization: AMD-V
L1d cache: 256 KiB (8 instances)
L1i cache: 256 KiB (8 instances)
L2 cache: 4 MiB (8 instances)
L3 cache: 16 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-15
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; Safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] flake8==7.1.1
[pip3] mypy_extensions==1.0.0
[pip3] numpy==2.1.1
[pip3] numpydoc==1.7.0
[pip3] torch==2.6.0a0+gitc6609ec
[pip3] triton==3.1.0
[conda] No relevant packages

cc @ezyang @chauhang @penguinwu @avikchaudhuri @gmagogsfm @zhxchen17 @tugsbayasgalan @angelayi @suo @ydwu4 @desertfire @chenyang78

HanatoK commented 2 weeks ago

AOTIModelContainerRunnerCuda crashes (with tensors on the CUDA device) but AOTIModelContainerRunnerCpu does not.
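For reference, the working CPU-side variant differs from the CUDA code above only in the runner class, header, and device. This is a sketch mirroring the original C++ repro, not the reporter's exact code; it assumes the model was recompiled for CPU into `model2_cpu.so` (the filename is made up for illustration):

```cpp
#include <iostream>
#include <vector>

#include <torch/torch.h>
#include <torch/csrc/inductor/aoti_runner/model_container_runner_cpu.h>

int main() {
    c10::InferenceMode mode;

    // Same API as AOTIModelContainerRunnerCuda, but no crash is observed
    // when the model and inputs live on the CPU.
    torch::inductor::AOTIModelContainerRunnerCpu runner("./model2_cpu.so");
    std::vector<torch::Tensor> inputs = {torch::randn({8, 10}, at::kCPU)};
    std::vector<torch::Tensor> outputs = runner.run(inputs);
    std::cout << outputs[0] << std::endl;

    return 0;
}
```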

desertfire commented 1 week ago

@angelayi, can you help take a look? The backtrace points to aoti_torch_proxy_executor_call_function.