build error and runtime error

microsoft / chunk-attention

MIT License

Environment:

- Device: NVIDIA GeForce RTX 4090 D
- CUDA: Cuda compilation tools, release 12.2, V12.2.91 (Build cuda_12.2.r12.2/compiler.32965470_0)
- gcc: gcc version 13.2.0 (Ubuntu 13.2.0-23ubuntu4)
- torch: 2.3.0
- Python: 3.10

When running the example code, the call f = host.predict_async(prompt_tokens, 32) fails with the following error:

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasGemmEx(handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
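In case it helps with reproducing this, the version details above can be gathered with a short script. This is only a minimal sketch using the Python standard library; the helper names `tool_version` and `collect_versions` are made up for illustration and are not part of chunk-attention, and the set of commands queried is an assumption about what is useful to report.

```python
import platform
import shutil
import subprocess
from importlib import metadata


def tool_version(command):
    """Run a version command and return its output, or a short status string."""
    # Hypothetical helper: report "not found" instead of raising if the tool is missing.
    if shutil.which(command[0]) is None:
        return "not found"
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=10, check=False
        )
    except (OSError, subprocess.SubprocessError) as exc:
        return f"failed: {exc}"
    output = (result.stdout or result.stderr).strip()
    return output if output else "no output"


def collect_versions():
    """Collect the toolchain versions relevant to a CUDA build report."""
    try:
        # Reads the installed package metadata without importing torch itself.
        torch_version = metadata.version("torch")
    except metadata.PackageNotFoundError:
        torch_version = "not installed"
    return {
        "python": platform.python_version(),
        "torch": torch_version,
        "nvcc": tool_version(["nvcc", "--version"]),
        "gcc": tool_version(["gcc", "--version"]),
        "gpu": tool_version(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"]
        ),
    }


if __name__ == "__main__":
    for name, value in collect_versions().items():
        print(f"{name}: {value}")
```

Tools that are not installed are reported as "not found" rather than causing the script to stop, so the output can be pasted into a report as-is.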

build error and runtime error #2