Calling cublasDgetrfBatched failed with pycuda

Description

I may well be doing something wrong, but I am having a hard time understanding the expected form of the input matrix on which the LU decomposition is performed. My attempt to call the low-level interface cublas.cublasDgetrfBatched fails in the small example below.

import numpy as np
import pycuda.autoinit
import skcuda.cublas as cublas
import pycuda.gpuarray as gpuarray

N = 10
N_BATCH = 1  # only 1 matrix to be decomposed
A_SHAPE = (N, N)

a = np.random.rand(*A_SHAPE).astype(np.float64)
a_batch = np.expand_dims(a, axis=0)

a_gpu = gpuarray.to_gpu(a_batch.T.copy())  # transpose a to follow "F" order
p_gpu = gpuarray.zeros(N * N_BATCH, np.int32)
info_gpu = gpuarray.zeros(N_BATCH, np.int32)

cublas_handle = cublas.cublasCreate()
cublas.cublasDgetrfBatched(
    cublas_handle,
    N,
    a_gpu.gpudata,
    N,
    p_gpu.gpudata,
    info_gpu.gpudata,
    N_BATCH,
)

cublas.cublasDestroy(cublas_handle)
print(a_gpu)

Problem

Running the script above prints the following instead of the factorized matrix:

PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuMemFree failed: an illegal memory access was encountered
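
A possible cause, offered as a sketch rather than a confirmed diagnosis: in the cuBLAS C API the matrix argument of cublasDgetrfBatched is an array of device pointers (one pointer per matrix), not a pointer to the matrix data itself. Passing a_gpu.gpudata directly would make cuBLAS read matrix values as addresses, which is consistent with the illegal memory access above. The helper names below (to_column_major_batch, batch_pointer_array, lu_batched) are hypothetical and not part of scikit-cuda. The first two are plain NumPy; lu_batched needs a CUDA device and assumes a float64 batch of shape (n_batch, n, n).

```python
import numpy as np


def to_column_major_batch(a_batch):
    """Return a C-contiguous copy in which each matrix is stored column-major.

    a_batch has shape (n_batch, n, n). Swapping the last two axes and copying
    lays every matrix out column by column, the layout cuBLAS reads.
    """
    return np.ascontiguousarray(np.swapaxes(a_batch, 1, 2))


def batch_pointer_array(base_ptr, n_batch, matrix_nbytes):
    """Addresses of n_batch matrices stored back to back from base_ptr."""
    offsets = np.arange(n_batch, dtype=np.uint64) * np.uint64(matrix_nbytes)
    return np.uint64(base_ptr) + offsets


def lu_batched(a_batch):
    """Hypothetical wrapper; requires a CUDA GPU, PyCUDA and scikit-cuda."""
    import pycuda.autoinit  # noqa: F401  (creates the CUDA context)
    import pycuda.gpuarray as gpuarray
    import skcuda.cublas as cublas

    n_batch, n, _ = a_batch.shape
    a_gpu = gpuarray.to_gpu(to_column_major_batch(a_batch))
    # Device-resident array holding one pointer per matrix in the batch.
    ptrs_gpu = gpuarray.to_gpu(
        batch_pointer_array(a_gpu.ptr, n_batch, n * n * a_gpu.dtype.itemsize)
    )
    p_gpu = gpuarray.zeros(n * n_batch, np.int32)
    info_gpu = gpuarray.zeros(n_batch, np.int32)

    handle = cublas.cublasCreate()
    try:
        cublas.cublasDgetrfBatched(
            handle, n, ptrs_gpu.gpudata, n,
            p_gpu.gpudata, info_gpu.gpudata, n_batch,
        )
    finally:
        cublas.cublasDestroy(handle)
    # Each returned matrix is still in column-major storage.
    return a_gpu.get(), p_gpu.get(), info_gpu.get()
```

With the script from the description, lu_batched(a_batch) would replace the direct cublasDgetrfBatched call. The pointer array is built on the host and copied to the device because cuBLAS dereferences it on the GPU.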

Environment

OS platform: Ubuntu 18.04
Python version: 3.7.6
CUDA version: 11.2
PyCUDA version: 2018.1.1
scikit-cuda version: 0.5.3

lebedov / scikit-cuda

Calling cublasDgetrfBatched failed with pycuda #332