pyscf / gpu4pyscf

A plugin to use Nvidia GPU in PySCF package
GNU General Public License v3.0

CUBLAS_STATUS_EXECUTION_FAILED #124

Open GiacomoDG96 opened 8 months ago

GiacomoDG96 commented 8 months ago

Hi, I am trying to replicate the example https://github.com/pyscf/gpu4pyscf/blob/master/examples/00-h2o.py using a benzene molecule instead of water. I get the same error when I run the https://github.com/pyscf/gpu4pyscf/blob/master/examples/07-transition_state.py example with the molecule defined in that file.

The error that I obtain is:

#########################################################################################
Traceback (most recent call last):
  File "/home/soralakers96/anaconda3/envs/trail_actmol/lib/python3.12/site-packages/gpu4pyscf/df/df_jk.py", line 63, in init_workflow
    rks.initialize_grids(mf, mf.mol, dm0)
  File "/home/soralakers96/anaconda3/envs/trail_actmol/lib/python3.12/site-packages/gpu4pyscf/dft/rks.py", line 83, in initialize_grids
    ks.grids = prune_small_rhogrids(ks, ks.mol, dm, ks.grids)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/soralakers96/anaconda3/envs/trail_actmol/lib/python3.12/site-packages/gpu4pyscf/dft/rks.py", line 39, in prune_small_rhogrids
    rho = ks._numint.get_rho(mol, dm, grids, ks.max_memory)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/soralakers96/anaconda3/envs/trail_actmol/lib/python3.12/site-packages/gpu4pyscf/dft/numint.py", line 721, in get_rho
    rho[p0:p1] = eval_rho2(mol, ao_mask, mo_coeff_mask, mo_occ, None, 'LDA', with_lapl)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/soralakers96/anaconda3/envs/trail_actmol/lib/python3.12/site-packages/gpu4pyscf/dft/numint.py", line 200, in eval_rho2
    c0 = _dot_ao_dm(mol, ao, cpos, non0tab, shls_slice, ao_loc)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/soralakers96/anaconda3/envs/trail_actmol/lib/python3.12/site-packages/gpu4pyscf/dft/numint.py", line 1476, in _dot_ao_dm
    return cupy.dot(dm.T, ao)
           ^^^^^^^^^^^^^^^^^^
  File "/home/soralakers96/anaconda3/envs/trail_actmol/lib/python3.12/site-packages/cupy/linalg/_product.py", line 63, in dot
    return a.dot(b, out)
           ^^^^^^^^^^^^^
  File "cupy/_core/core.pyx", line 1757, in cupy._core.core._ndarray_base.dot
  File "cupy/_core/_routines_linalg.pyx", line 536, in cupy._core._routines_linalg.dot
  File "cupy/_core/_routines_linalg.pyx", line 626, in cupy._core._routines_linalg.tensordot_core
  File "cupy/_core/_routines_linalg.pyx", line 763, in cupy._core._routines_linalg.tensordot_core_v11
  File "cupy_backends/cuda/libs/cublas.pyx", line 1426, in cupy_backends.cuda.libs.cublas.gemmEx
  File "cupy_backends/cuda/libs/cublas.pyx", line 1454, in cupy_backends.cuda.libs.cublas.gemmEx
  File "cupy_backends/cuda/libs/cublas.pyx", line 438, in cupy_backends.cuda.libs.cublas.check_status
cupy_backends.cuda.libs.cublas.CUBLASError: CUBLAS_STATUS_NOT_INITIALIZED

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/soralakers96/anaconda3/envs/trail_actmol/lib/python3.12/site-packages/pyscf/lib/misc.py", line 1104, in __exit__
    handler.result()
  File "/home/soralakers96/anaconda3/envs/trail_actmol/lib/python3.12/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/home/soralakers96/anaconda3/envs/trail_actmol/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/home/soralakers96/anaconda3/envs/trail_actmol/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/soralakers96/anaconda3/envs/trail_actmol/lib/python3.12/site-packages/gpu4pyscf/df/df_jk.py", line 43, in build_df
    mf.with_df.build()
  File "/home/soralakers96/anaconda3/envs/trail_actmol/lib/python3.12/site-packages/gpu4pyscf/df/df.py", line 90, in build
    self._cderi = cholesky_eri_gpu(intopt, mol, auxmol, self.cd_low, omega=omega)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/soralakers96/anaconda3/envs/trail_actmol/lib/python3.12/site-packages/gpu4pyscf/df/df.py", line 265, in cholesky_eri_gpu
    cderi_block = solve_triangular(cd_low, ints_slices, lower=True, overwrite_b=False)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/soralakers96/anaconda3/envs/trail_actmol/lib/python3.12/site-packages/cupyx/scipy/linalg/_solve_triangular.py", line 97, in solve_triangular
    trsm(
  File "cupy_backends/cuda/libs/cublas.pyx", line 1109, in cupy_backends.cuda.libs.cublas.dtrsm
  File "cupy_backends/cuda/libs/cublas.pyx", line 1119, in cupy_backends.cuda.libs.cublas.dtrsm
  File "cupy_backends/cuda/libs/cublas.pyx", line 438, in cupy_backends.cuda.libs.cublas.check_status
cupy_backends.cuda.libs.cublas.CUBLASError: CUBLAS_STATUS_EXECUTION_FAILED

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/soralakers96/CODE/gpu4pyscf/gpu4pyscf/examples/07-transition_state.py", line 68, in <module>
    mf_GPU.kernel()
  File "/home/soralakers96/anaconda3/envs/trail_actmol/lib/python3.12/site-packages/gpu4pyscf/scf/hf.py", line 588, in scf
    _kernel(mf, mf.conv_tol, mf.conv_tol_grad,
  File "/home/soralakers96/anaconda3/envs/trail_actmol/lib/python3.12/site-packages/gpu4pyscf/scf/hf.py", line 404, in _kernel
    mf.init_workflow(dm0=dm)
  File "/home/soralakers96/anaconda3/envs/trail_actmol/lib/python3.12/site-packages/gpu4pyscf/df/df_jk.py", line 56, in init_workflow
    with lib.call_in_background(build_df) as build:
  File "/home/soralakers96/anaconda3/envs/trail_actmol/lib/python3.12/site-packages/pyscf/lib/misc.py", line 1106, in __exit__
    raise ThreadRuntimeError('Error on thread %s:\n%s' % (self, e))
pyscf.lib.misc.ThreadRuntimeError: Error on thread <pyscf.lib.misc.call_in_background object at 0x7f5772b63dd0>: CUBLAS_STATUS_EXECUTION_FAILED
########################################################################################

I am using an NVIDIA L40 with the pre-compiled package, installed with pip3 install gpu4pyscf-cuda12x.

wxj6000 commented 8 months ago

It seems that CuPy didn't find cuBLAS. Can you make sure the CUDA Toolkit is installed on your system? If it is installed, you can check whether cupy.dot works properly.
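A minimal sketch of such a check, assuming NumPy is installed alongside CuPy. The helper name dot_matches_reference is hypothetical and not part of CuPy or gpu4pyscf; it mirrors the cupy.dot(dm.T, ao) call from the traceback and compares the result against a NumPy reference.

```python
import numpy as np


def dot_matches_reference(xp, n=256, seed=0):
    """Return True if xp.dot(a.T, b) agrees with the NumPy result.

    xp is an array module with a NumPy-like API (numpy or cupy).
    The helper name and default sizes are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    a_host = rng.standard_normal((n, n))
    b_host = rng.standard_normal((n, n))
    expected = a_host.T @ b_host  # reference computed on the CPU

    # Move the inputs to the target module (a device copy for CuPy).
    a = xp.asarray(a_host)
    b = xp.asarray(b_host)
    result = xp.dot(a.T, b)  # same call pattern as in _dot_ao_dm

    # cupy.asnumpy copies back to the host; plain NumPy needs no copy.
    to_host = getattr(xp, "asnumpy", np.asarray)
    return bool(np.allclose(to_host(result), expected))


if __name__ == "__main__":
    try:
        import cupy
        print("cupy.dot OK:", dot_matches_reference(cupy))
    except Exception as exc:  # missing CuPy, missing GPU, or a cuBLAS error
        print("cupy.dot check failed:", type(exc).__name__, exc)
```

If this prints a CUBLASError, the problem is likely in the CuPy/cuBLAS setup rather than in gpu4pyscf. If it passes, as a small toy case may, the failure is more likely tied to the state of the GPU during the actual calculation.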

GiacomoDG96 commented 8 months ago

CUDA Toolkit is installed. When I run nvcc --version I obtain:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Nov__3_17:16:49_PDT_2023
Cuda compilation tools, release 12.3, V12.3.103
Build cuda_12.3.r12.3/compiler.33492891_0

I have also tried cupy.dot with a toy example and it works.

wxj6000 commented 8 months ago

@GiacomoDG96 OK, great. Possibly the GPU doesn't have enough free memory left for the cuBLAS handle. Can you try limiting the CuPy memory pool? https://docs.cupy.dev/en/stable/user_guide/memory.html#limiting-gpu-memory-usage
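A rough sketch of what limiting the pool could look like, using CuPy's default memory pool and its set_limit method as described on the linked page. The helper names gib_to_bytes and limit_cupy_pool and the 40 GiB value are illustrative assumptions, not recommendations from gpu4pyscf; pick a value that leaves some headroom on your card.

```python
def gib_to_bytes(gib):
    """Convert a size in GiB to a whole number of bytes."""
    return int(gib * 1024**3)


def limit_cupy_pool(gib):
    """Cap CuPy's default device memory pool and return the active limit.

    Hypothetical helper: call it before starting the calculation so that
    later allocations made through the pool respect the cap.
    """
    import cupy  # imported lazily so the module loads without CuPy

    pool = cupy.get_default_memory_pool()
    pool.set_limit(size=gib_to_bytes(gib))
    return pool.get_limit()


if __name__ == "__main__":
    POOL_LIMIT_GIB = 40  # assumed value for illustration only
    try:
        print("pool limit (bytes):", limit_cupy_pool(POOL_LIMIT_GIB))
    except Exception as exc:  # CuPy missing or no usable GPU
        print("could not set limit:", type(exc).__name__, exc)
```

The same page also documents the CUPY_GPU_MEMORY_LIMIT environment variable, which applies a limit without changing the script.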