google-research / multinerf

A Code Release for Mip-NeRF 360, Ref-NeRF, and RawNeRF
Apache License 2.0

Between "No GPU/TPU found, falling back to CPU." and "failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error" #157

Open sbisw002 opened 3 months ago

sbisw002 commented 3 months ago

I am trying to get JAX to use the GPU on my new ThinkPad, which has an NVIDIA RTX 4000 Ada (12 GB) graphics card.

No matter which combination of CUDA driver (12.4), cuDNN, jax, and jaxlib I try, the best result I get is either (a) "No GPU/TPU found, falling back to CPU." or (b) "failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error".

When I run the Data Sampler section of https://github.com/PredictiveIntelligenceLab/ImprovedDeepONets/blob/main/Stokes/PI_DeepONet_Stokes.ipynb, I get errors like the following:

a) Installation: pip install jaxlib==0.4.7+cuda12.cudnn88 jax==0.4.7 -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

Run: runfile('/home/saumya/NeuralN/Op Net/ImprovedDeepONets/Stokes/PI_DeepONet_Stokes-Copy1', wdir='/home/saumya/NeuralN/Op Net/ImprovedDeepONets/Stokes')
2024-03-19 11:48:27.682846: I external/xla/xla/service/service.cc:168] XLA service 0x8dd95c0 initialized for platform Interpreter (this does not guarantee that XLA will be used). Devices:
2024-03-19 11:48:27.682867: I external/xla/xla/service/service.cc:176] StreamExecutor device (0): Interpreter,
2024-03-19 11:48:27.689135: I external/xla/xla/pjrt/tfrt_cpu_pjrt_client.cc:218] TfrtCpuClient created.
2024-03-19 11:48:29.450971: E external/xla/xla/stream_executor/cuda/cuda_driver.cc:268] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2024-03-19 11:48:29.450988: I external/xla/xla/stream_executor/cuda/cuda_diagnostics.cc:168] retrieving CUDA diagnostic information for host: saumya-TP-GPU
2024-03-19 11:48:29.450991: I external/xla/xla/stream_executor/cuda/cuda_diagnostics.cc:175] hostname: saumya-TP-GPU
2024-03-19 11:48:29.451052: I external/xla/xla/stream_executor/cuda/cuda_diagnostics.cc:199] libcuda reported version is: 550.54.14
2024-03-19 11:48:29.451064: I external/xla/xla/stream_executor/cuda/cuda_diagnostics.cc:203] kernel reported version is: NOT_FOUND: could not find kernel module information in driver version file contents: "NVRM version: NVIDIA UNIX Open Kernel Module for x86_64 550.54.14 Release Build (dvs-builder@U16-A24-2-2) Thu Feb 22 01:44:50 UTC 2024
GCC version: gcc version 12.3.0 (Ubuntu 12.3.0-1ubuntu1~22.04)
"
No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)

b) Installation: pip install jaxlib==0.4.9+cuda12.cudnn88 jax==0.4.9 -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

Run:
2024-03-19 12:10:31.130411: I external/xla/xla/service/service.cc:168] XLA service 0x6a1d490 initialized for platform Interpreter (this does not guarantee that XLA will be used). Devices:
2024-03-19 12:10:31.130427: I external/xla/xla/service/service.cc:176] StreamExecutor device (0): Interpreter,
2024-03-19 12:10:31.134477: I external/xla/xla/pjrt/tfrt_cpu_pjrt_client.cc:433] TfrtCpuClient created.
2024-03-19 12:10:50.428065: E external/xla/xla/stream_executor/cuda/cuda_driver.cc:268] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2024-03-19 12:10:50.428083: I external/xla/xla/stream_executor/cuda/cuda_diagnostics.cc:168] retrieving CUDA diagnostic information for host: saumya-TP-GPU
2024-03-19 12:10:50.428086: I external/xla/xla/stream_executor/cuda/cuda_diagnostics.cc:175] hostname: saumya-TP-GPU
2024-03-19 12:10:50.428143: I external/xla/xla/stream_executor/cuda/cuda_diagnostics.cc:199] libcuda reported version is: 550.54.14
2024-03-19 12:10:50.428156: I external/xla/xla/stream_executor/cuda/cuda_diagnostics.cc:203] kernel reported version is: NOT_FOUND: could not find kernel module information in driver version file contents: "NVRM version: NVIDIA UNIX Open Kernel Module for x86_64 550.54.14 Release Build (dvs-builder@U16-A24-2-2) Thu Feb 22 01:44:50 UTC 2024
GCC version: gcc version 12.3.0 (Ubuntu 12.3.0-1ubuntu1~22.04)
"
No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)

My system:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Feb_27_16:19:38_PST_2024
Cuda compilation tools, release 12.4, V12.4.99
Build cuda_12.4.r12.4/compiler.33961263_0

$ nvidia-smi
Tue Mar 19 12:21:40 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  ERR!                           Off |   00000000:01:00.0 N/A |                  N/A |
|ERR!  ERR!   ERR!            N/A /  N/A  |      14MiB /  12282MiB |     N/A      Default |
|                                         |                        |                 ERR! |
+-----------------------------------------------------------------------------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
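The `ERR!` fields in the `nvidia-smi` table suggest that the driver tooling itself is having trouble querying the GPU, which would be a problem below JAX rather than in the jax/jaxlib versions. As a rough sketch for flagging this automatically (the function names `find_error_lines` and `run_nvidia_smi` are made up for illustration, and the check is nothing more than a text search for `ERR!`):

```python
import shutil
import subprocess
from typing import List, Optional


def find_error_lines(smi_output: str) -> List[str]:
    """Return the lines of nvidia-smi text output that contain an ERR! field."""
    return [line for line in smi_output.splitlines() if "ERR!" in line]


def run_nvidia_smi() -> Optional[str]:
    """Return nvidia-smi's stdout, or None if the tool is not on PATH."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi"], capture_output=True, text=True, check=False
    )
    return result.stdout
```

Calling `find_error_lines(run_nvidia_smi() or "")` returns an empty list when no field is reported as `ERR!`; any returned lines point at the driver or hardware state as the first thing to investigate.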

Python version:

$ whereis python | tr ' ' '\n' | grep ^/ | sort
/home/saumya/anaconda3/envs/OpNet/bin/python
$ python --version && python3 --version
Python 3.9.18
Python 3.9.18
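For reference, a minimal sketch of checking which backend JAX ends up selecting in this environment, independent of the notebook. The helper name `describe_jax_backend` is hypothetical; it only uses `jax.devices()`, `jax.default_backend()`, and `jax.__version__`, and it catches initialization failures rather than raising:

```python
import importlib.util


def describe_jax_backend() -> str:
    """Summarize the backend JAX selected, without raising if JAX is absent."""
    if importlib.util.find_spec("jax") is None:
        return "jax is not installed in this environment"
    try:
        import jax

        # jax.devices() triggers backend initialization (GPU, TPU, or CPU).
        devices = jax.devices()
        return (
            f"jax {jax.__version__}: "
            f"backend={jax.default_backend()}, devices={devices}"
        )
    except Exception as exc:  # report, rather than crash, on init failure
        return f"jax failed to initialize a backend: {exc!r}"


print(describe_jax_backend())
```

When JAX falls back to the CPU, as in the logs above, the reported backend is `cpu`; a working CUDA setup would list GPU devices instead.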