oracle-samples / oci-data-science-ai-samples

This repo contains a series of tutorials and code examples highlighting different features of the OCI Data Science and AI services, along with a release vehicle for experimental programs.
Universal Permissive License v1.0

Error with CUDA image #363

Open luissimoesneom opened 9 months ago

luissimoesneom commented 9 months ago

We tried to create a new Docker container based on the Docker image used in the vLLM example in this repo, and we got the error below:

Errors occurred while bootstrapping the Model Deployment: Start Container Error: unable to start container: Error response from daemon: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: requirement error: unsatisfied condition: cuda>=11.8, please update your driver to a newer version, or use an earlier cuda container: unknown

Used image: FROM nvidia/cuda:11.8.0-base-ubuntu20.04 as base

What exactly is the issue? How is the Model Deployment example using vLLM supposed to work?

Thank you

RodrigoDiasDeOliveira commented 2 weeks ago

I thought this might help. To resolve this issue, you have two main options:

1. Update your NVIDIA driver: Install the latest NVIDIA driver on your host system that supports CUDA 11.8 or later.
2. Use an earlier CUDA container: If updating the driver is not possible, modify your Dockerfile to use an earlier CUDA version that's compatible with your current driver.

Steps to Resolve

1. Check your current NVIDIA driver version:

    nvidia-smi

2. If updating the driver, visit the NVIDIA driver download page and install the latest version for your GPU.
3. If using an earlier CUDA version, modify your Dockerfile (or use an even earlier version if needed):

    FROM nvidia/cuda:11.7.0-base-ubuntu20.04 as base

4. Ensure you have the NVIDIA Container Toolkit installed:

    sudo apt-get update
    sudo apt-get install -y nvidia-container-toolkit
    sudo systemctl restart docker

5. Rebuild your Docker image and try running the container again.

Additional Troubleshooting

If you continue to face issues:

- Check the compatibility matrix between CUDA versions and NVIDIA driver versions.
- Verify that your GPU supports the CUDA version you're trying to use.
- Ensure that the NVIDIA Container Toolkit is correctly installed and configured.
- Try running a simple CUDA container to isolate whether the issue is specific to your vLLM setup or a general CUDA/Docker configuration problem:

    docker run --gpus all nvidia/cuda:11.8.0-base-ubuntu20.04 nvidia-smi
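As a rough way to compare the driver against the image before rebuilding, the check can be sketched in Python. This is only an illustration, not part of the repo's vLLM example: the function names are hypothetical, it assumes the image tag starts with the CUDA version (as in nvidia/cuda:11.8.0-base-ubuntu20.04), and it assumes the nvidia-smi header contains a "CUDA Version: X.Y" field giving the highest CUDA version the driver supports. It does not model the extra driver-branch exceptions that some CUDA images declare.

```python
import re
import subprocess
from typing import Tuple


def cuda_version_from_image_tag(image: str) -> Tuple[int, int]:
    """Extract (major, minor) from a tag such as 'nvidia/cuda:11.8.0-base-ubuntu20.04'."""
    match = re.search(r":(\d+)\.(\d+)", image)
    if match is None:
        raise ValueError(f"no CUDA version found in image tag: {image!r}")
    return int(match.group(1)), int(match.group(2))


def cuda_version_from_nvidia_smi(output: str) -> Tuple[int, int]:
    """Extract (major, minor) from the 'CUDA Version: X.Y' field of the nvidia-smi header."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError("no 'CUDA Version' field found in nvidia-smi output")
    return int(match.group(1)), int(match.group(2))


def driver_satisfies_image(image: str, smi_output: str) -> bool:
    """True if the driver's reported CUDA version is at least the image's CUDA version."""
    return cuda_version_from_nvidia_smi(smi_output) >= cuda_version_from_image_tag(image)


def check_host(image: str) -> bool:
    """Run nvidia-smi on this machine and compare it against the given image tag."""
    smi_output = subprocess.run(
        ["nvidia-smi"], capture_output=True, text=True, check=True
    ).stdout
    return driver_satisfies_image(image, smi_output)
```

Calling check_host("nvidia/cuda:11.8.0-base-ubuntu20.04") on the GPU host returns False when the driver reports a CUDA version lower than 11.8, which is the situation the "unsatisfied condition: cuda>=11.8" message describes. In that case either a newer driver or an image tag at or below the reported version is needed.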