Build Failure with candle-kernels: nvidia-smi Not Found in Docker Environment, Even Though It Is Available #2105

Pox-here commented 3 weeks ago

I'm experiencing a build failure with the candle-kernels library when trying to run an AI application in a Docker environment. The error message indicates that the custom build command failed because nvidia-smi could not be found. However, I'm using the nvidia/cuda:12.3.2-devel-ubuntu22.04 Docker image, and I can confirm that the nvidia-smi command and the GPU are operational. The same code runs as intended on the GPU in the host environment.

Error Output:

Compiling candle-kernels v0.5.0 (
error: failed to run custom build command for `candle-kernels v0.5.0 (`
Caused by:
  process didn't exit successfully: `/target/release/build/candle-kernels-1f44ba2b45c1de35/build-script-build` (exit status: 101)
  --- stdout
  cargo:info=["/usr", "/usr/local/cuda", "/opt/cuda", "/usr/lib/cuda", "C:/Program Files/NVIDIA GPU Computing Toolkit", "C:/CUDA"]
  --- stderr
  thread 'main' panicked at /root/.cargo/registry/src/
  `nvidia-smi` failed. Ensure that you have CUDA installed and that `nvidia-smi` is in your PATH.: Os { code: 2, kind: NotFound, message: "No such file or directory" }
  note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Has anyone experienced a similar issue, or does anyone have a working Docker setup running NVIDIA GPU inference with Candle?

Pox-here commented 3 weeks ago

Fixed. I spent two days trying to set the correct paths without success; it was finally resolved by modifying certain CUDA paths further. Closing since it's not candle/lib-related.

sidharthrajaram commented 2 weeks ago

@Pox-here, I'm facing the same issue. What was the fix?

andrenatal commented 2 weeks ago

@Pox-here I'm also having the same problem. Multiple articles on HF and nvidia-docker claim that GPUs are not accessible at build time. I searched the entire image for nvidia-smi during the build and found nothing. How did you solve it, given that bindgen_cuda clearly looks for nvidia-smi? Would you mind sharing?



sidharthrajaram commented 2 weeks ago

A workaround for the build-time nvidia-smi call seems to be setting the CUDA_COMPUTE_CAP environment variable to your GPU's compute capability. This resolved the issue for me.
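As a rough illustration of this workaround, the variable can be passed as a build argument so that the build does not depend on nvidia-smi being available. This is a sketch rather than a tested setup: the value 86 (for a GPU with compute capability 8.6) is only an example, and the ARG default and the bare `cargo build --release` line are assumptions. On a machine where the driver is available, `nvidia-smi --query-gpu=compute_cap --format=csv` reports the value; the dot is dropped, so 8.6 becomes 86.

```dockerfile
# Sketch: supply the compute capability at build time instead of querying nvidia-smi.
# 86 is an example value (compute capability 8.6); replace it with your GPU's value.
ARG CUDA_COMPUTE_CAP=86
ENV CUDA_COMPUTE_CAP=${CUDA_COMPUTE_CAP}

# The build step now reads CUDA_COMPUTE_CAP from the environment.
RUN cargo build --release
```

The default can be overridden per build, for example with `docker build --build-arg CUDA_COMPUTE_CAP=80 .`, which keeps the GPU-specific value out of the Dockerfile.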

Based on:

Pox-here commented 1 week ago

@andrenatal @sidharthrajaram I see you added a workaround, great! I solved this by running the build as root. Here is a quickly put-together example Dockerfile that runs the candle AI models on the host GPU in Docker and avoids the nvidia-smi not-in-PATH issue. It is ONLY for dev/testing purposes; please adjust and improve it for your actual use case :)

# Use the NVIDIA CUDA base image
FROM nvidia/cuda:12.3.2-devel-ubuntu22.04

# Build arguments passed in by docker compose (defaults mirror the compose file)
ARG USER_NAME=model_hosting_user
ARG USER_ID=1000

# Set environment variables for the user
# Assumption: the user's home directory is /home/${USER_NAME}
ENV HOME="/home/${USER_NAME}"
ENV PATH="${HOME}/.local/bin:${PATH}"

# Install system dependencies
RUN apt-get update -qq \
    && apt-get install -qq -y vim gcc g++ curl git build-essential libssl-dev \
    && rm -rf /var/lib/apt/lists/*

# Create a non-root user and assign the user to the system group
RUN groupadd --system --gid ${USER_ID} ${USER_NAME} \
    && useradd --system -m --no-log-init --home-dir ${HOME} --uid ${USER_ID} --gid ${USER_NAME} ${USER_NAME}

# Set ownership of necessary directories
RUN mkdir -p /app /tmp \
    && chown -R ${USER_NAME}:${USER_NAME} ${HOME} /app /tmp

# Switch to the non-root user
USER ${USER_NAME}

# Install Rust with stable toolchain
# Assumption: the installer is fetched from the standard rustup URL
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y --default-toolchain stable
ENV PATH="${HOME}/.cargo/bin:${PATH}"

# Switch back to root for copying and building the application
USER root

# Copy the source code to the /app directory and set ownership
COPY . /app/
RUN chown -R ${USER_NAME}:${USER_NAME} /app

# Assumption: the Cargo project root is /app
WORKDIR /app

# Build the application using Cargo
RUN cargo build --release --bin inference_server

# Define the default command for running the server
CMD ["./target/release/inference_server"]

A simple Docker Compose file to go with it:

version: "3.8"

x-app-template: &APP_TEMPLATE
  user: "${USER_ID:-1000}"
  hostname: "${HOST_NAME:-model_hosting_user}"
  image: model_hosting
  build:
    context: .
    dockerfile: ./Dockerfile
    args:
      USER_NAME: "${USER_NAME:-model_hosting_user}"
      USER_ID: "${USER_ID:-1000}"
  volumes:
    - ./:/app/
  ipc: host
  init: true

services:
  # Assumption: the service name is a placeholder
  model_hosting:
    <<: *APP_TEMPLATE
    container_name: model_hosting_container
    ports:
      - "8443:8443"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    profiles:
      - dev

andrenatal commented 1 week ago

Thanks for the heads-up @Pox-here !