huggingface / candle

Minimalist ML framework for Rust
Apache License 2.0

Build Failure with candle-kernels: nvidia-smi Not Found in Docker Environment, Even Though It Is Available #2105

Closed. Pox-here closed this issue 3 weeks ago.

Pox-here commented 3 weeks ago

I'm experiencing a build failure with the candle-kernels crate when trying to run an AI application in a Docker environment. The error message indicates that the custom build command failed because nvidia-smi is missing. However, I'm using the nvidia/cuda:12.3.2-devel-ubuntu22.04 Docker image, and I can confirm that the nvidia-smi command and the GPU are operational. The same code runs as intended on the GPU in the host environment.

Error Output:

Compiling candle-kernels v0.5.0 (https://github.com/huggingface/candle.git#2817643d)
error: failed to run custom build command for `candle-kernels v0.5.0 (https://github.com/huggingface/candle.git#2817643d)`
Caused by:
  process didn't exit successfully: `/target/release/build/candle-kernels-1f44ba2b45c1de35/build-script-build` (exit status: 101)
  --- stdout
  cargo:info=["/usr", "/usr/local/cuda", "/opt/cuda", "/usr/lib/cuda", "C:/Program Files/NVIDIA GPU Computing Toolkit", "C:/CUDA"]
  cargo:rerun-if-env-changed=CUDA_COMPUTE_CAP
  --- stderr
  thread 'main' panicked at /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/bindgen_cuda-0.1.5/src/lib.rs:489:18:
  `nvidia-smi` failed. Ensure that you have CUDA installed and that `nvidia-smi` is in your PATH.: Os { code: 2, kind: NotFound, message: "No such file or directory" }
  note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Has anyone experienced a similar issue, or does anyone have a working Docker setup for NVIDIA GPU inference with Candle?

Pox-here commented 3 weeks ago

Fixed. Even though I spent two days setting what should have been the correct paths with no luck, I resolved it by further modifying certain CUDA paths to avoid this. Closing, since it's not candle/library-related.

sidharthrajaram commented 2 weeks ago

@Pox-here , facing the same issue. What was the fix?

andrenatal commented 2 weeks ago

@Pox-here I'm also having the same problem. Multiple articles on HF and nvidia-docker state that GPUs are not accessible during build time. I searched the entire image for nvidia-smi during build time, but nothing was found. I was wondering how you solved this, since bindgen_cuda clearly looks for nvidia-smi. Would you mind sharing?

Thanks!

Reference: https://github.com/Narsil/bindgen_cuda/issues/4
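A build-time probe along these lines (a rough sketch; the echo text is just illustrative) confirms the binary is absent while docker build runs:

# Probe for nvidia-smi while `docker build` runs; GPU devices and the
# NVIDIA runtime are only injected at container runtime, so this should come up empty.
RUN which nvidia-smi || echo "nvidia-smi not on PATH at build time"
RUN find / -name nvidia-smi 2>/dev/null || true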

sidharthrajaram commented 2 weeks ago

A workaround for the build-time nvidia-smi call seems to be setting CUDA_COMPUTE_CAP to the compute capability value of your GPU. This resolved the issue.

Based on: https://github.com/huggingface/candle/issues/1516#issuecomment-1875440701
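For example, near the build step of a Dockerfile (a minimal sketch; 86 is just the value for an RTX 30-series card such as a 3090, and inference_server is a placeholder binary name, so substitute your own):

# Pin the compute capability so bindgen_cuda skips its nvidia-smi probe
# (the build script watches this variable, per the cargo:rerun-if-env-changed line above).
# Look the value up on the host, e.g.: nvidia-smi --query-gpu=compute_cap --format=csv,noheader
ENV CUDA_COMPUTE_CAP=86
RUN cargo build --release --bin inference_server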

Pox-here commented 1 week ago

@andrenatal @sidharthrajaram I see you added a workaround, great! I solved this by running the build step as root. Here is a quickly made example Dockerfile that does work for running Candle AI models on the host GPU in Docker and avoids the nvidia-smi-not-in-PATH issue. Note that this is ONLY for dev/testing purposes; please adjust and improve it for your actual use case :)

# Use the NVIDIA CUDA base image
FROM nvidia/cuda:12.3.2-devel-ubuntu22.04

# Set environment variables for the user
ARG USER_ID
ARG USER_NAME
ENV HOME=/home/${USER_NAME}
ENV PATH="${HOME}/.local/bin:${PATH}"

# Install system dependencies
RUN apt-get update -qq \
    && apt-get install -qq -y vim gcc g++ curl git build-essential libssl-dev \
    && rm -rf /var/lib/apt/lists/*

# Create a non-root user and assign the user to the system group
RUN groupadd --system --gid ${USER_ID} ${USER_NAME} \
    && useradd --system -m --no-log-init --home-dir ${HOME} --uid ${USER_ID} --gid ${USER_NAME} ${USER_NAME}

# Set ownership of necessary directories
RUN mkdir -p /app /tmp \
    && chown -R ${USER_NAME}:${USER_NAME} ${HOME} /app /tmp

# Switch to the non-root user
USER ${USER_NAME}

# Install Rust with stable toolchain
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y --default-toolchain stable
ENV PATH="${HOME}/.cargo/bin:${PATH}"

# Copy the source code to the /app directory and set ownership
# (COPY creates root-owned files by default, and a plain `RUN chown` would fail
# here because we are no longer root, so assign ownership during the copy)
COPY --chown=${USER_NAME}:${USER_NAME} . /app/
WORKDIR /app/

# Switch back to root for building the application
USER root

# Build the application using Cargo
RUN cargo build --release --bin inference_server

# Define the default command for running the server
CMD ["./target/release/inference_server"]
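Note that the image switches back to root before building, so the cargo build (and, unless overridden, the container process) runs as root; the compose file below drops back to the unprivileged user at runtime via its user: field.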

A simple docker-compose file to go with it:

version: "3.8"

x-app-template: &APP_TEMPLATE
  user: "${USER_ID:-1000}"
  hostname: "${HOST_NAME:-model_hosting_user}"
  image: model_hosting
  build:
    context: .
    dockerfile: ./Dockerfile
    args:
      USER_NAME: "${USER_NAME:-model_hosting_user}"
      USER_ID: "${USER_ID:-1000}"
  volumes:
    - ./:/app/
  ipc: host
  init: true

services:
  app-dev:
    <<: *APP_TEMPLATE
    container_name: model_hosting_container
    ports:
      - "8443:8443"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    profiles:
      - dev
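To build and start it (assuming the service and profile names above; the variables feed the compose build args):

# Requires the NVIDIA Container Toolkit on the host for the GPU reservation to work.
USER_ID=$(id -u) USER_NAME=$(whoami) docker compose --profile dev up --build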

andrenatal commented 1 week ago

Thanks for the heads-up @Pox-here !