Fixed. Even though I spent two days setting what I thought were the correct paths without success, the issue was ultimately resolved by further modifying certain CUDA paths. Closing since it's not candle/lib-related.
@Pox-here, I'm facing the same issue. What was the fix?
@Pox-here I'm also having the same problem. Multiple articles on HF and nvidia-docker claim that GPUs are not accessible during build time. I searched the entire image for `nvidia-smi` during build time, but nothing was found. I was wondering how you solved this, since `bindgen_cuda` clearly looks for `nvidia-smi`. Would you mind sharing?
Thanks!
A workaround for the build-time `nvidia-smi` call seems to be setting `CUDA_COMPUTE_CAP` to your GPU's compute capability value. This resolved the issue.
Based on: https://github.com/huggingface/candle/issues/1516#issuecomment-1875440701
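For example, a minimal sketch assuming an RTX 3090 (compute capability 8.6; the `nvidia-smi` query flag below requires a reasonably recent driver):

```bash
# Look up your GPU's compute capability on the host (prints e.g. "8.6"):
nvidia-smi --query-gpu=compute_cap --format=csv,noheader

# Set CUDA_COMPUTE_CAP (digits only, no dot) before building:
export CUDA_COMPUTE_CAP=86
cargo build --release
```

In a Dockerfile the same value can be baked in with `ENV CUDA_COMPUTE_CAP=86`, which should let `bindgen_cuda` skip its `nvidia-smi` probe at build time.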
@andrenatal @sidharthrajaram I see you added a workaround, great! I solved this by using root in the build process. Here is a quick example Dockerfile that works for running the candle AI models on the host GPU in Docker and removes the `nvidia-smi` not-in-path issue. This is ONLY for dev/testing purposes; please adjust and improve it further for your actual use case :)
```dockerfile
# Use the NVIDIA CUDA base image
FROM nvidia/cuda:12.3.2-devel-ubuntu22.04

# Set environment variables for the user
ARG USER_ID
ARG USER_NAME
ENV HOME=/home/${USER_NAME}
ENV PATH="${HOME}/.local/bin:${PATH}"

# Install system dependencies
RUN apt-get update -qq \
    && apt-get install -qq -y vim gcc g++ curl git build-essential libssl-dev \
    && rm -rf /var/lib/apt/lists/*

# Create a non-root system group and user
RUN groupadd --system --gid ${USER_ID} ${USER_NAME} \
    && useradd --system -m --no-log-init --home-dir ${HOME} --uid ${USER_ID} --gid ${USER_NAME} ${USER_NAME}

# Set ownership of the necessary directories
RUN mkdir -p /app /tmp \
    && chown -R ${USER_NAME}:${USER_NAME} ${HOME} /app /tmp

# Switch to the non-root user
USER ${USER_NAME}

# Install Rust with the stable toolchain
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y --default-toolchain stable
ENV PATH="${HOME}/.cargo/bin:${PATH}"

# Copy the source code into /app with the correct ownership
# (a plain COPY followed by RUN chown would fail here, since we are no longer root)
COPY --chown=${USER_NAME}:${USER_NAME} . /app/
WORKDIR /app/

# Switch back to root for building the application
USER root

# Build the application using Cargo
RUN cargo build --release --bin inference_server

# Define the default command for running the server
CMD ["./target/release/inference_server"]
```
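If you want to try it without Compose, here is a sketch of a standalone build-and-run, assuming the NVIDIA Container Toolkit is installed on the host; the image name and port simply mirror the compose file below:

```bash
# Build the image, matching the container user to your host user:
docker build \
  --build-arg USER_ID=$(id -u) \
  --build-arg USER_NAME=$(whoami) \
  -t model_hosting .

# Run it with all host GPUs exposed (requires the NVIDIA Container Toolkit):
docker run --rm --gpus all -p 8443:8443 model_hosting
```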
A simple docker-compose file to go with it:
```yaml
version: "3.8"

x-app-template: &APP_TEMPLATE
  user: "${USER_ID:-1000}"
  hostname: "${HOST_NAME:-model_hosting_user}"
  image: model_hosting
  build:
    context: .
    dockerfile: ./Dockerfile
    args:
      USER_NAME: "${USER_NAME:-model_hosting_user}"
      USER_ID: "${USER_ID:-1000}"
  volumes:
    - ./:/app/
  ipc: host
  init: true

services:
  app-dev:
    <<: *APP_TEMPLATE
    container_name: model_hosting_container
    ports:
      - "8443:8443"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    profiles:
      - dev
```
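And a sketch of bringing it up, assuming Docker Compose v2 (which supports the `deploy.resources` GPU reservation outside of Swarm):

```bash
# Export host user info so the build args and runtime user line up,
# then build and start the dev profile:
USER_ID=$(id -u) USER_NAME=$(whoami) docker compose --profile dev up --build
```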
Thanks for the heads-up, @Pox-here!
I'm experiencing a build failure in the candle-kernels library when trying to run an AI application in a Docker environment. The error message indicates that the custom build command failed due to a missing `nvidia-smi`. However, I'm using the nvidia/cuda:12.3.2-devel-ubuntu22.04 Docker image, and I can confirm that the `nvidia-smi` command works and the GPU is operational. The same code runs as intended on the GPU in the host environment.
Error Output:
Has anyone experienced a similar issue, or does anyone have a working Docker setup running NVIDIA GPU inference with Candle?