opensearch-project / ml-commons

ml-commons provides a set of common machine learning algorithms, e.g. k-means, or linear regression, to help developers build ML related features within OpenSearch.
Apache License 2.0

[BUG] GPU Acceleration not functioning with CUDA #3138

Open justicel opened 1 week ago

justicel commented 1 week ago

What is the bug? I attempted to follow the sparse instructions for GPU acceleration using CUDA, here: https://opensearch.org/docs/latest/ml-commons-plugin/gpu-acceleration/

The instructions are written mostly for AWS Neuron, but they suggest that CUDA is definitely supported as well. I built a custom Docker image using the attached Dockerfile; it contains OpenSearch, CUDA, and PyTorch.

However, when I attempt to load a pre-built model, it never shows as utilizing the GPU (per the nvidia-smi output):

opensearch@opensearch-cluster-ml-1:~$ nvidia-smi
Tue Oct 22 17:40:55 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:4E:00.0 Off |                   On |
| N/A   38C    P0             76W /  700W |                  N/A   |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  0   10   0   0  |              13MiB /  9984MiB    | 16      0 |  1   0    1    0    1 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

The 'No running processes found' line definitely means nothing was loaded into the GPU.

I'm uncertain whether something is simply missing, or whether this is expected.
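For anyone who wants to script this check instead of eyeballing the table, the `Processes` section can be read via nvidia-smi's query mode. A minimal sketch (the helper name is mine; it assumes the CSV output of `nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader`, which prints one line per compute process and nothing when the GPU is idle):

```python
def gpu_has_processes(query_output: str) -> bool:
    """Return True if any compute process is resident on the GPU.

    `query_output` is the text produced by:
        nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader
    which emits one "pid, used_memory" line per process, and nothing at
    all when no process has loaded anything onto the GPU.
    """
    return any(line.strip() for line in query_output.splitlines())
```

An empty result, as in the output above, confirms the model never landed in GPU memory.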

How can one reproduce the bug? Steps to reproduce the behavior:

  1. Build the Docker image using the included Dockerfile
  2. Run a cluster with docker-compose or Docker, using the built image and ML-type nodes
  3. Attempt to load a model:
    
    #!/bin/bash -x

    MODEL_GROUP_ID=$(curl -s -u 'admin:admin' -H 'Content-Type: application/json' \
      -X POST https:///_plugins/_ml/model_groups/_register \
      -d '{ "name": "test_model_group", "description": "Default ML model group for opensearch pretrained models" }' \
      | jq .model_group_id)

    curl -u 'admin:admin' -X POST -H 'Content-Type: application/json' \
      https:///_plugins/_ml/models/_register -d "$(cat <<EOF
    {
      "name": "huggingface/sentence-transformers/all-MiniLM-L12-v2",
      "version": "1.0.1",
      "model_group_id": $MODEL_GROUP_ID,
      "model_format": "TORCH_SCRIPT"
    }
    EOF
    )"
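For reference, the two request bodies the script sends can also be built programmatically. A hedged Python sketch of the same payloads (the function names are mine, and the cluster host is omitted here just as it is in the script):

```python
import json


def model_group_payload() -> dict:
    # Body for POST /_plugins/_ml/model_groups/_register
    return {
        "name": "test_model_group",
        "description": "Default ML model group for opensearch pretrained models",
    }


def model_register_payload(model_group_id: str) -> dict:
    # Body for POST /_plugins/_ml/models/_register
    return {
        "name": "huggingface/sentence-transformers/all-MiniLM-L12-v2",
        "version": "1.0.1",
        "model_group_id": model_group_id,
        "model_format": "TORCH_SCRIPT",
    }


# The model_group_id comes from the first call's response, as in the script.
body = json.dumps(model_register_payload("example-group-id"))
```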


What is the expected behavior?
The model should be loaded and installed in GPU memory.

What is your host/environment?

  - OS: Ubuntu
  - Version: 20.04

Dockerfile:

ARG VERSION=2.17.1

FROM public.ecr.aws/opensearchproject/opensearch:$VERSION AS source

FROM pytorch/pytorch:1.13.1-cuda11.6-cudnn8-devel

ARG UID=1000
ARG GID=1000
ARG OPENSEARCH_HOME=/usr/share/opensearch

RUN addgroup --gid $GID opensearch && \
    adduser --uid $UID --gid $GID --home $OPENSEARCH_HOME opensearch

COPY --from=source --chown=$UID:$GID $OPENSEARCH_HOME $OPENSEARCH_HOME
WORKDIR $OPENSEARCH_HOME

RUN echo "export JAVA_HOME=$OPENSEARCH_HOME/jdk" >> /etc/profile.d/java_home.sh && \
    echo "export PATH=\$PATH:\$JAVA_HOME/bin" >> /etc/profile.d/java_home.sh && \
    ls -l $OPENSEARCH_HOME

ENV JAVA_HOME=$OPENSEARCH_HOME/jdk
ENV PATH=$PATH:$JAVA_HOME/bin:$OPENSEARCH_HOME/bin

# Add k-NN lib directory to library loading path variable
ENV LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$OPENSEARCH_HOME/plugins/opensearch-knn/lib"

# Change user
USER $UID

# Setup OpenSearch
# Disable security demo installation during image build, and allow user to disable during startup of the container
# Enable security plugin during image build, and allow user to disable during startup of the container
ARG DISABLE_INSTALL_DEMO_CONFIG=true
ARG DISABLE_SECURITY_PLUGIN=false
RUN ./opensearch-onetime-setup.sh

# Expose ports for the opensearch service (9200 for HTTP and 9300 for internal transport) and performance analyzer (9600 for the agent and 9650 for the root cause analysis component)
EXPOSE 9200 9300 9600 9650

ARG VERSION
ARG BUILD_DATE
ARG NOTES

# CMD to run
ENTRYPOINT ["./opensearch-docker-entrypoint.sh"]
CMD ["opensearch"]



Additional note: hopefully newer versions of PyTorch and CUDA are actually supported?
justicel commented 1 week ago

Oh also the logs from the pod when the model was loaded:

Downloading: 100% |========================================| model_meta_list.jsonl-1]
Downloading: 100% |========================================| config.jsoncluster-ml-1]
[2024-10-22T17:40:17,148][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [opensearch-cluster-ml-1] Cancelling the migration process.
[2024-10-22T17:40:17,205][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [opensearch-cluster-ml-1] Cancelling the migration process.
[2024-10-22T17:40:17,210][INFO ][o.o.m.e.i.MLIndicesHandler] [opensearch-cluster-ml-1] create index:.plugins-ml-model
[2024-10-22T17:40:17,217][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [opensearch-cluster-ml-1] Cancelling the migration process.
[2024-10-22T17:40:17,240][INFO ][o.o.m.m.MLModelManager   ] [opensearch-cluster-ml-1] create new model meta doc UmxQtZIBu7HrnioNsKS8 for register model task K5NQtZIBOJotuW9ArPuc
[2024-10-22T17:40:17,339][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [opensearch-cluster-ml-1] Cancelling the migration process.
Downloading: 100% |========================================| all-MiniLM-L12-v2.zip-1]
[2024-10-22T17:40:33,836][INFO ][o.o.m.m.MLModelManager   ] [opensearch-cluster-ml-1] Model registered successfully, model id: UmxQtZIBu7HrnioNsKS8, task id: K5NQtZIBOJotuW9ArPuc
justicel commented 1 week ago

So I finally got it working. The documentation for CUDA is incomplete and outdated. Given that DJL 0.28.0 is now being used, the documentation's guidance of CUDA 11.6 and PyTorch 1.13.1 is incorrect. The versions should instead come from the supported list here:

https://github.com/deepjavalibrary/djl/blob/master/engines/pytorch/pytorch-engine/README.md#supported-pytorch-versions
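To make the constraint concrete, the pairing reported in this thread can be expressed as a small lookup. This is illustrative only, not the full DJL support matrix (the authoritative list is the pytorch-engine README linked above); the entries reflect only what was tested in this issue:

```python
# Illustrative subset, NOT the full DJL support matrix -- see the
# pytorch-engine README linked above for the authoritative list.
# This thread reports: DJL 0.28.0 works with PyTorch 2.2.2 (CUDA 12.1),
# while the PyTorch 1.13.1 + CUDA 11.6 combo from the docs does not.
DJL_TESTED_PYTORCH = {
    "0.28.0": {"2.2.2"},
}


def is_tested(djl_version: str, pytorch_version: str) -> bool:
    """True if this thread reports the combination as working."""
    return pytorch_version in DJL_TESTED_PYTORCH.get(djl_version, set())
```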

I was able to use PyTorch 2.2.2 and CUDA 12.1 to finally get it launched. There is also quite a lot missing from the docs for building a working Docker image for this. Here's what finally worked for me:

ARG VERSION=2.17.1

# Supported pytorch versions from here: https://github.com/deepjavalibrary/djl/blob/master/engines/pytorch/pytorch-engine/README.md#supported-pytorch-versions
# Note: There is also a cross-section with supported cuda. You can most easily figure it out from the file archive here: https://publish.djl.ai/pytorch/2.2.2/files.txt
ARG PYTORCH_VERSION=2.2.2

FROM public.ecr.aws/opensearchproject/opensearch:$VERSION AS source

FROM pytorch/pytorch:$PYTORCH_VERSION-cuda12.1-cudnn8-devel

ARG UID=1000
ARG GID=1000
ARG OPENSEARCH_HOME=/usr/share/opensearch

RUN addgroup --gid $GID opensearch \
  && adduser --uid $UID --gid $GID --home $OPENSEARCH_HOME opensearch

# Install pytorch components

RUN pip install transformers

COPY --from=source --chown=$UID:$GID $OPENSEARCH_HOME $OPENSEARCH_HOME
WORKDIR $OPENSEARCH_HOME

RUN echo "export JAVA_HOME=$OPENSEARCH_HOME/jdk" >> /etc/profile.d/java_home.sh && \
    echo "export PATH=\$PATH:\$JAVA_HOME/bin" >> /etc/profile.d/java_home.sh && \
    ls -l $OPENSEARCH_HOME

ENV JAVA_HOME=$OPENSEARCH_HOME/jdk
ENV PATH=$PATH:$JAVA_HOME/bin:$OPENSEARCH_HOME/bin
ENV PYTORCH_VERSION=$PYTORCH_VERSION

# Add k-NN lib directory to library loading path variable
ENV LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$OPENSEARCH_HOME/plugins/opensearch-knn/lib"

# Change user
USER $UID

# Setup OpenSearch
# Disable security demo installation during image build, and allow user to disable during startup of the container
# Enable security plugin during image build, and allow user to disable during startup of the container
ARG DISABLE_INSTALL_DEMO_CONFIG=true
ARG DISABLE_SECURITY_PLUGIN=false
RUN ./opensearch-onetime-setup.sh

# Expose ports for the opensearch service (9200 for HTTP and 9300 for internal transport) and performance analyzer (9600 for the agent and 9650 for the root cause analysis component)
EXPOSE 9200 9300 9600 9650

ARG VERSION
ARG BUILD_DATE
ARG NOTES

# CMD to run
ENTRYPOINT ["./opensearch-docker-entrypoint.sh"]
CMD ["opensearch"]
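A build-and-run invocation for the image above might look like the following (the image tag is my own choice; `--gpus all` requires the NVIDIA Container Toolkit on the host, and without it the container cannot see the GPU at all):

```shell
docker build \
  --build-arg VERSION=2.17.1 \
  --build-arg PYTORCH_VERSION=2.2.2 \
  -t opensearch-cuda:2.17.1 .

docker run --gpus all -p 9200:9200 -p 9600:9600 opensearch-cuda:2.17.1
```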