justicel opened 1 week ago
Oh also the logs from the pod when the model was loaded:
Downloading: 100% |========================================| model_meta_list.jsonl
Downloading: 100% |========================================| config.json
[2024-10-22T17:40:17,148][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [opensearch-cluster-ml-1] Cancelling the migration process.
[2024-10-22T17:40:17,205][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [opensearch-cluster-ml-1] Cancelling the migration process.
[2024-10-22T17:40:17,210][INFO ][o.o.m.e.i.MLIndicesHandler] [opensearch-cluster-ml-1] create index:.plugins-ml-model
[2024-10-22T17:40:17,217][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [opensearch-cluster-ml-1] Cancelling the migration process.
[2024-10-22T17:40:17,240][INFO ][o.o.m.m.MLModelManager ] [opensearch-cluster-ml-1] create new model meta doc UmxQtZIBu7HrnioNsKS8 for register model task K5NQtZIBOJotuW9ArPuc
[2024-10-22T17:40:17,339][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [opensearch-cluster-ml-1] Cancelling the migration process.
Downloading: 100% |========================================| all-MiniLM-L12-v2.zip
[2024-10-22T17:40:33,836][INFO ][o.o.m.m.MLModelManager ] [opensearch-cluster-ml-1] Model registered successfully, model id: UmxQtZIBu7HrnioNsKS8, task id: K5NQtZIBOJotuW9ArPuc
So I finally got it working. The documentation for CUDA is incomplete and outdated. Given that DJL 0.28.0 is now being used, the documentation's references to CUDA 11.6 and PyTorch 1.13.1 are incorrect. The versions should instead come from the supported list here: https://github.com/deepjavalibrary/djl/blob/master/engines/pytorch/pytorch-engine/README.md#supported-pytorch-versions
I was able to get it launched using PyTorch 2.2.2 and CUDA 12.1. There is also quite a lot missing from the docs for building a working Docker image for this. Here's what finally worked for me:
ARG VERSION=2.17.1
# Supported pytorch versions from here: https://github.com/deepjavalibrary/djl/blob/master/engines/pytorch/pytorch-engine/README.md#supported-pytorch-versions
# Note: Supported CUDA versions are cross-referenced there. You can most easily figure them out from the file archive here: https://publish.djl.ai/pytorch/2.2.2/files.txt
ARG PYTORCH_VERSION=2.2.2
FROM public.ecr.aws/opensearchproject/opensearch:$VERSION AS source
FROM pytorch/pytorch:$PYTORCH_VERSION-cuda12.1-cudnn8-devel
ARG UID=1000
ARG GID=1000
ARG OPENSEARCH_HOME=/usr/share/opensearch
RUN addgroup --gid $GID opensearch \
&& adduser --uid $UID --gid $GID --home $OPENSEARCH_HOME opensearch
# Install pytorch components
RUN pip install transformers
COPY --from=source --chown=$UID:$GID $OPENSEARCH_HOME $OPENSEARCH_HOME
WORKDIR $OPENSEARCH_HOME
RUN echo "export JAVA_HOME=$OPENSEARCH_HOME/jdk" >> /etc/profile.d/java_home.sh && \
echo "export PATH=\$PATH:\$JAVA_HOME/bin" >> /etc/profile.d/java_home.sh && \
ls -l $OPENSEARCH_HOME
ENV JAVA_HOME=$OPENSEARCH_HOME/jdk
ENV PATH=$PATH:$JAVA_HOME/bin:$OPENSEARCH_HOME/bin
ENV PYTORCH_VERSION=$PYTORCH_VERSION
# Add k-NN lib directory to library loading path variable
ENV LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$OPENSEARCH_HOME/plugins/opensearch-knn/lib"
# Change user
USER $UID
# Setup OpenSearch
# Disable security demo installation during image build, and allow user to disable during startup of the container
# Enable security plugin during image build, and allow user to disable during startup of the container
ARG DISABLE_INSTALL_DEMO_CONFIG=true
ARG DISABLE_SECURITY_PLUGIN=false
RUN ./opensearch-onetime-setup.sh
# Expose ports for the opensearch service (9200 for HTTP and 9300 for internal transport) and performance analyzer (9600 for the agent and 9650 for the root cause analysis component)
EXPOSE 9200 9300 9600 9650
ARG VERSION
ARG BUILD_DATE
ARG NOTES
# CMD to run
ENTRYPOINT ["./opensearch-docker-entrypoint.sh"]
CMD ["opensearch"]
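Nothing in the Dockerfile itself exposes the GPU to the container; the device has to be passed through at run time. A minimal sketch of that, assuming the image above was built and tagged as opensearch-cuda (a hypothetical tag), using Docker Compose's device-reservation syntax:

```yaml
services:
  opensearch:
    image: opensearch-cuda   # hypothetical tag for the image built above
    ports:
      - "9200:9200"
      - "9600:9600"
    deploy:
      resources:
        reservations:
          devices:
            # Reserve one NVIDIA GPU for this container
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```

The equivalent with plain docker is `docker run --gpus all ...`; without one of these, nvidia-smi inside the container will never show anything.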
What is the bug? I was attempting to follow the sparse instructions for GPU acceleration using CUDA, here: https://opensearch.org/docs/latest/ml-commons-plugin/gpu-acceleration/
The instructions are mostly for AWS Neuron, but they suggest that CUDA is definitely usable. I built a custom Docker image using the attached Dockerfile; it contains OpenSearch, CUDA, and PyTorch.
However, on attempting to load a pre-built model, it never shows as utilizing the GPU (per nvidia-smi output):
The "No running processes found" line definitely means nothing was loaded into the GPU.
I'm uncertain whether there's simply something missing, or whether this is expected behavior.
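The nvidia-smi check above can be scripted; this is a sketch that degrades gracefully on hosts without an NVIDIA driver (--query-compute-apps narrows the output to compute processes only):

```shell
# List GPU compute processes, or note the driver's absence.
if command -v nvidia-smi >/dev/null 2>&1; then
  GPU_PROCS=$(nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv)
else
  GPU_PROCS="nvidia-smi not found"
fi
echo "$GPU_PROCS"
```

If the model were actually on the GPU, a Java process from the OpenSearch JVM would appear in this list.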
How can one reproduce the bug? Steps to reproduce the behavior:
Deploy a cluster with ml type nodes, then register a model group:
MODEL_GROUP_ID=$(curl -s -u 'admin:admin' -H 'Content-Type: application/json' -X POST https:///_plugins/_ml/model_groups/_register -d '
{
"name": "test_model_group",
"description": "Default ML model group for opensearch pretrained models"
}' | jq .model_group_id)
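Note that the jq call deliberately omits -r: without it, jq keeps the JSON quotes around the id, so the value can be spliced directly into the next request body. A self-contained sketch of that quoting behavior, using a canned response in place of a live cluster (abc123 is a made-up id):

```shell
# Canned stand-in for the _register API reply (abc123 is a fabricated id).
RESPONSE='{"model_group_id":"abc123","status":"CREATED"}'

# Without -r, jq emits the value with its surrounding JSON quotes.
MODEL_GROUP_ID=$(echo "$RESPONSE" | jq .model_group_id)
echo "$MODEL_GROUP_ID"    # prints "abc123" (quotes included)

# Spliced into a JSON body, the result is still valid JSON.
cat <<EOF
{ "model_group_id": $MODEL_GROUP_ID }
EOF
```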
curl -u 'admin:admin' -X POST -H 'Content-Type: application/json' https:///_plugins/_ml/models/_register -d "$(cat <<EOF
{
"name": "huggingface/sentence-transformers/all-MiniLM-L12-v2",
"version": "1.0.1",
"model_group_id": $MODEL_GROUP_ID,
"model_format": "TORCH_SCRIPT"
}
EOF
)"
ARG VERSION=2.17.1
FROM public.ecr.aws/opensearchproject/opensearch:$VERSION AS source
FROM pytorch/pytorch:1.13.1-cuda11.6-cudnn8-devel
ARG UID=1000
ARG GID=1000
ARG OPENSEARCH_HOME=/usr/share/opensearch
RUN addgroup --gid $GID opensearch && \
    adduser --uid $UID --gid $GID --home $OPENSEARCH_HOME opensearch
COPY --from=source --chown=$UID:$GID $OPENSEARCH_HOME $OPENSEARCH_HOME
WORKDIR $OPENSEARCH_HOME
RUN echo "export JAVA_HOME=$OPENSEARCH_HOME/jdk" >> /etc/profile.d/java_home.sh && \
    echo "export PATH=\$PATH:\$JAVA_HOME/bin" >> /etc/profile.d/java_home.sh && \
    ls -l $OPENSEARCH_HOME
ENV JAVA_HOME=$OPENSEARCH_HOME/jdk
ENV PATH=$PATH:$JAVA_HOME/bin:$OPENSEARCH_HOME/bin
# Add k-NN lib directory to library loading path variable
ENV LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$OPENSEARCH_HOME/plugins/opensearch-knn/lib"
# Change user
USER $UID
# Setup OpenSearch
# Disable security demo installation during image build, and allow user to disable during startup of the container
# Enable security plugin during image build, and allow user to disable during startup of the container
ARG DISABLE_INSTALL_DEMO_CONFIG=true
ARG DISABLE_SECURITY_PLUGIN=false
RUN ./opensearch-onetime-setup.sh
# Expose ports for the opensearch service (9200 for HTTP and 9300 for internal transport) and performance analyzer (9600 for the agent and 9650 for the root cause analysis component)
EXPOSE 9200 9300 9600 9650
ARG VERSION
ARG BUILD_DATE
ARG NOTES
# CMD to run
ENTRYPOINT ["./opensearch-docker-entrypoint.sh"]
CMD ["opensearch"]