ucbrise / clipper

A low-latency prediction-serving system
http://clipper.ai
Apache License 2.0

Support deploying models with GPU access #338

Open dcrankshaw opened 6 years ago

dcrankshaw commented 6 years ago

For Kubernetes, we can use the experimental GPU support feature: https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/

For Docker, we can use nvidia-docker.
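On the Kubernetes side, scheduling a model container onto a GPU node is done with an `nvidia.com/gpu` resource limit. A minimal sketch (pod name and image are placeholders, not actual Clipper images):

```yaml
# Hypothetical pod spec: requests one GPU from the scheduler.
apiVersion: v1
kind: Pod
metadata:
  name: clipper-model-gpu
spec:
  containers:
    - name: model-container
      image: example/tf-gpu-model-container   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1   # node must have the NVIDIA device plugin installed
```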

santi81 commented 6 years ago

@dcrankshaw I have done some work on this; I'd like to pick it up if that's fine with you guys.

dcrankshaw commented 6 years ago

Sure that would be great. Have you worked with the Kubernetes GPU support? Go ahead and assign the issue to yourself.

lgendrot commented 6 years ago

I've been paying rather close attention to this issue, so I'm just wondering if there's been any behind-the-scenes movement on it? Feels like a major value add for clipper.

dcrankshaw commented 6 years ago

I just implemented this. It still needs a bit of testing, but I should have a PR up by the end of the week.


robi56 commented 6 years ago

Hi @dcrankshaw, I tried using nvidia-docker. I installed the nvidia-docker package on my local machine and started the model-server container via nvidia-docker so it could access the machine's GPU resources, but the model server still doesn't get GPU access.

simon-mo commented 6 years ago

For the latest nvidia-docker, I believe you need to pass `runtime="nvidia"` to `docker.containers.run`.
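With the Docker Python SDK (docker-py), that might look like the following sketch; the image and command here are placeholders, not Clipper specifics:

```python
def run_with_gpu(client, image, command=None):
    """Start a container under the NVIDIA runtime so it can see host GPUs.

    `client` is a docker SDK client (e.g. docker.from_env()); the "nvidia"
    runtime is only registered once nvidia-docker2 is installed on the host.
    """
    return client.containers.run(image, command, runtime="nvidia", detach=True)

# usage (assumes a local docker daemon with the nvidia runtime):
#   import docker
#   run_with_gpu(docker.from_env(), "nvidia/cuda:9.2-base", "nvidia-smi")
```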

cwtan501 commented 6 years ago

After weeks of research and trial and error, I have finally gotten Clipper to work with GPU and TensorFlow. I think it is worth sharing my experience with those who are also looking at this issue. I will try my best to make the steps clear and concise, summarized as follows:

1. To enable GPU support for Clipper, you first need to install nvidia-docker. For detailed steps, refer to: https://github.com/NVIDIA/nvidia-docker
2. Build your own nvidia docker image, which will serve as the base image when you build and deploy your Clipper model container. I referred to the following:
    ◦ https://gitlab.com/nvidia/cuda/blob/ubuntu16.04/9.0/base/Dockerfile
    ◦ https://gitlab.com/nvidia/cuda/blob/ubuntu16.04/9.0/runtime/Dockerfile
    and constructed my own Dockerfile to build an nvidia docker image with the CUDA runtime. Do not hesitate to overwrite the default PATH and LD_LIBRARY_PATH; I observed that they were not pointing to the right folders. Instead, use the following values:

```dockerfile
ENV PATH /usr/local/cuda:${PATH}
ENV LD_LIBRARY_PATH /usr/local/cuda/lib64
```

    Also, you are expected to install the following packages:
• Python3
• python3-pip
• libzmq5
• redis-server
• libsodium18
• build-essential

    and also the python packages:
• cloudpickle
• pyzmq
• prometheus_client
• pyyaml
• jsonschema
• redis
• psutil
• flask
• numpy

3. Next, please ensure you have also installed cuDNN (refer to: https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html). In my case, as I already had the required files on my host machine, all I needed to do was copy them over to the docker image.
4. Make a first-tier directory in your docker image and name it /container.
5. Copy the following files from your host into /container in the docker image:

```dockerfile
COPY containers/python/__init__.py containers/python/tf_container.py containers/python/container_entry.sh containers/python/rpc.py /container/
COPY monitoring/metrics_config.yaml /container/
```

In case you are unsure where to get those files, here is the link: https://github.com/ucbrise/clipper/tree/develop/containers/python

Next, make a minor revision to rpc.py at line 757:

From: `cmd = ['python', '-m', 'clipper_admin.metrics.server']`
To: `cmd = ['python3', '-m', 'clipper_admin.metrics.server']`

6. Upgrade pip3 to the newest version:

```dockerfile
RUN pip3 install --upgrade pip
```

7. Install tensorflow-gpu and clipper_admin.
8. Set the following:

```dockerfile
ENV CLIPPER_MODEL_PATH=/model

CMD ["/container/container_entry.sh", "tensorflow-container", "/container/tf_container.py"]
HEALTHCHECK --interval=3s --timeout=3s --retries=1 CMD test -f /model_is_ready.check || exit 1
```

Note: the HEALTHCHECK statement is important, as clipper_admin needs this information when starting your model.

9. Modify /etc/docker/daemon.json by adding the following entry:

```json
"default-runtime": "nvidia",
```

and then restart the docker service to make the above configuration effective.

Now you are ready to kick-start Clipper with GPU support. I hope the aforementioned steps are useful to you all. I have also provided a docker template here: https://github.com/cwtan501/nvidia_tf_template
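For reference, a complete /etc/docker/daemon.json with the step-9 entry might look like the following; the `runtimes` block is normally written by the nvidia-docker2 package, and the runtime path may differ on your system:

```json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```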

wcwang07 commented 5 years ago

Hi @cwtan501, I tried to run your template on an AWS p2 instance, but the docker image build failed at:

```
Step 22/35 : RUN apt-get update && apt-get install -y --no-install-recommends cuda-libraries-$CUDA_PKG_VERSION cuda-cublas-9-0=9.0.176.4-1 libnccl2=$NCCL_VERSION-1+cuda9.0 && apt-mark hold libnccl2 && rm -rf /var/lib/apt/lists/
 ---> Using cache
 ---> b272447bbe1b
Step 23/35 : RUN mkdir -p /usr/local/cuda/include
 ---> Using cache
 ---> 7ab576eadefb
Step 24/35 : COPY /cuda/include/ /usr/local/cuda/include/
COPY failed: no source files were specified
```

simon-mo commented 5 years ago

Now that NVIDIA provides CUDA docker images, we can try:

```dockerfile
FROM nvidia/cuda:9.2-cudnn7-runtime

# install binary dependencies first (the base image ships without pip,
# so python3/python3-pip must be installed before any pip3 call below)
RUN mkdir -p /model \
      && apt-get update -qq \
      && apt-get install -y -qq python3 python3-pip \
         libzmq5 libzmq5-dev redis-server libsodium18 build-essential

# alias python3 -> python
RUN echo '#!/bin/bash\npython3 "$@"' > /usr/bin/python && \
    chmod +x /usr/bin/python

# install python dependencies
RUN pip3 install cloudpickle==0.5.* pyzmq==17.0.* requests==2.18.* scikit-learn==0.19.* \
  numpy==1.14.* pyyaml==3.12.* docker==3.1.* kubernetes==5.0.* tensorflow==1.6.* mxnet==1.1.* pyspark==2.3.* \
  xgboost==0.7.*

# make sure you run this inside the clipper directory
COPY clipper_admin /clipper_admin/

RUN cd /clipper_admin \
    && pip install -q .

WORKDIR /container

COPY containers/python/__init__.py containers/python/rpc.py /container/

COPY monitoring/metrics_config.yaml /container/

ENV CLIPPER_MODEL_PATH=/model

HEALTHCHECK --interval=3s --timeout=3s --retries=1 CMD test -f /model_is_ready.check || exit 1

RUN pip install -q tensorflow==1.6.*

COPY containers/python/tf_container.py containers/python/container_entry.sh /container/

CMD ["/container/container_entry.sh", "tensorflow-container", "/container/tf_container.py"]
```

Make sure you run `docker build` inside the clipper directory; a fresh `git clone` should do.

RehanSD commented 5 years ago

Hi @wcwang07 ! Clipper is adding native support for PyTorch and TF on CUDA 10! I've made a PR adding support for PyTorch + CUDA 10 on docker, and will be rolling out TF support soon. This can be run on an AWS p2 instance. Make sure to choose the Deep Learning AMI (Ubuntu) Version 21

wcwang07 commented 5 years ago

@simon-mo @RehanSD I was using `FROM tensorflow/tensorflow:latest-gpu-py3`; this image seems to resolve the issue with finding the GPU:0 device.

wcwang07 commented 5 years ago

@simon-mo I ran this new GPU container with the following stats:

```
recv: 0.000223 s, parse: 0.000013 s, handle: 0.157390 s
```

Check it out at:

```
docker pull wcwang07/test-gpu-container
```

```python
clipper_conn.register_application(name="hello-tf", input_type="int", default_output="this is default output", slo_micros=3000000)
```

https://gist.github.com/wcwang07/aef2d54c134f7c43e726bf9d027770c9

```python
python_deployer.deploy_tensorflow_model(clipper_conn=clipper_conn, name="tf-mobilnet", version=1, input_type="int", func=predict, tf_sess_or_saved_model_path='***', base_image='test-gpu-container', pkgs_to_install=['pillow'])

clipper_conn.link_model_to_app(app_name="hello-tf", model_name="tf-mobilnet")
```
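Once the model is linked, the app can be queried over Clipper's REST interface (`POST /<app>/predict` on the query frontend). A small sketch of building such a request; the address would come from something like `clipper_conn.get_query_addr()`, and the app name matches the one registered above:

```python
import json

def build_predict_request(query_addr, app_name, input_vector):
    """Build the URL and JSON body for a Clipper predict call.

    Apps registered with input_type="int" expect an integer vector
    under the "input" key of the JSON body.
    """
    url = "http://%s/%s/predict" % (query_addr, app_name)
    body = json.dumps({"input": input_vector})
    return url, body

# usage (assumes the `requests` package and a running Clipper cluster):
#   url, body = build_predict_request("localhost:1337", "hello-tf", [1, 2, 3])
#   requests.post(url, headers={"Content-Type": "application/json"}, data=body)
```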

RehanSD commented 5 years ago

Docker support is addressed in PR #669.