How do I deploy multiple models to an endpoint on AWS SageMaker through TorchServe?

I'm not sure if this is the right place to ask, but is there support for multi-model endpoint deployment with SageMaker for TorchServe? Using TorchServe, I tried creating a multi-model endpoint to deploy two Torch models placed in an S3 bucket. I used sagemaker.multidatamodel.MultiDataModel to create this endpoint, but deployment fails with "The primary container for production variant AllTraffic did not pass the ping health check." When I check the CloudWatch logs, there is an error message: "ACCESS_LOG ... "GET /models HTTP/1.1" 404 0"

Note: TorchServe single-model endpoint creation works well with SageMaker (when following this link).

@pranavpsv I'll investigate this further. Is there a specific Dockerfile you are using for the SageMaker container?
FROM ubuntu:18.04

ENV PYTHONUNBUFFERED TRUE
LABEL com.amazonaws.sagemaker.capabilities.multi-models=true

RUN apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install --no-install-recommends -y \
    fakeroot \
    ca-certificates \
    dpkg-dev \
    g++ \
    python3-dev \
    openjdk-11-jdk \
    curl \
    vim \
    && rm -rf /var/lib/apt/lists/* \
    && cd /tmp \
    && curl -O https://bootstrap.pypa.io/get-pip.py \
    && python3 get-pip.py

RUN update-alternatives --install /usr/bin/python python /usr/bin/python3 1
RUN update-alternatives --install /usr/local/bin/pip pip /usr/local/bin/pip3 1

RUN pip install --no-cache-dir psutil torch torchvision

ADD serve serve
RUN pip install ../serve/

COPY dockerd-entrypoint.sh /usr/local/bin/dockerd-entrypoint.sh
RUN chmod +x /usr/local/bin/dockerd-entrypoint.sh

# Added to fix the first error on CloudWatch, which said "model-store not found: /opt/ml/model"
RUN mkdir -p /opt/ml/model
RUN mkdir -p /home/model-server/ && mkdir -p /home/model-server/tmp
COPY config.properties /home/model-server/config.properties

WORKDIR /home/model-server
ENV TEMP=/home/model-server/tmp
ENTRYPOINT ["/usr/local/bin/dockerd-entrypoint.sh"]
CMD ["serve"]
Thank you. The above is the contents of my Dockerfile. I added a LABEL at the top for multi-model support; otherwise it is the same Dockerfile found at this link. Here is my config.properties:
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
number_of_netty_threads=32
job_queue_size=1000
model_store=/opt/ml/model
This is otherwise the same config.properties as in the link above.
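For local debugging, here is a minimal sketch of how one might probe this container (assuming it was started locally with something like `docker run -p 8080:8080 -p 8081:8081 <image> serve`; the image name and published ports are assumptions). As far as I can tell, SageMaker's health check issues GET /ping on the inference port (8080), and its multi-model frontend drives model loading and listing through calls such as GET /models on that same port, whereas TorchServe only answers model-management calls on the management port (8081, per the config above), which would be consistent with the 404 in the CloudWatch logs:

import requests

# SageMaker's health check: GET /ping on the inference port (8080).
ping = requests.get("http://localhost:8080/ping", timeout=5)
print("ping:", ping.status_code, ping.text)

# TorchServe lists registered models on the *management* port (8081)...
models = requests.get("http://localhost:8081/models", timeout=5)
print("management /models:", models.json())

# ...but the multi-model frontend calls GET /models on the inference
# port (8080), which TorchServe does not serve -> 404, matching the
# ACCESS_LOG entry in CloudWatch.
print("inference /models:", requests.get("http://localhost:8080/models", timeout=5).status_code)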
import time

from sagemaker.model import Model
from sagemaker.multidatamodel import MultiDataModel
from sagemaker.predictor import RealTimePredictor

# bucket_name, prefix, model_file_name, image, and role are defined
# earlier in the notebook.
model_data_prefix = f's3://{bucket_name}/{prefix}/'
model_data = f's3://{bucket_name}/{prefix}/models/{model_file_name}.tar.gz'
sm_model_name = 'torchserve-densenet161'

torchserve_model = Model(model_data=model_data,
                         image=image,
                         role=role,
                         predictor_cls=RealTimePredictor,
                         name=sm_model_name)

multi_model = MultiDataModel(name=sm_model_name,
                             model_data_prefix=model_data_prefix,
                             model=torchserve_model)
multi_model.add_model(torchserve_model.model_data)

endpoint_name = 'torchserve-endpoint-' + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
predictor = multi_model.deploy(instance_type='ml.m4.xlarge',
                               initial_instance_count=1,
                               endpoint_name=endpoint_name)
I was able to spin up an endpoint with the above changes. However, the model doesn't get registered on server start-up, and running inference raises the following exception:
ValidationError: An error occurred (ValidationError) when calling the InvokeEndpoint operation: Endpoint torchserve-endpoint-2020-05-28-23-36-41 is not a multi-model endpoint and does not support target model header.
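For context, this is the kind of call that triggers that error: invoking a multi-model endpoint requires naming the target model archive, which the runtime passes as the TargetModel header. A minimal sketch with boto3 (the payload and content type are assumptions):

import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,                      # from the deployment code above
    TargetModel=f"models/{model_file_name}.tar.gz",  # S3 key relative to model_data_prefix
    ContentType="application/x-image",               # assumed payload type
    Body=payload,                                    # assumed request body, e.g. image bytes
)
print(response["Body"].read())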
Will reach out to the SageMaker team.
@pranavpsv MultiModel BYOCs have an explicit dependency on MultiModelServer. Supporting TorchServe would require changes from the SageMaker team.
Thank you for the information and update!
@maaquib Could you raise this with that team? Is there anything that can be done from our end to help expedite this process?
We would like to know the status of this issue. Is this feature/fix on the SageMaker team's near-term roadmap?
@sivashakthi TorchServe is now the default inference server for PyTorch models in SageMaker. It should work. Please let us know if you run into any issues.
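For anyone finding this later, a minimal sketch of what a multi-model deployment might look like with the TorchServe-backed PyTorch containers (the framework version, entry-point script, bucket, and names here are assumptions, not a verified recipe):

import time

from sagemaker.pytorch import PyTorchModel
from sagemaker.multidatamodel import MultiDataModel

# bucket_name, prefix, model_file_name, and role are hypothetical;
# substitute your own values.
pytorch_model = PyTorchModel(
    model_data=f"s3://{bucket_name}/{prefix}/models/{model_file_name}.tar.gz",
    role=role,
    entry_point="inference.py",  # assumed handler script
    framework_version="1.8",     # assumed PyTorch version
    py_version="py3",
)

mme = MultiDataModel(
    name="torchserve-mme-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime()),
    model_data_prefix=f"s3://{bucket_name}/{prefix}/",
    model=pytorch_model,
)

predictor = mme.deploy(initial_instance_count=1,
                       instance_type="ml.m4.xlarge")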