Closed tylertitsworth closed 1 week ago
Before it gets asked: yes, I have tried to capture logs from within the deployed container. However, the container does not even start, so no other logs are recorded (aside from the liveness probe and queue-proxy failures).
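For anyone hitting the same wall: even when the container never fully starts, the previous (crashed) instance's logs and the pod events can sometimes still be pulled. A minimal sketch — the pod name here is a placeholder, and `kserve-container` is KServe's conventional predictor container name; commands are echoed rather than executed so the sketch can be dry-run:

```shell
# Hedged sketch: pod/namespace names are placeholders, not from this issue.
POD=torchserve-predictor-xyz   # hypothetical pod name
NS=default
RUN=echo                       # dry-run; set RUN="" to execute against a real cluster
# Logs from the previous (crashed) container instance:
$RUN kubectl -n "$NS" logs "$POD" -c kserve-container --previous
# Pod events usually show the failure reason before any app log exists:
$RUN kubectl -n "$NS" describe pod "$POD"
```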
Thanks for reporting; looking into this. I was able to reproduce the error. We didn't move to 22.04 earlier because the Ubuntu 22.04 runners were flaky. I will try running CI on 22.04 to see if it's resolved now.
@tylertitsworth Please pull the submodules before you build the kfs image:

```
git submodule update --init --recursive
```
I am able to build it with 22.04 after doing this:

```
docker image inspect pytorch/torchserve-kfs:latest-cpu | grep "org.opencontainers.image.version"
        "org.opencontainers.image.version": "22.04"
```
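For reference, the end-to-end sequence suggested above can be sketched as follows; the `kubernetes/kserve/build_image.sh` path is an assumption from the pytorch/serve repo layout, and the commands are echoed so the sketch can be dry-run:

```shell
# Hedged sketch: clone, pull submodules, then build the kfs image.
RUN=echo   # dry-run; set RUN="" to actually clone and build
$RUN git clone https://github.com/pytorch/serve.git
$RUN cd serve                               # echoed in dry-run mode
$RUN git submodule update --init --recursive
$RUN ./kubernetes/kserve/build_image.sh     # script name/path assumed
```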
@agunapal In the build script I use to build this container, I already pull the submodules (https://github.com/intel/ai-containers/blob/main/pytorch/serving/build-kfs.sh#L9). I am able to build the container; my issue is when it is deployed to k8s.
@agunapal any update on this? Is there any misunderstanding I can help alleviate?
Hi @tylertitsworth I understand the problem. I will get back to you this week.
On Ubuntu 22.04, I tried running the gRPC test cases; these worked:
```
test_gRPC_inference_api.py::test_inference_apis PASSED [ 21%]
test_gRPC_inference_api.py::test_inference_stream_apis 2024-04-06T18:20:11,945 [INFO ] W-9024-echo_stream_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9024-echo_stream_1.0-stderr
PASSED [ 21%]
test_gRPC_inference_api.py::test_inference_stream2_apis PASSED [ 22%]
test_gRPC_management_apis.py::test_management_apis PASSED
```
So it may be something specific to docker/kserve. I will try the steps you have mentioned.
This issue has been remediated with the latest version of torchserve
🐛 Describe the bug
The public TorchServe KFS image that was recently updated for 0.10.0 has `ubuntu:20.04` as its base. Intel is publishing an Intel-optimized version of both the torchserve and torchserve-kfs images, which includes Intel Extension for PyTorch. However, due to Intel's Security First policies, we use `ubuntu:22.04` as our base image for both containers (soon to be `ubuntu:24.04`).

When we deploy with the latest `0.10.0` version of torchserve on kserve, the image immediately enters the `CrashLoopBackOff` state due to the following error: `exec /usr/local/bin/dockerd-entrypoint.sh: exec format error`.

We determined that the solution to this issue was to change the base back to `ubuntu:20.04`; however, this means that anyone who intends to create a custom torchserve-kfs container won't be able to use the `ubuntu:rolling` base specified in https://github.com/pytorch/serve/blob/master/docker/Dockerfile#L19.

This issue is not present in the previous version my team published, only with the latest kserve and torchserve versions, and I was not able to reproduce it from the command line, only in my cluster.
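As background for triage: an `exec format error` from the kernel generally means the file's leading bytes are not a recognized executable header — a binary built for another CPU architecture, a missing `#!` shebang, or CRLF line endings corrupting the shebang. A minimal local check, using a hypothetical stand-in file (inspect the real `/usr/local/bin/dockerd-entrypoint.sh` inside the image instead):

```shell
# Stand-in for the entrypoint, for illustration only.
printf '#!/bin/bash\nexec "$@"\n' > /tmp/entrypoint-check.sh
# A script's first two bytes must be '#!':
head -c 2 /tmp/entrypoint-check.sh; echo
# Count lines containing a carriage return (0 means no CRLF endings):
grep -c "$(printf '\r')" /tmp/entrypoint-check.sh || true
# On the real image, also compare architectures:
#   docker image inspect --format '{{.Architecture}}' pytorch/torchserve-kfs:latest-cpu
#   kubectl get nodes -o jsonpath='{.items[0].status.nodeInfo.architecture}'
```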
Error logs
When using `ubuntu:23.10`, it fails during build time. But I am more interested in the output with `ubuntu:22.04`, which fails during deployment.

Installation instructions
Install TorchServe from source? No
Are you using Docker? Yes
Model Packaging
n/a
config.properties
n/a
Versions
With `ubuntu:22.04` as base:

With `ubuntu:20.04` as base:

Repro instructions
From https://github.com/intel/ai-containers:

```
docker tag intel/aiops/mlops-ci:b-0-ubuntu-22.04-pip-py3.10-torchserve intel/torchserve:latest
```

Update `kserve-torchserve` to use the new image.

Possible Solution
No response