pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

Newer Base Image KServe Container fails with exec /usr/local/bin/dockerd-entrypoint.sh: exec format error #3033

Closed tylertitsworth closed 1 week ago

tylertitsworth commented 5 months ago

🐛 Describe the bug

The public TorchServe KFS Image that was recently updated for 0.10.0 has ubuntu:20.04 as its base.

$ docker image inspect pytorch/torchserve-kfs:0.10.0 | grep "org.opencontainers.image.version"
                "org.opencontainers.image.version": "20.04"

Intel is publishing an Intel-optimized version of both the torchserve and torchserve-kfs images, which includes Intel Extension for PyTorch. However, due to Intel's Security First policies, we use ubuntu:22.04 as our base image for both containers (soon to be ubuntu:24.04).

When we deploy with the latest 0.10.0 version of torchserve on kserve, the pod immediately enters the CrashLoopBackOff state due to the following error: exec /usr/local/bin/dockerd-entrypoint.sh: exec format error.

We determined that the solution to this issue was to change the base back to ubuntu:20.04; however, this means that anyone who intends to create a custom torchserve-kfs container won't be able to use the ubuntu:rolling base specified in https://github.com/pytorch/serve/blob/master/docker/Dockerfile#L19.

This issue was not present in the previous version my team published; it occurs only with the latest kserve and torchserve versions. I was also not able to reproduce it from the command line, only in my cluster.
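For context (my own diagnostic sketch, not part of the original report): exec format error means the kernel refused to execute the entrypoint, which usually comes down to a CPU-architecture mismatch or a broken/missing shebang line. A small POSIX-shell check of the file's magic bytes can narrow it down:

```shell
#!/bin/sh
# Sketch: classify a file by its first bytes to explain "exec format error".
# Assumes POSIX head/od/tr; the argument is whatever the kernel refused to
# exec (e.g. /usr/local/bin/dockerd-entrypoint.sh).
classify_exec() {
  magic=$(head -c 4 "$1" | od -An -t x1 | tr -d ' \n')
  case "$magic" in
    7f454c46) echo "ELF binary" ;;      # compare image arch vs node arch
    2321*)    echo "shebang script" ;;  # interpreter must exist inside the image
    *)        echo "no shebang" ;;      # kernel cannot exec -> format error
  esac
}
```

Running classify_exec on the entrypoint inside the image (for example via docker run --entrypoint sh) shows which of the failure modes applies.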

Error logs

When using ubuntu:23.10, it fails at build time:

$ ./build-image.sh
...
#11 4.706   Downloading grpcio-tools-1.48.2.tar.gz (2.2 MB)
#11 4.827      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.2/2.2 MB 18.7 MB/s eta 0:00:00
#11 5.054   Preparing metadata (setup.py): started
#11 5.230   Preparing metadata (setup.py): finished with status 'error'
#11 5.234   error: subprocess-exited-with-error
#11 5.234   
#11 5.234   × python setup.py egg_info did not run successfully.
#11 5.234   │ exit code: 1
#11 5.234   ╰─> [16 lines of output]
#11 5.234       /home/model-server/tmp/pip-install-hni50hgy/grpcio-tools_f16dab96a18c4c7b886e38061d477973/setup.py:30: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
#11 5.234         import pkg_resources
#11 5.234       Traceback (most recent call last):
#11 5.234         File "<string>", line 2, in <module>
#11 5.234         File "<pip-setuptools-caller>", line 34, in <module>
#11 5.234         File "/home/model-server/tmp/pip-install-hni50hgy/grpcio-tools_f16dab96a18c4c7b886e38061d477973/setup.py", line 180, in <module>
#11 5.234           if check_linker_need_libatomic():
#11 5.234              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#11 5.234         File "/home/model-server/tmp/pip-install-hni50hgy/grpcio-tools_f16dab96a18c4c7b886e38061d477973/setup.py", line 91, in check_linker_need_libatomic
#11 5.234           cpp_test = subprocess.Popen([cxx, '-x', 'c++', '-std=c++14', '-'],
#11 5.234                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#11 5.234         File "/usr/lib/python3.11/subprocess.py", line 1026, in __init__
#11 5.234           self._execute_child(args, executable, preexec_fn, close_fds,
#11 5.234         File "/usr/lib/python3.11/subprocess.py", line 1950, in _execute_child
#11 5.234           raise child_exception_type(errno_num, err_msg, err_filename)
#11 5.234       FileNotFoundError: [Errno 2] No such file or directory: 'c++'
#11 5.234       [end of output]
#11 5.234   
#11 5.234   note: This error originates from a subprocess, and is likely not a problem with pip.
#11 5.236 error: metadata-generation-failed
#11 5.236 
#11 5.236 × Encountered error while generating package metadata.
#11 5.236 ╰─> See above for output.
#11 5.236 
#11 5.236 note: This is an issue with the package mentioned above, not pip.
#11 5.236 hint: See above for details.
------
executor failed running [/bin/bash -c pip install -r requirements.txt]: exit code: 1
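The FileNotFoundError for 'c++' above means the build stage has no C++ compiler on PATH: ubuntu:23.10 ships a Python version for which grpcio-tools 1.48.2 likely has no prebuilt wheel, so pip falls back to a source build. A possible fix is a Dockerfile fragment like the following sketch (package name assumed for Debian/Ubuntu bases; this is not the repo's actual Dockerfile):

```dockerfile
# Sketch: install a C++ toolchain before `pip install -r requirements.txt`
# so grpcio-tools can build from source when no wheel matches the Python version.
RUN apt-get update \
    && apt-get install -y --no-install-recommends g++ \
    && rm -rf /var/lib/apt/lists/*
```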

But I am more interested in the output with ubuntu:22.04, which fails during deployment:

$ kubectl logs vqi-predictor-00001-deployment-8f6cd7bd7-9hl84
Defaulted container "kserve-container" out of: kserve-container, queue-proxy, storage-initializer (init)
exec /usr/local/bin/dockerd-entrypoint.sh: exec format error
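One likely cause of this symptom (my assumption, not confirmed in the thread) is that the image and the node disagree on CPU architecture, since the same entrypoint execs fine elsewhere. A sketch of the comparison, normalizing the aliases that docker and kubectl report differently:

```shell
#!/bin/sh
# Sketch: an image built for a different CPU architecture than the node
# produces exactly this "exec format error" at exec time.
# The two values to compare would come from (commands assumed, not run here):
#   docker image inspect pytorch/torchserve-kfs:0.10.0 --format '{{.Architecture}}'
#   kubectl get nodes -o jsonpath='{.items[0].status.nodeInfo.architecture}'
arch_match() {
  norm() { case "$1" in x86_64) echo amd64 ;; aarch64) echo arm64 ;; *) echo "$1" ;; esac; }
  [ "$(norm "$1")" = "$(norm "$2")" ]
}
```

For example, arch_match amd64 x86_64 succeeds (same architecture, different alias), while arch_match amd64 arm64 fails and would explain the CrashLoopBackOff.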

Installation instructions

Install TorchServe from source? No
Are you using Docker? Yes

Model Packaging

n/a

config.properties

n/a

Versions

With ubuntu:22.04 as base

$ python ts_scripts/print_env_info.py 
------------------------------------------------------------------------------------------
Environment headers
------------------------------------------------------------------------------------------
Torchserve branch: 

torchserve==0.10.0
torch-model-archiver==0.10.0

Python version: 3.10 (64-bit runtime)
Python executable: /home/venv/bin/python

Versions of relevant python libraries:
captum==0.7.0
intel-extension-for-pytorch==2.2.0+cpu
numpy==1.26.4
pillow==10.2.0
psutil==5.9.8
requests==2.31.0
requests-oauthlib==1.4.0
torch==2.2.0+cpu
torch-model-archiver==0.10.0
torch-workflow-archiver==0.2.12
torchaudio==2.2.0+cpu
torchdata==0.7.1
torchserve==0.10.0
torchtext==0.17.0+cpu
torchvision==0.17.0+cpu
transformers==4.38.2
wheel==0.43.0
torch==2.2.0+cpu
torchtext==0.17.0+cpu
torchvision==0.17.0+cpu
torchaudio==2.2.0+cpu

Java Version:

OS: N/A
GCC version: N/A
Clang version: N/A
CMake version: N/A

Environment:
library_path (LD_/DYLD_):

With ubuntu:20.04 as base

$ python ts_scripts/print_env_info.py 
------------------------------------------------------------------------------------------
Environment headers
------------------------------------------------------------------------------------------
Torchserve branch: 

torchserve==0.10.0
torch-model-archiver==0.10.0

Python version: 3.8 (64-bit runtime)
Python executable: /home/venv/bin/python

Versions of relevant python libraries:
captum==0.7.0
intel-extension-for-pytorch==2.2.0+cpu
numpy==1.24.4
pillow==10.2.0
psutil==5.9.8
requests==2.31.0
requests-oauthlib==1.4.0
torch==2.2.0+cpu
torch-model-archiver==0.10.0
torch-workflow-archiver==0.2.12
torchaudio==2.2.0+cpu
torchdata==0.7.1
torchserve==0.10.0
torchtext==0.17.0+cpu
torchvision==0.17.0+cpu
transformers==4.38.2
wheel==0.43.0
torch==2.2.0+cpu
torchtext==0.17.0+cpu
torchvision==0.17.0+cpu
torchaudio==2.2.0+cpu

Java Version:

OS: N/A
GCC version: N/A
Clang version: N/A
CMake version: N/A

Environment:
library_path (LD_/DYLD_):

Repro instructions

From https://github.com/intel/ai-containers,

  1. Clone the Repository
  2. Install docker-compose (see main README.md)
  3. Build the Intel TorchServe container:
    export REGISTRY=intel
    export REPO=aiops/mlops-ci
    cd pytorch
    docker compose up --build torchserve
  4. Setup KServe build
    1. Comment out these lines https://github.com/intel/ai-containers/blob/main/pytorch/serving/build-kfs.sh#L4-L5
    2. docker tag intel/aiops/mlops-ci:b-0-ubuntu-22.04-pip-py3.10-torchserve intel/torchserve:latest
  5. Build KServe Container
    cd serving
    ./build-kfs.sh
  6. Push to Internal Registry
  7. Modify ClusterServingRuntime kserve-torchserve to use the new image
  8. Deploy any example Endpoint

Possible Solution

No response

tylertitsworth commented 5 months ago

Before it gets asked: yes, I have tried to capture logs from within the deployed container; however, the container does not even start, so no other logs are recorded (other than the liveness probe and queue-proxy failures and so on).

agunapal commented 5 months ago

Thanks for reporting; looking into this. I was able to repro the error. Earlier we didn't move to 22.04 because the ubuntu 22.04 runners were flaky. I will try running CI on 22.04 to see if it's resolved now.

agunapal commented 5 months ago

@tylertitsworth Please pull the submodules before you build kfs image

git submodule update --init --recursive

I am able to build it with 22.04 after doing this

$ docker image inspect pytorch/torchserve-kfs:latest-cpu | grep "org.opencontainers.image.version"
                "org.opencontainers.image.version": "22.04"
tylertitsworth commented 5 months ago

@agunapal In the build script I use to build this container, I already pull submodules (https://github.com/intel/ai-containers/blob/main/pytorch/serving/build-kfs.sh#L9)

I am able to build the container, however, my issue is when it is deployed to k8s.

tylertitsworth commented 5 months ago

@agunapal any update on this? Is there any misunderstanding I can help alleviate?

agunapal commented 5 months ago

Hi @tylertitsworth I understand the problem. I will get back to you this week.

agunapal commented 5 months ago

On ubuntu 22.04, I tried running the gRPC test cases; these worked:

test_gRPC_inference_api.py::test_inference_apis PASSED                                                                                                                             [ 21%]
test_gRPC_inference_api.py::test_inference_stream_apis 2024-04-06T18:20:11,945 [INFO ] W-9024-echo_stream_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9024-echo_stream_1.0-stderr
PASSED                                                                                                                      [ 21%]
test_gRPC_inference_api.py::test_inference_stream2_apis PASSED                                                                                                                     [ 22%]
test_gRPC_management_apis.py::test_management_apis PASSED        

So it may be something specific to docker/kserve. I will try the steps you have mentioned.

tylertitsworth commented 1 week ago

This issue has been remediated with the latest version of torchserve.