pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0
4.19k stars 858 forks source link

Example model failed to register: Failed to start workers for model deeplabv3_resnet_101 version: 1.0 #1427

Closed Aprilmhx closed 2 years ago

Aprilmhx commented 2 years ago

Hello!

After I successfully installed the torchserve environment using docker, I tried the examples/image_segmenter that comes with torchserve, but I was never able to successfully register the model.

My local environment libraries are:

Your Environment

Expected Behavior

Based on the docker environment using examples/image_segmenter in torchserve to successfully register the model and get the return result of persons.jpg

Current Behavior

After registering the model with this command:

sudo curl -X POST "http://127.0.0.1:7001/models?initial_workers=2&url=fcn_resnet_101.mar"

get the Error:

{ "code": 500, "type": "InternalServerException", "message": "Failed to start workers for model fcn_resnet_101 version: 1.0" }

Possible Solution

No idea

Steps to Reproduce

  1. git clone https://github.com/pytorch/serve.git
  2. wget https://download.pytorch.org/models/fcn_resnet101_coco-7ecb50ca.pth
  3. sudo docker run --rm -it -p 7000:8080 -p 7001:8081 -v $(pwd)/serve/examples:/home/model-server/model-store pytorch/torchserve:latest
  4. sudo torch-model-archiver --model-name fcn_resnet_101 --version 1.0 --model-file image_segmenter/fcn/model.py --serialized-file image_segmenter/fcn_resnet101_coco-7ecb50ca.pth --handler image_segmenter --extra-files image_segmenter/fcn/fcn.py,image_segmenter/fcn/intermediate_layer_getter.py
  5. sudo curl -X POST "http://127.0.0.1:7001/models?initial_workers=2&url=fcn_resnet_101.mar"

Failure Logs [if any]

Unable to find image 'pytorch/torchserve:latest' locally latest: Pulling from pytorch/torchserve 284055322776: Pull complete bf7640766b3b: Pull complete d05665a60e73: Pull complete 85824628b9b8: Pull complete d93240f6b9fe: Pull complete 4f4fb700ef54: Pull complete Digest: sha256:52ce3f86274bc92aec7a73702358323724097a75cea6d60ac39cd5f445bf727e Status: Downloaded newer image for pytorch/torchserve:latest WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance. 2022-02-14T13:06:36,056 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager... 2022-02-14T13:06:36,184 [INFO ] main org.pytorch.serve.ModelServer - Torchserve version: 0.5.2 TS Home: /usr/local/lib/python3.6/dist-packages Current directory: /home/model-server Temp directory: /home/model-server/tmp Number of GPUs: 0 Number of CPUs: 8 Max heap size: 8042 M Python executable: /usr/bin/python3 Config file: /home/model-server/config.properties Inference address: http://0.0.0.0:8080 Management address: http://0.0.0.0:8081 Metrics address: http://0.0.0.0:8082 Model Store: /home/model-server/model-store Initial Models: N/A Log dir: /home/model-server/logs Metrics dir: /home/model-server/logs Netty threads: 32 Netty client threads: 0 Default workers per model: 8 Blacklist Regex: N/A Maximum Response Size: 6553500 Maximum Request Size: 6553500 Limit Maximum Image Pixels: true Prefer direct buffer: false Allowed Urls: [file://.|http(s)?://.] Custom python dependency for model allowed: false Metrics report format: prometheus Enable metrics API: true Workflow Store: /home/model-server/model-store Model config: N/A 2022-02-14T13:06:36,191 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Loading snapshot serializer plugin... 2022-02-14T13:06:36,217 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel. 2022-02-14T13:06:36,280 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://0.0.0.0:8080 2022-02-14T13:06:36,280 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: EpollServerSocketChannel. 2022-02-14T13:06:36,281 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://0.0.0.0:8081 2022-02-14T13:06:36,282 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel. 2022-02-14T13:06:36,282 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://0.0.0.0:8082 Model server started. 2022-02-14T13:06:36,564 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:732c3c9747d3,timestamp:1644843996 2022-02-14T13:06:36,565 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:12.49948501586914|#Level:Host|#hostname:732c3c9747d3,timestamp:1644843996 2022-02-14T13:06:36,565 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:30.740341186523438|#Level:Host|#hostname:732c3c9747d3,timestamp:1644843996 2022-02-14T13:06:36,566 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:71.1|#Level:Host|#hostname:732c3c9747d3,timestamp:1644843996 2022-02-14T13:06:36,566 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:29495.76171875|#Level:Host|#hostname:732c3c9747d3,timestamp:1644843996 2022-02-14T13:06:36,567 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:2243.5546875|#Level:Host|#hostname:732c3c9747d3,timestamp:1644843996 2022-02-14T13:06:36,567 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:8.3|#Level:Host|#hostname:732c3c9747d3,timestamp:1644843996 2022-02-14T13:07:36,565 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844056 2022-02-14T13:07:36,565 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:12.499473571777344|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844056 2022-02-14T13:07:36,565 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:30.740352630615234|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844056 2022-02-14T13:07:36,566 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:71.1|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844056 2022-02-14T13:07:36,566 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:29495.1875|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844056 2022-02-14T13:07:36,566 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:2246.3671875|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844056 2022-02-14T13:07:36,566 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:8.3|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844056 2022-02-14T13:08:36,560 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844116 2022-02-14T13:08:36,561 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:12.499465942382812|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844116 2022-02-14T13:08:36,561 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:30.740360260009766|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844116 2022-02-14T13:08:36,561 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:71.1|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844116 2022-02-14T13:08:36,561 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:29493.85546875|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844116 2022-02-14T13:08:36,562 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:2247.7421875|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844116 2022-02-14T13:08:36,562 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:8.3|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844116 2022-02-14T13:09:36,561 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844176 2022-02-14T13:09:36,562 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:12.499462127685547|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844176 2022-02-14T13:09:36,562 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:30.74036407470703|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844176 2022-02-14T13:09:36,566 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:71.1|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844176 2022-02-14T13:09:36,566 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:29497.00390625|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844176 2022-02-14T13:09:36,566 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:2245.2421875|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844176 2022-02-14T13:09:36,567 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:8.3|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844176 2022-02-14T13:10:36,565 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844236 2022-02-14T13:10:36,565 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:12.499454498291016|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844236 2022-02-14T13:10:36,566 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:30.740371704101562|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844236 2022-02-14T13:10:36,567 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:71.1|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844236 2022-02-14T13:10:36,567 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:29473.59765625|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844236 2022-02-14T13:10:36,568 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:2269.31640625|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844236 2022-02-14T13:10:36,569 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:8.4|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844236 2022-02-14T13:11:36,563 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844296 2022-02-14T13:11:36,564 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:12.499439239501953|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844296 2022-02-14T13:11:36,565 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:30.740386962890625|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844296 2022-02-14T13:11:36,565 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:71.1|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844296 2022-02-14T13:11:36,565 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:29473.3671875|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844296 2022-02-14T13:11:36,565 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:2269.546875|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844296 2022-02-14T13:11:36,565 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:8.4|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844296 2022-02-14T13:12:36,563 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844356 2022-02-14T13:12:36,565 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:12.499427795410156|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844356 2022-02-14T13:12:36,565 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:30.740398406982422|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844356 2022-02-14T13:12:36,565 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:71.1|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844356 2022-02-14T13:12:36,565 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:29472.01171875|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844356 2022-02-14T13:12:36,566 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:2270.90625|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844356 2022-02-14T13:12:36,566 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:8.4|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844356 2022-02-14T13:13:36,562 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844416 2022-02-14T13:13:36,562 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:12.49941635131836|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844416 2022-02-14T13:13:36,563 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:30.74040985107422|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844416 2022-02-14T13:13:36,563 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:71.1|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844416 2022-02-14T13:13:36,563 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:29471.51171875|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844416 2022-02-14T13:13:36,564 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:2271.41015625|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844416 2022-02-14T13:13:36,564 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:8.4|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844416 2022-02-14T13:14:36,561 [INFO ] pool-3-thread-2 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844476 2022-02-14T13:14:36,562 [INFO ] pool-3-thread-2 TS_METRICS - DiskAvailable.Gigabytes:12.499404907226562|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844476 2022-02-14T13:14:36,562 [INFO ] pool-3-thread-2 TS_METRICS - DiskUsage.Gigabytes:30.740421295166016|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844476 2022-02-14T13:14:36,562 [INFO ] pool-3-thread-2 TS_METRICS - DiskUtilization.Percent:71.1|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844476 2022-02-14T13:14:36,563 [INFO ] pool-3-thread-2 TS_METRICS - MemoryAvailable.Megabytes:29471.67578125|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844476 2022-02-14T13:14:36,563 [INFO ] pool-3-thread-2 TS_METRICS - MemoryUsed.Megabytes:2271.44140625|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844476 2022-02-14T13:14:36,563 [INFO ] pool-3-thread-2 TS_METRICS - MemoryUtilization.Percent:8.4|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844476 2022-02-14T13:15:36,563 [INFO ] pool-3-thread-2 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844536 2022-02-14T13:15:36,563 [INFO ] pool-3-thread-2 TS_METRICS - DiskAvailable.Gigabytes:12.499401092529297|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844536 2022-02-14T13:15:36,563 [INFO ] pool-3-thread-2 TS_METRICS - DiskUsage.Gigabytes:30.74042510986328|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844536 2022-02-14T13:15:36,563 [INFO ] pool-3-thread-2 TS_METRICS - DiskUtilization.Percent:71.1|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844536 2022-02-14T13:15:36,563 [INFO ] pool-3-thread-2 TS_METRICS - MemoryAvailable.Megabytes:29471.296875|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844536 2022-02-14T13:15:36,564 [INFO ] pool-3-thread-2 TS_METRICS - MemoryUsed.Megabytes:2271.81640625|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844536 2022-02-14T13:15:36,564 [INFO ] pool-3-thread-2 TS_METRICS - MemoryUtilization.Percent:8.4|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844536 2022-02-14T13:16:36,561 [INFO ] pool-3-thread-2 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844596 2022-02-14T13:16:36,561 [INFO ] pool-3-thread-2 TS_METRICS - DiskAvailable.Gigabytes:12.4993896484375|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844596 2022-02-14T13:16:36,561 [INFO ] pool-3-thread-2 TS_METRICS - DiskUsage.Gigabytes:30.740436553955078|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844596 2022-02-14T13:16:36,562 [INFO ] pool-3-thread-2 TS_METRICS - DiskUtilization.Percent:71.1|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844596 2022-02-14T13:16:36,562 [INFO ] pool-3-thread-2 TS_METRICS - MemoryAvailable.Megabytes:29471.9609375|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844596 2022-02-14T13:16:36,562 [INFO ] pool-3-thread-2 TS_METRICS - MemoryUsed.Megabytes:2271.15625|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844596 2022-02-14T13:16:36,562 [INFO ] pool-3-thread-2 TS_METRICS - MemoryUtilization.Percent:8.4|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844596 2022-02-14T13:17:25,704 [DEBUG] epollEventLoopGroup-3-1 org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model fcn_resnet_101 2022-02-14T13:17:25,705 [DEBUG] epollEventLoopGroup-3-1 org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model fcn_resnet_101 2022-02-14T13:17:25,705 [INFO ] epollEventLoopGroup-3-1 org.pytorch.serve.wlm.ModelManager - Model fcn_resnet_101 loaded. 2022-02-14T13:17:25,706 [DEBUG] epollEventLoopGroup-3-1 org.pytorch.serve.wlm.ModelManager - updateModel: fcn_resnet_101, count: 2 2022-02-14T13:17:25,711 [DEBUG] W-9001-fcn_resnet_101_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/usr/bin/python3, /usr/local/lib/python3.6/dist-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9001] 2022-02-14T13:17:25,711 [DEBUG] W-9000-fcn_resnet_101_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/usr/bin/python3, /usr/local/lib/python3.6/dist-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000] 2022-02-14T13:17:26,731 [INFO ] W-9001-fcn_resnet_101_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9001 2022-02-14T13:17:26,732 [INFO ] W-9001-fcn_resnet_101_1.0-stdout MODEL_LOG - [PID]107 2022-02-14T13:17:26,733 [INFO ] W-9001-fcn_resnet_101_1.0-stdout MODEL_LOG - Torch worker started. 2022-02-14T13:17:26,733 [DEBUG] W-9001-fcn_resnet_101_1.0 org.pytorch.serve.wlm.WorkerThread - W-9001-fcn_resnet_101_1.0 State change null -> WORKER_STARTED 2022-02-14T13:17:26,733 [INFO ] W-9001-fcn_resnet_101_1.0-stdout MODEL_LOG - Python runtime: 3.6.9 2022-02-14T13:17:26,733 [INFO ] W-9000-fcn_resnet_101_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000 2022-02-14T13:17:26,735 [INFO ] W-9000-fcn_resnet_101_1.0-stdout MODEL_LOG - [PID]106 2022-02-14T13:17:26,736 [INFO ] W-9000-fcn_resnet_101_1.0-stdout MODEL_LOG - Torch worker started. 2022-02-14T13:17:26,736 [DEBUG] W-9000-fcn_resnet_101_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-fcn_resnet_101_1.0 State change null -> WORKER_STARTED 2022-02-14T13:17:26,736 [INFO ] W-9000-fcn_resnet_101_1.0-stdout MODEL_LOG - Python runtime: 3.6.9 2022-02-14T13:17:26,737 [INFO ] W-9001-fcn_resnet_101_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9001 2022-02-14T13:17:26,737 [INFO ] W-9000-fcn_resnet_101_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000 2022-02-14T13:17:26,745 [INFO ] W-9001-fcn_resnet_101_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9001. 2022-02-14T13:17:26,746 [INFO ] W-9000-fcn_resnet_101_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000. 2022-02-14T13:17:26,747 [INFO ] W-9001-fcn_resnet_101_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1644844646747 2022-02-14T13:17:26,747 [INFO ] W-9000-fcn_resnet_101_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1644844646747 2022-02-14T13:17:26,766 [INFO ] W-9001-fcn_resnet_101_1.0-stdout MODEL_LOG - model_name: fcn_resnet_101, batchSize: 1 2022-02-14T13:17:26,767 [INFO ] W-9000-fcn_resnet_101_1.0-stdout MODEL_LOG - model_name: fcn_resnet_101, batchSize: 1 2022-02-14T13:17:27,090 [INFO ] W-9000-fcn_resnet_101_1.0-stdout MODEL_LOG - Generating new fontManager, this may take some time... 2022-02-14T13:17:27,091 [INFO ] W-9001-fcn_resnet_101_1.0-stdout MODEL_LOG - Generating new fontManager, this may take some time... 2022-02-14T13:17:28,206 [WARN ] W-9000-fcn_resnet_101_1.0-stderr MODEL_LOG - Downloading: "https://download.pytorch.org/models/resnet101-63fe2227.pth" to /home/model-server/.cache/torch/hub/checkpoints/resnet101-63fe2227.pth 2022-02-14T13:17:28,302 [WARN ] W-9001-fcn_resnet_101_1.0-stderr MODEL_LOG - Downloading: "https://download.pytorch.org/models/resnet101-63fe2227.pth" to /home/model-server/.cache/torch/hub/checkpoints/resnet101-63fe2227.pth 2022-02-14T13:17:36,562 [INFO ] pool-3-thread-2 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844656 2022-02-14T13:17:36,562 [INFO ] pool-3-thread-2 TS_METRICS - DiskAvailable.Gigabytes:12.29642105102539|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844656 2022-02-14T13:17:36,562 [INFO ] pool-3-thread-2 TS_METRICS - DiskUsage.Gigabytes:30.943405151367188|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844656 2022-02-14T13:17:36,562 [INFO ] pool-3-thread-2 TS_METRICS - DiskUtilization.Percent:71.6|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844656 2022-02-14T13:17:36,562 [INFO ] pool-3-thread-2 TS_METRICS - MemoryAvailable.Megabytes:28763.41796875|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844656 2022-02-14T13:17:36,562 [INFO ] pool-3-thread-2 TS_METRICS - MemoryUsed.Megabytes:2979.71875|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844656 2022-02-14T13:17:36,562 [INFO ] pool-3-thread-2 TS_METRICS - MemoryUtilization.Percent:10.6|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844656 2022-02-14T13:18:36,561 [INFO ] pool-3-thread-2 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844716 2022-02-14T13:18:36,561 [INFO ] pool-3-thread-2 TS_METRICS - DiskAvailable.Gigabytes:12.29641342163086|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844716 2022-02-14T13:18:36,561 [INFO ] pool-3-thread-2 TS_METRICS - DiskUsage.Gigabytes:30.94341278076172|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844716 2022-02-14T13:18:36,561 [INFO ] pool-3-thread-2 TS_METRICS - DiskUtilization.Percent:71.6|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844716 2022-02-14T13:18:36,562 [INFO ] pool-3-thread-2 TS_METRICS - MemoryAvailable.Megabytes:28763.65234375|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844716 2022-02-14T13:18:36,562 [INFO ] pool-3-thread-2 TS_METRICS - MemoryUsed.Megabytes:2979.48046875|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844716 2022-02-14T13:18:36,562 [INFO ] pool-3-thread-2 TS_METRICS - MemoryUtilization.Percent:10.6|#Level:Host|#hostname:732c3c9747d3,timestamp:1644844716 2022-02-14T13:19:26,766 [INFO ] W-9001-fcn_resnet_101_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 120000 2022-02-14T13:19:26,766 [ERROR] W-9001-fcn_resnet_101_1.0 org.pytorch.serve.wlm.WorkerThread - Number or consecutive unsuccessful inference 1 2022-02-14T13:19:26,767 [INFO ] W-9000-fcn_resnet_101_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 120000 2022-02-14T13:19:26,767 [ERROR] W-9000-fcn_resnet_101_1.0 org.pytorch.serve.wlm.WorkerThread - Number or consecutive unsuccessful inference 2 2022-02-14T13:19:26,767 [ERROR] W-9001-fcn_resnet_101_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker error org.pytorch.serve.wlm.WorkerInitializationException: Backend worker did not respond in given time at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:199) [model-server.jar:?] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?] at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?] at java.lang.Thread.run(Thread.java:829) [?:?] 2022-02-14T13:19:26,768 [ERROR] W-9000-fcn_resnet_101_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker error org.pytorch.serve.wlm.WorkerInitializationException: Backend worker did not respond in given time at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:199) [model-server.jar:?] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?] at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?] at java.lang.Thread.run(Thread.java:829) [?:?] 2022-02-14T13:19:26,781 [WARN ] W-9000-fcn_resnet_101_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: fcn_resnet_101, error: Worker died. 2022-02-14T13:19:26,781 [WARN ] W-9001-fcn_resnet_101_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: fcn_resnet_101, error: Worker died. 2022-02-14T13:19:26,781 [DEBUG] W-9001-fcn_resnet_101_1.0 org.pytorch.serve.wlm.WorkerThread - W-9001-fcn_resnet_101_1.0 State change WORKER_STARTED -> WORKER_STOPPED 2022-02-14T13:19:26,781 [DEBUG] W-9000-fcn_resnet_101_1.0 org.pytorch.serve.wlm.ModelVersionedRefs - Removed model: fcn_resnet_101 version: 1.0 2022-02-14T13:19:26,782 [DEBUG] W-9000-fcn_resnet_101_1.0 org.pytorch.serve.wlm.WorkerThread - W-9001-fcn_resnet_101_1.0 State change WORKER_STOPPED -> WORKER_SCALED_DOWN 2022-02-14T13:19:26,782 [WARN ] W-9001-fcn_resnet_101_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9001-fcn_resnet_101_1.0-stderr 2022-02-14T13:19:26,782 [WARN ] W-9001-fcn_resnet_101_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9001-fcn_resnet_101_1.0-stdout 2022-02-14T13:19:26,781 [INFO ] epollEventLoopGroup-5-2 org.pytorch.serve.wlm.WorkerThread - 9001 Worker disconnected. WORKER_STARTED 2022-02-14T13:19:26,782 [DEBUG] W-9001-fcn_resnet_101_1.0 org.pytorch.serve.wlm.WorkerThread - Worker terminated due to scale-down call. 2022-02-14T13:19:26,782 [WARN ] W-9000-fcn_resnet_101_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9001-fcn_resnet_101_1.0-stderr 2022-02-14T13:19:26,782 [WARN ] W-9000-fcn_resnet_101_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9001-fcn_resnet_101_1.0-stdout 2022-02-14T13:19:26,782 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED 2022-02-14T13:19:26,785 [DEBUG] W-9000-fcn_resnet_101_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-fcn_resnet_101_1.0 State change WORKER_STARTED -> WORKER_SCALED_DOWN 2022-02-14T13:19:26,785 [WARN ] W-9000-fcn_resnet_101_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-fcn_resnet_101_1.0-stderr 2022-02-14T13:19:26,785 [WARN ] W-9000-fcn_resnet_101_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-fcn_resnet_101_1.0-stdout 2022-02-14T13:19:26,798 [INFO ] W-9001-fcn_resnet_101_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9001-fcn_resnet_101_1.0-stderr 2022-02-14T13:19:26,798 [INFO ] W-9001-fcn_resnet_101_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9001-fcn_resnet_101_1.0-stdout 2022-02-14T13:19:26,806 [INFO ] W-9000-fcn_resnet_101_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-fcn_resnet_101_1.0-stderr 2022-02-14T13:19:26,806 [INFO ] W-9000-fcn_resnet_101_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-fcn_resnet_101_1.0-stdout 2022-02-14T13:19:26,820 [INFO ] W-9000-fcn_resnet_101_1.0 org.pytorch.serve.wlm.ModelManager - Model fcn_resnet_101 unregistered. 2022-02-14T13:19:26,821 [DEBUG] W-9000-fcn_resnet_101_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-fcn_resnet_101_1.0 State change WORKER_SCALED_DOWN -> WORKER_STOPPED 2022-02-14T13:19:26,821 [WARN ] W-9000-fcn_resnet_101_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-fcn_resnet_101_1.0-stderr 2022-02-14T13:19:26,821 [WARN ] W-9000-fcn_resnet_101_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-fcn_resnet_101_1.0-stdout 2022-02-14T13:19:26,821 [DEBUG] W-9000-fcn_resnet_101_1.0 org.pytorch.serve.wlm.WorkerThread - Worker terminated due to scale-down call. 2022-02-14T13:19:26,824 [INFO ] epollEventLoopGroup-3-1 ACCESS_LOG - /192.168.0.1:57418 "POST /models?initial_workers=2&url=fcn_resnet_101.mar HTTP/1.1" 500 124862

HamidShojanazeri commented 2 years ago

@Aprilmhx I am not repro the issue, can you pls try the following on your end.


(image_seg) ubuntu@ip-172-31-49-104:~$ torch-model-archiver --model-name fcn_resnet_101 --version 1.0 --model-file examples/image_segmenter/fcn/model.py --serialized-file fcn_resnet101_coco-7ecb50ca.pth --handler image_segmenter --extra-files examples/image_segmenter/fcn/fcn.py,examples/image_segmenter/fcn/intermediate_layer_getter.py
(image_seg) ubuntu@ip-172-31-49-104:~$ mv fcn_resnet_101.mar ~/model_store
(image_seg) ubuntu@ip-172-31-49-104:~$ sudo docker run --rm -it -p 7000:8080 -p 7001:8081 -v $(pwd)/model_store:/home/model-server/model-store  pytorch/torchserve:latest

(image_seg) ubuntu@ip-172-31-49-104:~$ curl -X POST "http://127.0.0.1:7001/models?initial_workers=2&url=fcn_resnet_101.mar"
{
  "status": "Model \"fcn_resnet_101\" Version: 1.0 registered with 2 initial workers"
}
Aprilmhx commented 2 years ago

@HamidShojanazeri Thank you very much for your Reply!

I couldn't run the torch-model-archiver command locally because Torchserve's local environment was not installed successfully, so I pulled the torchserve docker image directly in step 3. After that I ran the command torch-model-archiver inside the docker container and got the .mar file. This file is also stored in the local folder via the bind-mount operation. On the local path I executed the command from step 5 and got "Failed to start workers for model fcn_resnet_101 version: 1.0"

I would like to ask if this works? Thank you in Advance.

Aprilmhx commented 2 years ago

This problem has been solved by running on the GPU

ZienZhang6 commented 1 year ago

@Aprilmhx 你好,我也遇到了跟你一模一样的问题。

This problem has been solved by running on the GPU

根据你的描述,你在GPU上运行解决了这个问题,这个我不是特别理解。如果方便的话,请详细描述一下你的操作细节,谢谢。

fclesio commented 9 months ago

For other folks with the same issue: If some any reason you're running based on CPU (e.g. due to the incompatibility with your OS (in my case Ubuntu 18.04) with your CUDA driver) this error will rise.

The solution for me was to move the API to run in a machine with GPU and all drivers aligned with the OS.

Engineero commented 4 months ago

I was able to get the workers to start with

curl -vX PUT "http://localhost:8081/models/{model-name}?min_worker=1"

You can also set max_worker in the same way. I did not verify this on the example model but it seemed to work on my own project. See the management API docs for more info.

I noticed when I inspected my model that both min and max workers were set to 0. After this command I was able to see them set to what I wanted. You can inspect your model and parameters like these with

curl localhost:8081/models/{model-name}

Edit: formatting and more context.