pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

torchserve worker died when running in AWS ECS but not on local #2362

Closed (rbavery closed this issue 1 year ago)

rbavery commented 1 year ago

🐛 Describe the bug

I'm running https://github.com/developmentseed/segment-anything-services and have gotten to the point where I can run both the encoder model and the decoder model locally without issues. However, when I deploy them on AWS Elastic Container Service (ECS), I get inexplicable Worker Died errors. I'm using a p2.xlarge.

The first thing I'd like to check: is there a limitation on which GPU types TorchServe supports? Is a p2.xlarge too old at this point?
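
As a rough sanity check (not an official compatibility statement), something like the following can be run inside the container to see whether the installed PyTorch build actually ships kernels for the K80 (Kepler, compute capability 3.7, i.e. sm_37):

import torch

# Check whether the GPU's compute capability is among the architectures
# this PyTorch build was compiled for. A Tesla K80 reports (3, 7) -> sm_37.
if not torch.cuda.is_available():
    print("CUDA is not available in this environment")
else:
    major, minor = torch.cuda.get_device_capability(0)
    arch = f"sm_{major}{minor}"
    built_for = torch.cuda.get_arch_list()  # e.g. ['sm_50', 'sm_60', ...]
    print(f"device: {torch.cuda.get_device_name(0)}, capability: {arch}")
    print(f"binary built for: {built_for}")
    if arch not in built_for:
        print("warning: this PyTorch build has no kernels for this GPU")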

Error logs

2023-05-22T18:16:17.120-07:00 | 2023-05-23T01:16:17,120 [INFO ] pool-2-thread-10 ACCESS_LOG - /172.31.37.33:18706 "GET /ping HTTP/1.1" 200 0
  | 2023-05-22T18:16:17.120-07:00 | 2023-05-23T01:16:17,120 [INFO ] pool-2-thread-10 TS_METRICS - Requests2XX.Count:1.0\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804577
  | 2023-05-22T18:16:24.466-07:00 | 2023-05-23T01:16:24,466 [INFO ] pool-3-thread-2 TS_METRICS - CPUUtilization.Percent:0.0\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804584
  | 2023-05-22T18:16:24.466-07:00 | 2023-05-23T01:16:24,466 [INFO ] pool-3-thread-2 TS_METRICS - DiskAvailable.Gigabytes:9.436691284179688\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804584
  | 2023-05-22T18:16:24.466-07:00 | 2023-05-23T01:16:24,466 [INFO ] pool-3-thread-2 TS_METRICS - DiskUsage.Gigabytes:20.551570892333984\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804584
  | 2023-05-22T18:16:24.466-07:00 | 2023-05-23T01:16:24,466 [INFO ] pool-3-thread-2 TS_METRICS - DiskUtilization.Percent:68.5\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804584
  | 2023-05-22T18:16:24.466-07:00 | 2023-05-23T01:16:24,466 [INFO ] pool-3-thread-2 TS_METRICS - GPUMemoryUtilization.Percent:0.0\|#Level:Host,DeviceId:0\|#hostname:34e4fda3054d,timestamp:1684804584
  | 2023-05-22T18:16:24.466-07:00 | 2023-05-23T01:16:24,466 [INFO ] pool-3-thread-2 TS_METRICS - GPUMemoryUsed.Megabytes:0.0\|#Level:Host,DeviceId:0\|#hostname:34e4fda3054d,timestamp:1684804584
  | 2023-05-22T18:16:24.466-07:00 | 2023-05-23T01:16:24,466 [INFO ] pool-3-thread-2 TS_METRICS - GPUUtilization.Percent:98.0\|#Level:Host,DeviceId:0\|#hostname:34e4fda3054d,timestamp:1684804584
  | 2023-05-22T18:16:24.467-07:00 | 2023-05-23T01:16:24,466 [INFO ] pool-3-thread-2 TS_METRICS - MemoryAvailable.Megabytes:56896.82421875\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804584
  | 2023-05-22T18:16:24.467-07:00 | 2023-05-23T01:16:24,466 [INFO ] pool-3-thread-2 TS_METRICS - MemoryUsed.Megabytes:3800.109375\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804584
  | 2023-05-22T18:16:24.467-07:00 | 2023-05-23T01:16:24,466 [INFO ] pool-3-thread-2 TS_METRICS - MemoryUtilization.Percent:7.3\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804584
  | 2023-05-22T18:16:26.419-07:00 | 2023-05-23T01:16:26,419 [INFO ] pool-2-thread-10 ACCESS_LOG - /172.31.7.41:19036 "GET /ping HTTP/1.1" 200 0
  | 2023-05-22T18:16:26.419-07:00 | 2023-05-23T01:16:26,419 [INFO ] pool-2-thread-10 TS_METRICS - Requests2XX.Count:1.0\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804586
  | 2023-05-22T18:16:47.150-07:00 | 2023-05-23T01:16:47,150 [INFO ] pool-2-thread-10 ACCESS_LOG - /172.31.37.33:40450 "GET /ping HTTP/1.1" 200 0
  | 2023-05-22T18:16:47.150-07:00 | 2023-05-23T01:16:47,150 [INFO ] pool-2-thread-10 TS_METRICS - Requests2XX.Count:1.0\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804607
  | 2023-05-22T18:16:56.450-07:00 | 2023-05-23T01:16:56,450 [INFO ] pool-2-thread-10 ACCESS_LOG - /172.31.7.41:30100 "GET /ping HTTP/1.1" 200 0
  | 2023-05-22T18:16:56.450-07:00 | 2023-05-23T01:16:56,450 [INFO ] pool-2-thread-10 TS_METRICS - Requests2XX.Count:1.0\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804616
  | 2023-05-22T18:17:17.165-07:00 | 2023-05-23T01:17:17,165 [INFO ] pool-2-thread-10 ACCESS_LOG - /172.31.37.33:43980 "GET /ping HTTP/1.1" 200 0
  | 2023-05-22T18:17:17.165-07:00 | 2023-05-23T01:17:17,165 [INFO ] pool-2-thread-10 TS_METRICS - Requests2XX.Count:1.0\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804637
  | 2023-05-22T18:17:23.546-07:00 | 2023-05-23T01:17:23,546 [INFO ] epollEventLoopGroup-3-1 TS_METRICS - ts_inference_requests_total.Count:1.0\|#model_name:sam_vit_h_encode,model_version:default\|#hostname:34e4fda3054d,timestamp:1684804643
  | 2023-05-22T18:17:23.547-07:00 | 2023-05-23T01:17:23,547 [INFO ] W-9000-sam_vit_h_encode_1.0.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd PREDICT to backend at: 1684804643546
  | 2023-05-22T18:17:23.552-07:00 | 2023-05-23T01:17:23,550 [INFO ] W-9000-sam_vit_h_encode_1.0.0-stdout MODEL_LOG - Backend received inference at: 1684804643
  | 2023-05-22T18:17:23.552-07:00 | 2023-05-23T01:17:23,552 [INFO ] W-9000-sam_vit_h_encode_1.0.0-stdout MODEL_LOG - trying to open image
  | 2023-05-22T18:17:23.557-07:00 | 2023-05-23T01:17:23,556 [INFO ] W-9000-sam_vit_h_encode_1.0.0-stdout MODEL_LOG - trying to convert image
  | 2023-05-22T18:17:23.560-07:00 | 2023-05-23T01:17:23,560 [INFO ] W-9000-sam_vit_h_encode_1.0.0-stdout MODEL_LOG - XXXXX Preprocess time and shape (512, 512, 3): 0.009874582290649414
  | 2023-05-22T18:17:24.500-07:00 | 2023-05-23T01:17:24,500 [INFO ] pool-3-thread-2 TS_METRICS - CPUUtilization.Percent:0.0\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804644
  | 2023-05-22T18:17:24.500-07:00 | 2023-05-23T01:17:24,500 [INFO ] pool-3-thread-2 TS_METRICS - DiskAvailable.Gigabytes:9.434722900390625\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804644
  | 2023-05-22T18:17:24.500-07:00 | 2023-05-23T01:17:24,500 [INFO ] pool-3-thread-2 TS_METRICS - DiskUsage.Gigabytes:20.553539276123047\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804644
  | 2023-05-22T18:17:24.500-07:00 | 2023-05-23T01:17:24,500 [INFO ] pool-3-thread-2 TS_METRICS - DiskUtilization.Percent:68.5\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804644
  | 2023-05-22T18:17:24.500-07:00 | 2023-05-23T01:17:24,500 [INFO ] pool-3-thread-2 TS_METRICS - GPUMemoryUtilization.Percent:0.0\|#Level:Host,DeviceId:0\|#hostname:34e4fda3054d,timestamp:1684804644
  | 2023-05-22T18:17:24.500-07:00 | 2023-05-23T01:17:24,500 [INFO ] pool-3-thread-2 TS_METRICS - GPUMemoryUsed.Megabytes:0.0\|#Level:Host,DeviceId:0\|#hostname:34e4fda3054d,timestamp:1684804644
  | 2023-05-22T18:17:24.500-07:00 | 2023-05-23T01:17:24,500 [INFO ] pool-3-thread-2 TS_METRICS - GPUUtilization.Percent:99.0\|#Level:Host,DeviceId:0\|#hostname:34e4fda3054d,timestamp:1684804644
  | 2023-05-22T18:17:24.500-07:00 | 2023-05-23T01:17:24,500 [INFO ] pool-3-thread-2 TS_METRICS - MemoryAvailable.Megabytes:56617.37890625\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804644
  | 2023-05-22T18:17:24.500-07:00 | 2023-05-23T01:17:24,500 [INFO ] pool-3-thread-2 TS_METRICS - MemoryUsed.Megabytes:4079.5546875\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804644
  | 2023-05-22T18:17:24.500-07:00 | 2023-05-23T01:17:24,500 [INFO ] pool-3-thread-2 TS_METRICS - MemoryUtilization.Percent:7.8\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804644
  | 2023-05-22T18:17:26.461-07:00 | 2023-05-23T01:17:26,461 [INFO ] pool-2-thread-10 ACCESS_LOG - /172.31.7.41:56068 "GET /ping HTTP/1.1" 200 0
  | 2023-05-22T18:17:26.461-07:00 | 2023-05-23T01:17:26,461 [INFO ] pool-2-thread-10 TS_METRICS - Requests2XX.Count:1.0\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804646
  | 2023-05-22T18:17:30.702-07:00 | 2023-05-23T01:17:30,702 [INFO ] epollEventLoopGroup-3-1 ACCESS_LOG - /172.31.37.33:58242 "GET / HTTP/1.1" 405 0
  | 2023-05-22T18:17:30.702-07:00 | 2023-05-23T01:17:30,702 [INFO ] epollEventLoopGroup-3-1 TS_METRICS - Requests4XX.Count:1.0\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804650
  | 2023-05-22T18:17:30.924-07:00 | 2023-05-23T01:17:30,924 [INFO ] epollEventLoopGroup-3-1 ACCESS_LOG - /172.31.37.33:58258 "GET /favicon.ico HTTP/1.1" 404 1
  | 2023-05-22T18:17:30.924-07:00 | 2023-05-23T01:17:30,924 [INFO ] epollEventLoopGroup-3-1 TS_METRICS - Requests4XX.Count:1.0\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804650
  | 2023-05-22T18:17:35.266-07:00 | 2023-05-23T01:17:35,265 [INFO ] pool-2-thread-10 ACCESS_LOG - /172.31.37.33:27304 "GET /ping HTTP/1.1" 200 0
  | 2023-05-22T18:17:35.266-07:00 | 2023-05-23T01:17:35,265 [INFO ] pool-2-thread-10 TS_METRICS - Requests2XX.Count:1.0\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804655
  | 2023-05-22T18:17:47.195-07:00 | 2023-05-23T01:17:47,191 [INFO ] pool-2-thread-10 ACCESS_LOG - /172.31.37.33:5148 "GET /ping HTTP/1.1" 200 0
  | 2023-05-22T18:17:47.195-07:00 | 2023-05-23T01:17:47,191 [INFO ] pool-2-thread-10 TS_METRICS - Requests2XX.Count:1.0\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804667
  | 2023-05-22T18:17:56.474-07:00 | 2023-05-23T01:17:56,474 [INFO ] pool-2-thread-10 ACCESS_LOG - /172.31.7.41:10744 "GET /ping HTTP/1.1" 200 0
  | 2023-05-22T18:17:56.474-07:00 | 2023-05-23T01:17:56,474 [INFO ] pool-2-thread-10 TS_METRICS - Requests2XX.Count:1.0\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804676
  | 2023-05-22T18:18:17.222-07:00 | 2023-05-23T01:18:17,221 [INFO ] pool-2-thread-10 ACCESS_LOG - /172.31.37.33:39842 "GET /ping HTTP/1.1" 200 0
  | 2023-05-22T18:18:17.222-07:00 | 2023-05-23T01:18:17,221 [INFO ] pool-2-thread-10 TS_METRICS - Requests2XX.Count:1.0\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804697
  | 2023-05-22T18:18:25.308-07:00 | 2023-05-23T01:18:25,307 [INFO ] pool-3-thread-2 TS_METRICS - CPUUtilization.Percent:0.0\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804705
  | 2023-05-22T18:18:25.308-07:00 | 2023-05-23T01:18:25,308 [INFO ] pool-3-thread-2 TS_METRICS - DiskAvailable.Gigabytes:9.438671112060547\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804705
  | 2023-05-22T18:18:25.308-07:00 | 2023-05-23T01:18:25,308 [INFO ] pool-3-thread-2 TS_METRICS - DiskUsage.Gigabytes:20.549591064453125\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804705
  | 2023-05-22T18:18:25.308-07:00 | 2023-05-23T01:18:25,308 [INFO ] pool-3-thread-2 TS_METRICS - DiskUtilization.Percent:68.5\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804705
  | 2023-05-22T18:18:25.308-07:00 | 2023-05-23T01:18:25,308 [INFO ] pool-3-thread-2 TS_METRICS - GPUMemoryUtilization.Percent:0.0\|#Level:Host,DeviceId:0\|#hostname:34e4fda3054d,timestamp:1684804705
  | 2023-05-22T18:18:25.308-07:00 | 2023-05-23T01:18:25,308 [INFO ] pool-3-thread-2 TS_METRICS - GPUMemoryUsed.Megabytes:0.0\|#Level:Host,DeviceId:0\|#hostname:34e4fda3054d,timestamp:1684804705
  | 2023-05-22T18:18:25.309-07:00 | 2023-05-23T01:18:25,308 [INFO ] pool-3-thread-2 TS_METRICS - GPUUtilization.Percent:100.0\|#Level:Host,DeviceId:0\|#hostname:34e4fda3054d,timestamp:1684804705
  | 2023-05-22T18:18:25.309-07:00 | 2023-05-23T01:18:25,308 [INFO ] pool-3-thread-2 TS_METRICS - MemoryAvailable.Megabytes:56396.15234375\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804705
  | 2023-05-22T18:18:25.309-07:00 | 2023-05-23T01:18:25,308 [INFO ] pool-3-thread-2 TS_METRICS - MemoryUsed.Megabytes:4300.78125\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804705
  | 2023-05-22T18:18:25.309-07:00 | 2023-05-23T01:18:25,308 [INFO ] pool-3-thread-2 TS_METRICS - MemoryUtilization.Percent:8.1\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804705
  | 2023-05-22T18:18:26.505-07:00 | 2023-05-23T01:18:26,504 [INFO ] pool-2-thread-10 ACCESS_LOG - /172.31.7.41:38774 "GET /ping HTTP/1.1" 200 0
  | 2023-05-22T18:18:26.505-07:00 | 2023-05-23T01:18:26,505 [INFO ] pool-2-thread-10 TS_METRICS - Requests2XX.Count:1.0\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804706
  | 2023-05-22T18:18:47.245-07:00 | 2023-05-23T01:18:47,245 [INFO ] pool-2-thread-10 ACCESS_LOG - /172.31.37.33:47348 "GET /ping HTTP/1.1" 200 0
  | 2023-05-22T18:18:47.245-07:00 | 2023-05-23T01:18:47,245 [INFO ] pool-2-thread-10 TS_METRICS - Requests2XX.Count:1.0\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804727
  | 2023-05-22T18:18:56.513-07:00 | 2023-05-23T01:18:56,513 [INFO ] pool-2-thread-10 ACCESS_LOG - /172.31.7.41:50932 "GET /ping HTTP/1.1" 200 0
  | 2023-05-22T18:18:56.513-07:00 | 2023-05-23T01:18:56,513 [INFO ] pool-2-thread-10 TS_METRICS - Requests2XX.Count:1.0\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804736
  | 2023-05-22T18:19:17.290-07:00 | 2023-05-23T01:19:17,289 [INFO ] pool-2-thread-10 ACCESS_LOG - /172.31.37.33:32304 "GET /ping HTTP/1.1" 200 0
  | 2023-05-22T18:19:17.290-07:00 | 2023-05-23T01:19:17,289 [INFO ] pool-2-thread-10 TS_METRICS - Requests2XX.Count:1.0\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804757
  | 2023-05-22T18:19:23.550-07:00 | 2023-05-23T01:19:23,550 [ERROR] W-9000-sam_vit_h_encode_1.0.0 org.pytorch.serve.wlm.WorkerThread - Number or consecutive unsuccessful inference 1
  | 2023-05-22T18:19:23.601-07:00 | 2023-05-23T01:19:23,550 [ERROR] W-9000-sam_vit_h_encode_1.0.0 org.pytorch.serve.wlm.WorkerThread - Backend worker error
  | 2023-05-22T18:19:23.601-07:00 | org.pytorch.serve.wlm.WorkerInitializationException: Backend worker did not respond in given time
  | 2023-05-22T18:19:23.601-07:00 | at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:224) [model-server.jar:?]
  | 2023-05-22T18:19:23.601-07:00 | at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) [?:?]
  | 2023-05-22T18:19:23.601-07:00 | at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
  | 2023-05-22T18:19:23.601-07:00 | at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
  | 2023-05-22T18:19:23.601-07:00 | at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
  | 2023-05-22T18:19:23.601-07:00 | at java.lang.Thread.run(Thread.java:833) [?:?]
  | 2023-05-22T18:19:23.602-07:00 | 2023-05-23T01:19:23,601 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_MODEL_LOADED
  | 2023-05-22T18:19:23.603-07:00 | 2023-05-23T01:19:23,603 [INFO ] W-9000-sam_vit_h_encode_1.0.0 ACCESS_LOG - /172.31.7.41:56056 "POST /predictions/sam_vit_h_encode HTTP/1.1" 500 120060
  | 2023-05-22T18:19:23.603-07:00 | 2023-05-23T01:19:23,603 [INFO ] W-9000-sam_vit_h_encode_1.0.0 TS_METRICS - Requests5XX.Count:1.0\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804763
  | 2023-05-22T18:19:23.604-07:00 | 2023-05-23T01:19:23,604 [DEBUG] W-9000-sam_vit_h_encode_1.0.0 org.pytorch.serve.job.Job - Waiting time ns: 251810, Inference time ns: 120057797759
  | 2023-05-22T18:19:23.605-07:00 | 2023-05-23T01:19:23,604 [DEBUG] W-9000-sam_vit_h_encode_1.0.0 org.pytorch.serve.wlm.WorkerThread - W-9000-sam_vit_h_encode_1.0.0 State change WORKER_MODEL_LOADED -> WORKER_STOPPED
  | 2023-05-22T18:19:23.605-07:00 | 2023-05-23T01:19:23,605 [INFO ] W-9000-sam_vit_h_encode_1.0.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery start timestamp: 1684804763605
  | 2023-05-22T18:19:23.606-07:00 | 2023-05-23T01:19:23,606 [WARN ] W-9000-sam_vit_h_encode_1.0.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-sam_vit_h_encode_1.0.0-stderr
  | 2023-05-22T18:19:23.606-07:00 | 2023-05-23T01:19:23,606 [WARN ] W-9000-sam_vit_h_encode_1.0.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-sam_vit_h_encode_1.0.0-stdout
  | 2023-05-22T18:19:23.607-07:00 | 2023-05-23T01:19:23,607 [INFO ] W-9000-sam_vit_h_encode_1.0.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.
  | 2023-05-22T18:19:23.853-07:00 | 2023-05-23T01:19:23,852 [INFO ] W-9000-sam_vit_h_encode_1.0.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-sam_vit_h_encode_1.0.0-stdout
  | 2023-05-22T18:19:23.853-07:00 | 2023-05-23T01:19:23,852 [INFO ] W-9000-sam_vit_h_encode_1.0.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-sam_vit_h_encode_1.0.0-stderr
  | 2023-05-22T18:19:24.608-07:00 | 2023-05-23T01:19:24,607 [DEBUG] W-9000-sam_vit_h_encode_1.0.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/volume/tmp/.ts.sock.9000, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
  | 2023-05-22T18:19:25.491-07:00 | 2023-05-23T01:19:25,491 [INFO ] pool-3-thread-2 TS_METRICS - CPUUtilization.Percent:100.0\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804765
  | 2023-05-22T18:19:25.491-07:00 | 2023-05-23T01:19:25,491 [INFO ] pool-3-thread-2 TS_METRICS - DiskAvailable.Gigabytes:9.43661880493164\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804765
  | 2023-05-22T18:19:25.491-07:00 | 2023-05-23T01:19:25,491 [INFO ] pool-3-thread-2 TS_METRICS - DiskUsage.Gigabytes:20.55164337158203\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804765
  | 2023-05-22T18:19:25.491-07:00 | 2023-05-23T01:19:25,491 [INFO ] pool-3-thread-2 TS_METRICS - DiskUtilization.Percent:68.5\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804765
  | 2023-05-22T18:19:25.491-07:00 | 2023-05-23T01:19:25,491 [INFO ] pool-3-thread-2 TS_METRICS - GPUMemoryUtilization.Percent:0.0\|#Level:Host,DeviceId:0\|#hostname:34e4fda3054d,timestamp:1684804765
  | 2023-05-22T18:19:25.491-07:00 | 2023-05-23T01:19:25,491 [INFO ] pool-3-thread-2 TS_METRICS - GPUMemoryUsed.Megabytes:0.0\|#Level:Host,DeviceId:0\|#hostname:34e4fda3054d,timestamp:1684804765
  | 2023-05-22T18:19:25.491-07:00 | 2023-05-23T01:19:25,491 [INFO ] pool-3-thread-2 TS_METRICS - GPUUtilization.Percent:97.0\|#Level:Host,DeviceId:0\|#hostname:34e4fda3054d,timestamp:1684804765
  | 2023-05-22T18:19:25.491-07:00 | 2023-05-23T01:19:25,491 [INFO ] pool-3-thread-2 TS_METRICS - MemoryAvailable.Megabytes:59453.1796875\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804765
  | 2023-05-22T18:19:25.492-07:00 | 2023-05-23T01:19:25,491 [INFO ] pool-3-thread-2 TS_METRICS - MemoryUsed.Megabytes:1243.75390625\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804765
  | 2023-05-22T18:19:25.492-07:00 | 2023-05-23T01:19:25,491 [INFO ] pool-3-thread-2 TS_METRICS - MemoryUtilization.Percent:3.1\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804765
  | 2023-05-22T18:19:26.508-07:00 | 2023-05-23T01:19:26,508 [INFO ] W-9000-sam_vit_h_encode_1.0.0-stdout MODEL_LOG - s_name_part0=/home/model-server/volume/tmp/.ts.sock, s_name_part1=9000, pid=2809
  | 2023-05-22T18:19:26.508-07:00 | 2023-05-23T01:19:26,508 [INFO ] W-9000-sam_vit_h_encode_1.0.0-stdout MODEL_LOG - Listening on port: /home/model-server/volume/tmp/.ts.sock.9000
  | 2023-05-22T18:19:26.519-07:00 | 2023-05-23T01:19:26,519 [INFO ] W-9000-sam_vit_h_encode_1.0.0-stdout MODEL_LOG - Successfully loaded /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml.
  | 2023-05-22T18:19:26.519-07:00 | 2023-05-23T01:19:26,519 [INFO ] W-9000-sam_vit_h_encode_1.0.0-stdout MODEL_LOG - [PID]2809
  | 2023-05-22T18:19:26.520-07:00 | 2023-05-23T01:19:26,519 [INFO ] W-9000-sam_vit_h_encode_1.0.0-stdout MODEL_LOG - Torch worker started.
  | 2023-05-22T18:19:26.520-07:00 | 2023-05-23T01:19:26,519 [INFO ] W-9000-sam_vit_h_encode_1.0.0-stdout MODEL_LOG - Python runtime: 3.9.16
  | 2023-05-22T18:19:26.520-07:00 | 2023-05-23T01:19:26,520 [DEBUG] W-9000-sam_vit_h_encode_1.0.0 org.pytorch.serve.wlm.WorkerThread - W-9000-sam_vit_h_encode_1.0.0 State change WORKER_STOPPED -> WORKER_STARTED
  | 2023-05-22T18:19:26.520-07:00 | 2023-05-23T01:19:26,520 [INFO ] W-9000-sam_vit_h_encode_1.0.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/volume/tmp/.ts.sock.9000
  | 2023-05-22T18:19:26.523-07:00 | 2023-05-23T01:19:26,522 [INFO ] W-9000-sam_vit_h_encode_1.0.0-stdout MODEL_LOG - Connection accepted: /home/model-server/volume/tmp/.ts.sock.9000.
  | 2023-05-22T18:19:26.524-07:00 | 2023-05-23T01:19:26,523 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
  | 2023-05-22T18:19:26.524-07:00 | 2023-05-23T01:19:26,523 [DEBUG] W-9000-sam_vit_h_encode_1.0.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
  | 2023-05-22T18:19:26.524-07:00 | 2023-05-23T01:19:26,523 [DEBUG] W-9000-sam_vit_h_encode_1.0.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
  | 2023-05-22T18:19:26.524-07:00 | java.lang.InterruptedException: null
  | 2023-05-22T18:19:26.524-07:00 | at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1679) ~[?:?]
  | 2023-05-22T18:19:26.524-07:00 | at java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:515) ~[?:?]
  | 2023-05-22T18:19:26.524-07:00 | at java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:677) ~[?:?]
  | 2023-05-22T18:19:26.524-07:00 | at org.pytorch.serve.wlm.Model.pollBatch(Model.java:260) ~[model-server.jar:?]
  | 2023-05-22T18:19:26.524-07:00 | at org.pytorch.serve.wlm.BatchAggregator.getRequest(BatchAggregator.java:34) ~[model-server.jar:?]
  | 2023-05-22T18:19:26.524-07:00 | at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:186) [model-server.jar:?]
  | 2023-05-22T18:19:26.524-07:00 | at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
  | 2023-05-22T18:19:26.524-07:00 | at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
  | 2023-05-22T18:19:26.524-07:00 | at java.lang.Thread.run(Thread.java:833) [?:?]
  | 2023-05-22T18:19:26.527-07:00 | 2023-05-23T01:19:26,524 [DEBUG] W-9000-sam_vit_h_encode_1.0.0 org.pytorch.serve.wlm.WorkerThread - W-9000-sam_vit_h_encode_1.0.0 State change WORKER_STARTED -> WORKER_STOPPED
  | 2023-05-22T18:19:26.527-07:00 | 2023-05-23T01:19:26,524 [WARN ] W-9000-sam_vit_h_encode_1.0.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery failed again
  | 2023-05-22T18:19:26.527-07:00 | 2023-05-23T01:19:26,524 [WARN ] W-9000-sam_vit_h_encode_1.0.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-sam_vit_h_encode_1.0.0-stderr
  | 2023-05-22T18:19:26.527-07:00 | 2023-05-23T01:19:26,524 [WARN ] W-9000-sam_vit_h_encode_1.0.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-sam_vit_h_encode_1.0.0-stdout
  | 2023-05-22T18:19:26.527-07:00 | 2023-05-23T01:19:26,525 [INFO ] W-9000-sam_vit_h_encode_1.0.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.
  | 2023-05-22T18:19:26.537-07:00 | 2023-05-23T01:19:26,537 [INFO ] W-9000-sam_vit_h_encode_1.0.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-sam_vit_h_encode_1.0.0-stdout
  | 2023-05-22T18:19:26.537-07:00 | 2023-05-23T01:19:26,537 [INFO ] W-9000-sam_vit_h_encode_1.0.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-sam_vit_h_encode_1.0.0-stderr
  | 2023-05-22T18:19:26.544-07:00 | 2023-05-23T01:19:26,544 [INFO ] W-9000-sam_vit_h_encode_1.0.0 ACCESS_LOG - /172.31.7.41:23634 "GET /ping HTTP/1.1" 200 0
  | 2023-05-22T18:19:26.544-07:00 | 2023-05-23T01:19:26,544 [INFO ] W-9000-sam_vit_h_encode_1.0.0 TS_METRICS - Requests2XX.Count:1.0\|#Level:Host\|#hostname:34e4fda3054d,timestamp:1684804766
  | 2023-05-22T18:19:27.525-07:00 | 2023-05-23T01:19:27,525 [DEBUG] W-9000-sam_vit_h_encode_1.0.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/volume/tmp/.ts.sock.9000, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
  | 2023-05-22T18:19:29.208-07:00 | 2023-05-23T01:19:29,207 [INFO ] W-9000-sam_vit_h_encode_1.0.0-stdout MODEL_LOG - s_name_part0=/home/model-server/volume/tmp/.ts.sock, s_name_part1=9000, pid=2816
  | 2023-05-22T18:19:29.208-07:00 | 2023-05-23T01:19:29,208 [INFO ] W-9000-sam_vit_h_encode_1.0.0-stdout MODEL_LOG - Listening on port: /home/model-server/volume/tmp/.ts.sock.9000
  | 2023-05-22T18:19:29.219-07:00 | 2023-05-23T01:19:29,219 [INFO ] W-9000-sam_vit_h_encode_1.0.0-stdout MODEL_LOG - Successfully loaded /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml.
  | 2023-05-22T18:19:29.219-07:00 | 2023-05-23T01:19:29,219 [INFO ] W-9000-sam_vit_h_encode_1.0.0-stdout MODEL_LOG - [PID]2816
  | 2023-05-22T18:19:29.219-07:00 | 2023-05-23T01:19:29,219 [INFO ] W-9000-sam_vit_h_encode_1.0.0-stdout MODEL_LOG - Torch worker started.
  | 2023-05-22T18:19:29.219-07:00 | 2023-05-23T01:19:29,219 [DEBUG] W-9000-sam_vit_h_encode_1.0.0 org.pytorch.serve.wlm.WorkerThread - W-9000-sam_vit_h_encode_1.0.0 State change WORKER_STOPPED -> WORKER_STARTED
  | 2023-05-22T18:19:29.220-07:00 | 2023-05-23T01:19:29,219 [INFO ] W-9000-sam_vit_h_encode_1.0.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/volume/tmp/.ts.sock.9000
  | 2023-05-22T18:19:29.220-07:00 | 2023-05-23T01:19:29,220 [INFO ] W-9000-sam_vit_h_encode_1.0.0-stdout MODEL_LOG - Python runtime: 3.9.16
  | 2023-05-22T18:19:29.221-07:00 | 2023-05-23T01:19:29,221 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
  | 2023-05-22T18:19:29.221-07:00 | 2023-05-23T01:19:29,221 [INFO ] W-9000-sam_vit_h_encode_1.0.0-stdout MODEL_LOG - Connection accepted: /home/model-server/volume/tmp/.ts.sock.9000.
  | 2023-05-22T18:19:29.221-07:00 | 2023-05-23T01:19:29,221 [DEBUG] W-9000-sam_vit_h_encode_1.0.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
  | 2023-05-22T18:19:29.222-07:00 | 2023-05-23T01:19:29,221 [DEBUG] W-9000-sam_vit_h_encode_1.0.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
  | 2023-05-22T18:19:29.222-07:00 | java.lang.InterruptedException: null
  | 2023-05-22T18:19:29.222-07:00 | at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1081) ~[?:?]
  | 2023-05-22T18:19:29.222-07:00 | at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:276) ~[?:?]
  | 2023-05-22T18:19:29.222-07:00 | at org.pytorch.serve.wlm.WorkerThread.connect(WorkerThread.java:415) ~[model-server.jar:?]
  | 2023-05-22T18:19:29.222-07:00 | at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:183) [model-server.jar:?]
  | 2023-05-22T18:19:29.222-07:00 | at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
  | 2023-05-22T18:19:29.222-07:00 | at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
  | 2023-05-22T18:19:29.222-07:00 | at java.lang.Thread.run(Thread.java:833) [?:?]
  | 2023-05-22T18:19:29.222-07:00 | 2023-05-23T01:19:29,222 [DEBUG] W-9000-sam_vit_h_encode_1.0.0 org.pytorch.serve.wlm.WorkerThread - W-9000-sam_vit_h_encode_1.0.0 State change WORKER_STARTED -> WORKER_STOPPED
  | 2023-05-22T18:19:29.222-07:00 | 2023-05-23T01:19:29,222 [WARN ] W-9000-sam_vit_h_encode_1.0.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery failed again
  | 2023-05-22T18:19:29.222-07:00 | 2023-05-23T01:19:29,222 [WARN ] W-9000-sam_vit_h_encode_1.0.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-sam_vit_h_encode_1.0.0-stderr
  | 2023-05-22T18:19:29.222-07:00 | 2023-05-23T01:19:29,222 [WARN ] W-9000-sam_vit_h_encode_1.0.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-sam_vit_h_encode_1.0.0-stdout
  | 2023-05-22T18:19:29.222-07:00 | 2023-05-23T01:19:29,222 [INFO ] W-9000-sam_vit_h_encode_1.0.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 2 seconds.
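
For context, the failed request above returned a 500 after 120060 ms, which lines up with TorchServe's default response timeout of 120 seconds, so the worker may simply be getting killed because the encoder never finishes on this GPU. If the model legitimately needs more time, one option is to register it with a larger per-model response_timeout through the management API (sketch only; the .mar filename is assumed from our build step, and I believe default_response_timeout can also be set in config.properties):

import requests

# Sketch: re-register the encoder with a longer response timeout via the
# management API on port 8081 (matching config_encode.properties).
MANAGEMENT = "http://localhost:8081"

requests.delete(f"{MANAGEMENT}/models/sam_vit_h_encode")  # unregister first if already loaded
resp = requests.post(
    f"{MANAGEMENT}/models",
    params={
        "url": "sam_vit_h_encode.mar",  # must exist in the model store
        "initial_workers": 1,
        "response_timeout": 600,        # seconds; default is 120
    },
)
print(resp.status_code, resp.text)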

Installation instructions

Using Docker: https://github.com/developmentseed/segment-anything-services/blob/main/Dockerfile-gpu

Model Packaging

https://github.com/developmentseed/segment-anything-services/blob/main/Dockerfile-build

See the README for instructions on running the .mar build step and the TorchServe services: https://github.com/developmentseed/segment-anything-services

config.properties

https://github.com/developmentseed/segment-anything-services/blob/main/deployment/config_encode.properties

inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
number_of_netty_threads=1
default_workers_per_model=1
job_queue_size=1000
model_store=/home/model-server/volume/model-store
load_models=all
vmargs=-Xmx8g
max_request_size=17797905
max_response_size=17797905
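
For reference, a minimal client sketch against the inference address above (this mirrors how I test locally; the model name matches the worker in the logs, and the image path is just a placeholder):

import requests

# Send raw image bytes to the encoder endpoint; the payload must stay
# under the ~17 MB max_request_size configured above.
with open("test.png", "rb") as f:
    image_bytes = f.read()

resp = requests.post(
    "http://localhost:8080/predictions/sam_vit_h_encode",
    data=image_bytes,
    headers={"Content-Type": "application/octet-stream"},
)
print(resp.status_code)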

Versions

We're using FROM pytorch/torchserve:latest-gpu in https://github.com/developmentseed/segment-anything-services/blob/main/Dockerfile-gpu

Repro instructions

https://github.com/developmentseed/segment-anything-services

Unfortunately this only happens on AWS ECS, so reproducing it is a bit tricky. You can see the CloudFormation stack and run this yourself by hooking it up to an AWS account. Otherwise, any suggestions on what might need to be specified in the config to get this working on ECS are appreciated.

Possible Solution

I'll try different GPU instances, but I haven't seen any logs indicating that the K80 GPUs in the p2.xlarge are unsupported by torchserve:latest. Suggestions appreciated!

agunapal commented 1 year ago

I'm trying to find the compatibility matrix for this GPU. The latest TorchServe uses PyTorch 2.0, which uses CUDA 11.7/11.8. I'm not sure if it's supported on this old GPU. Will confirm.

rbavery commented 1 year ago

Looks like the K80 has the Kepler sm_37 architecture, and support for it is deprecated in CUDA 11.7 and 11.8.

I'll try out a newer architecture.

agunapal commented 1 year ago

Hi @rbavery, wondering if your issue is resolved.

rbavery commented 1 year ago

Yes, I think this was because of a mismatch between the K80 and the NVIDIA driver I was using.