pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

Model not loaded - MODEL_LOG - what(): std::bad_alloc - Torchserve version: 0.5.2 #1663

Closed festinais closed 2 years ago

festinais commented 2 years ago

πŸ› Describe the bug

Torchserve: Name: torch-model-archiver, Version: 0.6.0

The command to create the MAR file:

 torch-model-archiver --model-name bodypartrec --version 1.0 --serialized-file model.pt --handler model_handler.py --requirements-file requir/requirements.txt

KServe InferenceService:

 apiVersion: "serving.kserve.io/v1beta1"
 kind: "InferenceService"
 metadata:
   namespace: "kserve-test"
   name: "my-model"
 spec:
   predictor:
     serviceAccountName: kserve-sa
     pytorch:
       protocolVersion: "v2"
       storageUri: "s3://.../kserve/"
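For context, the startup log below resolves the config file to /mnt/models/config/config.properties and the model store to /mnt/models/model-store, and the snapshot names bodypartrec.mar, so the storageUri above presumably points at a bucket laid out roughly like this (the bucket path itself is elided in the report):

 s3://.../kserve/
   config/
     config.properties
   model-store/
     bodypartrec.mar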

The error when torchserve is started with kserve is as shown in the logs below: essentially, "what(): std::bad_alloc" is thrown and the worker dies. There are no clues on how to proceed. I saw that this issue is marked as a bug, but I don't know in what case it is triggered. Any help is really appreciated!

Error logs

Defaulted container "kserve-container" out of: kserve-container, queue-proxy, storage-initializer (init) WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance. /usr/local/lib/python3.8/dist-packages/ray/autoscaler/_private/cli_logger.py:57: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via pip install 'ray[default]'. Please update your install command. warnings.warn( 2022-06-01T13:52:17,164 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager... 2022-06-01T13:52:17,276 [INFO ] main org.pytorch.serve.ModelServer - Torchserve version: 0.5.2 TS Home: /usr/local/lib/python3.8/dist-packages Current directory: /home/model-server Temp directory: /home/model-server/tmp Number of GPUs: 0 Number of CPUs: 24 Max heap size: 512 M Python executable: /usr/bin/python Config file: /mnt/models/config/config.properties Inference address: http://0.0.0.0:8085 Management address: http://0.0.0.0:8085 Metrics address: http://0.0.0.0:8082 Model Store: /mnt/models/model-store Initial Models: N/A Log dir: /home/model-server/logs Metrics dir: /home/model-server/logs Netty threads: 4 Netty client threads: 0 Default workers per model: 24 Blacklist Regex: N/A Maximum Response Size: 6553500 Maximum Request Size: 6553500 Limit Maximum Image Pixels: true Prefer direct buffer: false Allowed Urls: [file://.|http(s)?://.] Custom python dependency for model allowed: true Metrics report format: prometheus Enable metrics API: true Workflow Store: /mnt/models/model-store Model config: N/A 2022-06-01T13:52:17,283 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Loading snapshot serializer plugin... 2022-06-01T13:52:17,321 [INFO ] main org.pytorch.serve.snapshot.SnapshotManager - Started restoring models from snapshot {"name":"startup.cfg","modelCount":1,"models":{"my-model":{"1.0":{"defaultVersion":true,"marName":"bodypartrec.mar","minWorkers":1,"maxWorkers":5,"batchSize":1,"responseTimeout":120, "maxBatchDelay":50}}}} 2022-06-01T13:52:17,328 [INFO ] main org.pytorch.serve.snapshot.SnapshotManager - Validating snapshot startup.cfg 2022-06-01T13:52:17,329 [INFO ] main org.pytorch.serve.snapshot.SnapshotManager - Snapshot startup.cfg validated successfully [I 220601 13:52:17 main:75] Wrapper : Model names ['my-model'], inference address http//0.0.0.0:8085, management address http://0.0.0.0:8085, model store /mnt/models/model-store [I 220601 13:52:17 TorchserveModel:54] kfmodel Predict URL set to 0.0.0.0:8085 [I 220601 13:52:17 TorchserveModel:56] kfmodel Explain URL set to 0.0.0.0:8085 [I 220601 13:52:17 TSModelRepository:30] TSModelRepo is initialized [I 220601 13:52:17 kfserver:150] Registering model: my-model [I 220601 13:52:17 kfserver:120] Setting asyncio max_workers as 28 [I 220601 13:52:17 kfserver:127] Listening on port 8080 [I 220601 13:52:17 kfserver:129] Will fork 1 workers 2022-06-01T13:52:17,671 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model my-model 2022-06-01T13:52:17,671 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model my-model 2022-06-01T13:52:50,088 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model my-model 2022-06-01T13:52:50,088 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model my-model loaded. 
2022-06-01T13:52:50,088 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: my-model, count: 1 2022-06-01T13:52:50,096 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/usr/bin/python, /usr/local/lib/python3.8/dist-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000] 2022-06-01T13:52:50,097 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel. 2022-06-01T13:52:50,158 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://0.0.0.0:8085 2022-06-01T13:52:50,159 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel. 2022-06-01T13:52:50,160 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://0.0.0.0:8082 Model server started. 2022-06-01T13:52:50,363 [WARN ] pool-3-thread-1 org.pytorch.serve.metrics.MetricCollector - worker pid is not available yet. 2022-06-01T13:52:50,415 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:40.0|#Level:Host|#hostname:my-model-predictor-default-00001-deployment-64f4bc9566-8x7ps,timestamp:1654091570 2022-06-01T13:52:50,417 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:94.23827743530273|#Level:Host|#hostname:my-model-predictor-default-00001-deployment-64f4bc9566-8x7ps,timestamp:1654091570 2022-06-01T13:52:50,417 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:774.5930213928223|#Level:Host|#hostname:my-model-predictor-default-00001-deployment-64f4bc9566-8x7ps,timestamp:1654091570 2022-06-01T13:52:50,418 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:89.2|#Level:Host|#hostname:my-model-predictor-default-00001-deployment-64f4bc9566-8x7ps,timestamp:1654091570 2022-06-01T13:52:50,418 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:23591.0078125|#Level:Host|#hostname:my-model-predictor-default-00001-deployment-64f4bc9566-8x7ps,timestamp:1654091570 2022-06-01T13:52:50,418 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:8041.05078125|#Level:Host|#hostname:my-model-predictor-default-00001-deployment-64f4bc9566-8x7ps,timestamp:1654091570 2022-06-01T13:52:50,419 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:26.5|#Level:Host|#hostname:my-model-predictor-default-00001-deployment-64f4bc9566-8x7ps,timestamp:1654091570 2022-06-01T13:52:50,766 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000 2022-06-01T13:52:50,767 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - [PID]50 2022-06-01T13:52:50,767 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Torch worker started. 2022-06-01T13:52:50,767 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Python runtime: 3.8.0 2022-06-01T13:52:50,768 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my-model_1.0 State change null -> WORKER_STARTED 2022-06-01T13:52:50,774 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000 2022-06-01T13:52:50,784 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000. 2022-06-01T13:52:50,787 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. 
to backend at: 1654091570787 2022-06-01T13:52:50,819 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - model_name: my-model, batchSize: 1 2022-06-01T13:52:50,888 [WARN ] W-9000-my-model_1.0-stderr MODEL_LOG - terminate called after throwing an instance of 'std::bad_alloc' 2022-06-01T13:52:50,889 [WARN ] W-9000-my-model_1.0-stderr MODEL_LOG - what(): std::bad_alloc 2022-06-01T13:52:58,066 [INFO ] W-9000-my-model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my-model_1.0-stderr 2022-06-01T13:52:58,066 [INFO ] W-9000-my-model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my-model_1.0-stdout 2022-06-01T13:52:58,067 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED 2022-06-01T13:52:58,080 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED 2022-06-01T13:52:58,080 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died. java.lang.InterruptedException: null at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056) ~[?:?] at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133) ~[?:?] at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432) ~[?:?] at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:189) [model-server.jar:?] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?] at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?] at java.lang.Thread.run(Thread.java:829) [?:?] 2022-06-01T13:52:58,090 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: my-model, error: Worker died. 2022-06-01T13:52:58,091 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my-model_1.0 State change WORKER_STARTED -> WORKER_STOPPED 2022-06-01T13:52:58,091 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my-model_1.0-stderr 2022-06-01T13:52:58,091 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my-model_1.0-stdout 2022-06-01T13:52:58,092 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds. 2022-06-01T13:52:59,092 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/usr/bin/python, /usr/local/lib/python3.8/dist-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000] 2022-06-01T13:52:59,812 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000 2022-06-01T13:52:59,813 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - [PID]107 2022-06-01T13:52:59,813 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Torch worker started. 
2022-06-01T13:52:59,813 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my-model_1.0 State change WORKER_STOPPED -> WORKER_STARTED 2022-06-01T13:52:59,813 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Python runtime: 3.8.0 2022-06-01T13:52:59,813 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000 2022-06-01T13:52:59,816 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1654091579815 2022-06-01T13:52:59,816 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000. 2022-06-01T13:52:59,836 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - model_name: my-model, batchSize: 1 2022-06-01T13:52:59,912 [WARN ] W-9000-my-model_1.0-stderr MODEL_LOG - terminate called after throwing an instance of 'std::bad_alloc' 2022-06-01T13:52:59,913 [WARN ] W-9000-my-model_1.0-stderr MODEL_LOG - what(): std::bad_alloc 2022-06-01T13:53:03,968 [INFO ] W-9000-my-model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my-model_1.0-stderr 2022-06-01T13:53:03,969 [INFO ] W-9000-my-model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my-model_1.0-stdout 2022-06-01T13:53:03,969 [INFO ] epollEventLoopGroup-5-2 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED 2022-06-01T13:53:03,972 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED 2022-06-01T13:53:03,973 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died. java.lang.InterruptedException: null at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056) ~[?:?] at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133) ~[?:?] at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432) ~[?:?] at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:189) [model-server.jar:?] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?] at java.lang.Thread.run(Thread.java:829) [?:?] 2022-06-01T13:53:03,973 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: my-model, error: Worker died. 2022-06-01T13:53:03,973 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my-model_1.0 State change WORKER_STARTED -> WORKER_STOPPED 2022-06-01T13:53:03,974 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my-model_1.0-stderr 2022-06-01T13:53:03,974 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my-model_1.0-stdout 2022-06-01T13:53:03,974 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds. 
2022-06-01T13:53:04,974 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/usr/bin/python, /usr/local/lib/python3.8/dist-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000] 2022-06-01T13:53:05,584 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000 2022-06-01T13:53:05,585 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - [PID]135 2022-06-01T13:53:05,585 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Torch worker started. 2022-06-01T13:53:05,585 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my-model_1.0 State change WORKER_STOPPED -> WORKER_STARTED 2022-06-01T13:53:05,585 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Python runtime: 3.8.0 2022-06-01T13:53:05,586 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000 2022-06-01T13:53:05,588 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1654091585588 2022-06-01T13:53:05,588 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000. 2022-06-01T13:53:05,608 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - model_name: my-model, batchSize: 1 2022-06-01T13:53:05,680 [WARN ] W-9000-my-model_1.0-stderr MODEL_LOG - terminate called after throwing an instance of 'std::bad_alloc' 2022-06-01T13:53:05,681 [WARN ] W-9000-my-model_1.0-stderr MODEL_LOG - what(): std::bad_alloc 2022-06-01T13:53:09,625 [INFO ] W-9000-my-model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my-model_1.0-stderr 2022-06-01T13:53:09,625 [INFO ] W-9000-my-model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my-model_1.0-stdout 2022-06-01T13:53:09,625 [INFO ] epollEventLoopGroup-5-3 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED 2022-06-01T13:53:09,632 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED 2022-06-01T13:53:09,632 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died. java.lang.InterruptedException: null at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056) ~[?:?] at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133) ~[?:?] at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432) ~[?:?] at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:189) [model-server.jar:?] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?] at java.lang.Thread.run(Thread.java:829) [?:?] 2022-06-01T13:53:09,633 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: my-model, error: Worker died. 
2022-06-01T13:53:09,633 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my-model_1.0 State change WORKER_STARTED -> WORKER_STOPPED 2022-06-01T13:53:09,633 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my-model_1.0-stderr 2022-06-01T13:53:09,633 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my-model_1.0-stdout 2022-06-01T13:53:09,633 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 2 seconds. 2022-06-01T13:53:11,634 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/usr/bin/python, /usr/local/lib/python3.8/dist-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000] 2022-06-01T13:53:12,255 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000 2022-06-01T13:53:12,256 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - [PID]163 2022-06-01T13:53:12,256 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Torch worker started. 2022-06-01T13:53:12,256 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Python runtime: 3.8.0 2022-06-01T13:53:12,256 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my-model_1.0 State change WORKER_STOPPED -> WORKER_STARTED 2022-06-01T13:53:12,257 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000 2022-06-01T13:53:12,259 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1654091592259 2022-06-01T13:53:12,259 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000. 2022-06-01T13:53:12,284 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - model_name: my-model, batchSize: 1 2022-06-01T13:53:12,361 [WARN ] W-9000-my-model_1.0-stderr MODEL_LOG - terminate called after throwing an instance of 'std::bad_alloc' 2022-06-01T13:53:12,362 [WARN ] W-9000-my-model_1.0-stderr MODEL_LOG - what(): std::bad_alloc 2022-06-01T13:53:16,966 [INFO ] W-9000-my-model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my-model_1.0-stderr 2022-06-01T13:53:16,966 [INFO ] W-9000-my-model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my-model_1.0-stdout 2022-06-01T13:53:16,966 [INFO ] epollEventLoopGroup-5-4 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED 2022-06-01T13:53:16,973 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED 2022-06-01T13:53:16,973 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died. java.lang.InterruptedException: null at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056) ~[?:?] at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133) ~[?:?] at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432) ~[?:?] at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:189) [model-server.jar:?] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?] at java.lang.Thread.run(Thread.java:829) [?:?] 
2022-06-01T13:53:16,974 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: my-model, error: Worker died. 2022-06-01T13:53:16,974 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my-model_1.0 State change WORKER_STARTED -> WORKER_STOPPED 2022-06-01T13:53:16,974 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my-model_1.0-stderr 2022-06-01T13:53:16,974 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my-model_1.0-stdout 2022-06-01T13:53:16,974 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 3 seconds. 2022-06-01T13:53:19,975 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/usr/bin/python, /usr/local/lib/python3.8/dist-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000] 2022-06-01T13:53:20,544 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000 2022-06-01T13:53:20,545 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - [PID]191 2022-06-01T13:53:20,545 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Torch worker started. 2022-06-01T13:53:20,545 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Python runtime: 3.8.0 2022-06-01T13:53:20,545 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my-model_1.0 State change WORKER_STOPPED -> WORKER_STARTED 2022-06-01T13:53:20,545 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000 2022-06-01T13:53:20,547 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1654091600547 2022-06-01T13:53:20,547 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000. 2022-06-01T13:53:20,569 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - model_name: my-model, batchSize: 1 2022-06-01T13:53:20,644 [WARN ] W-9000-my-model_1.0-stderr MODEL_LOG - terminate called after throwing an instance of 'std::bad_alloc' 2022-06-01T13:53:20,645 [WARN ] W-9000-my-model_1.0-stderr MODEL_LOG - what(): std::bad_alloc 2022-06-01T13:53:25,278 [INFO ] W-9000-my-model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my-model_1.0-stderr 2022-06-01T13:53:25,278 [INFO ] W-9000-my-model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my-model_1.0-stdout 2022-06-01T13:53:25,279 [INFO ] epollEventLoopGroup-5-5 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED 2022-06-01T13:53:25,286 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED 2022-06-01T13:53:25,286 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died. java.lang.InterruptedException: null at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056) ~[?:?] at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133) ~[?:?] at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432) ~[?:?] at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:189) [model-server.jar:?] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?] 
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?] at java.lang.Thread.run(Thread.java:829) [?:?] 2022-06-01T13:53:25,287 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: my-model, error: Worker died. 2022-06-01T13:53:25,287 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my-model_1.0 State change WORKER_STARTED -> WORKER_STOPPED 2022-06-01T13:53:25,287 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my-model_1.0-stderr 2022-06-01T13:53:25,287 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my-model_1.0-stdout 2022-06-01T13:53:25,288 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 5 seconds. 2022-06-01T13:53:30,288 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/usr/bin/python, /usr/local/lib/python3.8/dist-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000] 2022-06-01T13:53:31,052 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000 2022-06-01T13:53:31,053 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - [PID]219 2022-06-01T13:53:31,053 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Torch worker started. 2022-06-01T13:53:31,053 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Python runtime: 3.8.0 2022-06-01T13:53:31,053 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my-model_1.0 State change WORKER_STOPPED -> WORKER_STARTED 2022-06-01T13:53:31,053 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000 2022-06-01T13:53:31,055 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1654091611055 2022-06-01T13:53:31,055 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000. 2022-06-01T13:53:31,056 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - model_name: my-model, batchSize: 1 2022-06-01T13:53:31,130 [WARN ] W-9000-my-model_1.0-stderr MODEL_LOG - terminate called after throwing an instance of 'std::bad_alloc' 2022-06-01T13:53:31,131 [WARN ] W-9000-my-model_1.0-stderr MODEL_LOG - what(): std::bad_alloc 2022-06-01T13:53:35,936 [INFO ] W-9000-my-model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my-model_1.0-stdout 2022-06-01T13:53:35,936 [INFO ] W-9000-my-model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my-model_1.0-stderr 2022-06-01T13:53:35,936 [INFO ] epollEventLoopGroup-5-6 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED 2022-06-01T13:53:35,941 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED 2022-06-01T13:53:35,941 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died. java.lang.InterruptedException: null at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056) ~[?:?] at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133) ~[?:?] at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432) ~[?:?] 
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:189) [model-server.jar:?] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?] at java.lang.Thread.run(Thread.java:829) [?:?] 2022-06-01T13:53:35,941 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: my-model, error: Worker died. 2022-06-01T13:53:35,941 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my-model_1.0 State change WORKER_STARTED -> WORKER_STOPPED 2022-06-01T13:53:35,942 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my-model_1.0-stderr 2022-06-01T13:53:35,942 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my-model_1.0-stdout 2022-06-01T13:53:35,942 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 8 seconds.

Installation instructions

No, torchserve is running inside kserve

Model Packaging

step by step with torch-model-archiver

config.properties

install_py_dep_per_model=true
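For reference, a config.properties consistent with the settings echoed in the startup log would look roughly like the sketch below; only install_py_dep_per_model=true is quoted from the report, and the remaining keys are illustrative values filled in from the logged addresses and model store path.

 inference_address=http://0.0.0.0:8085
 management_address=http://0.0.0.0:8085
 metrics_address=http://0.0.0.0:8082
 model_store=/mnt/models/model-store
 install_py_dep_per_model=true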

Versions

Name: torch Version: 1.9.0+cu111

Name: torch-model-archiver Version: 0.6.0

Repro instructions

kubectl apply -f bodypart.yaml -n kserve-test

 apiVersion: "serving.kserve.io/v1beta1"
 kind: "InferenceService"
 metadata:
   namespace: "kserve-test"
   name: "my-model"
 spec:
   predictor:
     serviceAccountName: kserve-sa
     pytorch:
       protocolVersion: "v2"
       storageUri: "s3://.../kserve/"
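To capture the logs quoted above from the failing pod, commands along these lines should work; the pod name is a placeholder, and kserve-container is the container name shown at the top of the error log.

 kubectl get inferenceservice my-model -n kserve-test
 kubectl get pods -n kserve-test
 kubectl logs <my-model-predictor-pod> -c kserve-container -n kserve-test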

Possible Solution

No response

msaroufim commented 2 years ago

2 questions

If it's a no for both, I can take a look tomorrow, since this may be a memory leak.

Related to https://github.com/pytorch/serve/issues/1587

festinais commented 2 years ago

I checked with torchserve 0.6.0 and the same thing happens. With torchserve 0.2.0 and 0.4.0 there is a different error:

2022-06-01T15:31:36,576 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - self.handle_connection(cl_socket)
2022-06-01T15:31:36,577 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - File "/usr/local/lib/python3.8/dist-packages/ts/model_service_worker.py", line 123, in handle_connection
2022-06-01T15:31:36,577 [INFO ] epollEventLoopGroup-5-6 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2022-06-01T15:31:36,577 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - service, result, code = self.load_model(msg)
2022-06-01T15:31:36,577 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2022-06-01T15:31:36,577 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - File "/usr/local/lib/python3.8/dist-packages/ts/model_service_worker.py", line 95, in load_model
2022-06-01T15:31:36,577 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - service = model_loader.load(model_name, model_dir, handler, gpu,
2022-06-01T15:31:36,577 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - TypeError: load() takes from 6 to 7 positional arguments but 8 were given
2022-06-01T15:31:36,577 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: null
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056) ~[?:?]
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133) ~[?:?]
    at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432) ~[?:?]
    at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:189) [model-server.jar:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:829) [?:?]

As for memory, I'm not aware of having set any memory constraints; I didn't set any.

msaroufim commented 2 years ago

Hi @FestinaI, I just tried reproducing this with our vanilla Docker images and found a problem with 0.5.3, but 0.5.2 and 0.6.0 are fine. So this is not a problem related to the image version; it is something other than https://github.com/pytorch/serve/issues/1587#issuecomment-1144229504, despite the similar error.

Can you please write a minimal repro or link me the exact files and requirements you used, so I can copy-paste commands and get the exact same error you did?

 torch-model-archiver --model-name bodypartrec --version 1.0 --serialized-file model.pt --handler model_handler.py --requirements-file requir/requirements.txt

For example, here I don't have your handler, requirements, or .pt file. I also don't know how you started your cluster or how you launched torchserve.

When I tried running your kubectl command I got this:

 (serve) ubuntu@ip-172-31-16-198:~$ kubectl apply -f bodypart.yaml -n kserve-test
 The connection to the server localhost:8080 was refused - did you specify the right host or port?

So an exact, and if possible minimal, repro will help.

festinais commented 2 years ago

So I found the trigger. I'm using KServe, and torchserve is started in the KServe environment. When I created the MAR file I passed the monai==0.8.1 dependency through --requirements-file requirements.txt, and I also added install_py_dep_per_model=true to the config.properties file. Testing with the MAR file created this way led to the error in the end.

When I created the MAR file without any requirements file, it worked properly with KServe; otherwise, as explained, the std::bad_alloc happened with versions 0.5.2 and 0.6.0, and the other error happened with the older torchserve versions.

So I'm guessing this is due to dependency version conflicts, but I'm not sure.
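One way to probe that guess (a sketch, not a known fix) is to install the same pinned dependency into the serving image's Python environment and let pip flag conflicts; pip check and pip list are standard pip subcommands, and torch 1.9.0+cu111 is the version listed in this issue.

 # inside the kserve-container image's Python environment
 pip install monai==0.8.1
 pip check                      # reports packages with conflicting version pins
 pip list | grep -Ei 'torch|monai'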

msaroufim commented 2 years ago

OK, that's good-ish news, thanks for looking into it - I don't think a requirements.txt should just break everything though. Can you repro this issue without kserve, with just a vanilla torchserve setup and monai==0.8.1?
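A minimal sketch of that vanilla setup, reusing only names that already appear in this thread (model.pt, model_handler.py, monai==0.8.1, install_py_dep_per_model=true) plus standard torchserve CLI flags:

 echo "monai==0.8.1" > requirements.txt
 echo "install_py_dep_per_model=true" > config.properties
 torch-model-archiver --model-name bodypartrec --version 1.0 \
     --serialized-file model.pt --handler model_handler.py \
     --requirements-file requirements.txt
 mkdir -p model_store && mv bodypartrec.mar model_store/
 torchserve --start --ncs --model-store model_store \
     --models bodypartrec=bodypartrec.mar --ts-config config.properties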

festinais commented 2 years ago

No I couldn't reproduce it without kserve. I tried with different python envs 3.8-3.10, and everything worked fine. The error always happened with kserve.

jagadeeshi2i commented 2 years ago

@FestinaI KServe 0.8.0 uses a torchserve 0.5.2 dev build, which has been reported with issue #1587. Can you try installing KServe from source, which has torchserve 0.6.0?

You could run torchserve and kserve locally by following this doc - https://github.com/pytorch/serve/tree/master/kubernetes/kserve/kf_request_json/v2/mnist#torchserve-example-with-mnist-model

Torchserve logs with kserve (TS: 0.6.0, KServe: dev):

WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2022-06-06T12:16:11,459 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2022-06-06T12:16:11,856 [INFO ] main org.pytorch.serve.ModelServer - 
Torchserve version: 0.6.0
TS Home: /usr/local/lib/python3.8/dist-packages
Current directory: /home/model-server
Temp directory: /home/model-server/tmp
Number of GPUs: 0
Number of CPUs: 1
Max heap size: 494 M
Python executable: /usr/bin/python
Config file: /mnt/models/config/config.properties
Inference address: http://0.0.0.0:8085
Management address: http://0.0.0.0:8085
Metrics address: http://0.0.0.0:8082
Model Store: /mnt/models/model-store
Initial Models: N/A
Log dir: /home/model-server/logs
Metrics dir: /home/model-server/logs
Netty threads: 4
Netty client threads: 0
Default workers per model: 1
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: true
Metrics report format: prometheus
Enable metrics API: true
Workflow Store: /mnt/models/model-store
Model config: N/A
2022-06-06T12:16:11,939 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager -  Loading snapshot serializer plugin...
2022-06-06T12:16:11,973 [INFO ] main org.pytorch.serve.snapshot.SnapshotManager - Started restoring models from snapshot {"name":"startup.cfg","modelCount":1,"models":{"mnist":{"1.0":{"defaultVersion":true,"marName":"mnist.mar","minWorkers":1,"maxWorkers":5,"batchSize":2,"maxBatchDelay":500,"responseTimeout":60}}}}
2022-06-06T12:16:12,047 [INFO ] main org.pytorch.serve.snapshot.SnapshotManager - Validating snapshot startup.cfg
2022-06-06T12:16:12,048 [INFO ] main org.pytorch.serve.snapshot.SnapshotManager - Snapshot startup.cfg validated successfully
2022-06-06T12:16:12,546 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model mnist
2022-06-06T12:16:12,546 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model mnist
2022-06-06T12:16:12,546 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model mnist
2022-06-06T12:16:12,547 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model mnist loaded.
2022-06-06T12:16:12,547 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: mnist, count: 1
2022-06-06T12:16:12,559 [DEBUG] W-9000-mnist_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/usr/bin/python, /usr/local/lib/python3.8/dist-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000]
2022-06-06T12:16:12,561 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2022-06-06T12:16:12,944 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://0.0.0.0:8085
2022-06-06T12:16:12,944 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2022-06-06T12:16:12,946 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://0.0.0.0:8082
[I 220606 12:16:13 __main__:75] Wrapper : Model names ['mnist'], inference address http//0.0.0.0:8085, management address http://0.0.0.0:8085, model store /mnt/models/model-store
[I 220606 12:16:13 TorchserveModel:54] kfmodel Predict URL set to 0.0.0.0:8085
[I 220606 12:16:13 TorchserveModel:56] kfmodel Explain URL set to 0.0.0.0:8085
[I 220606 12:16:13 TSModelRepository:30] TSModelRepo is initialized
[I 220606 12:16:13 model_server:150] Registering model: mnist
[I 220606 12:16:13 model_server:123] Listening on port 8080
[I 220606 12:16:13 model_server:125] Will fork 1 workers
[I 220606 12:16:13 model_server:128] Setting max asyncio worker threads as 5
Model server started.
2022-06-06T12:16:13,839 [WARN ] pool-3-thread-1 org.pytorch.serve.metrics.MetricCollector - worker pid is not available yet.
2022-06-06T12:16:13,980 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:torchserve-mnist-v2-predictor-default-00001-deployment-675xf4qd,timestamp:1654517773
2022-06-06T12:16:13,980 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:64.83048248291016|#Level:Host|#hostname:torchserve-mnist-v2-predictor-default-00001-deployment-675xf4qd,timestamp:1654517773
2022-06-06T12:16:13,981 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:15.157779693603516|#Level:Host|#hostname:torchserve-mnist-v2-predictor-default-00001-deployment-675xf4qd,timestamp:1654517773
2022-06-06T12:16:13,981 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:19.0|#Level:Host|#hostname:torchserve-mnist-v2-predictor-default-00001-deployment-675xf4qd,timestamp:1654517773
2022-06-06T12:16:13,981 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:241758.83203125|#Level:Host|#hostname:torchserve-mnist-v2-predictor-default-00001-deployment-675xf4qd,timestamp:1654517773
2022-06-06T12:16:13,981 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:1893.73828125|#Level:Host|#hostname:torchserve-mnist-v2-predictor-default-00001-deployment-675xf4qd,timestamp:1654517773
2022-06-06T12:16:13,982 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:1.6|#Level:Host|#hostname:torchserve-mnist-v2-predictor-default-00001-deployment-675xf4qd,timestamp:1654517773
2022-06-06T12:16:14,277 [INFO ] W-9000-mnist_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2022-06-06T12:16:14,278 [INFO ] W-9000-mnist_1.0-stdout MODEL_LOG - [PID]78
2022-06-06T12:16:14,278 [INFO ] W-9000-mnist_1.0-stdout MODEL_LOG - Torch worker started.
2022-06-06T12:16:14,278 [INFO ] W-9000-mnist_1.0-stdout MODEL_LOG - Python runtime: 3.8.0
2022-06-06T12:16:14,279 [DEBUG] W-9000-mnist_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-mnist_1.0 State change null -> WORKER_STARTED
2022-06-06T12:16:14,286 [INFO ] W-9000-mnist_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2022-06-06T12:16:14,295 [INFO ] W-9000-mnist_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2022-06-06T12:16:14,298 [INFO ] W-9000-mnist_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1654517774298
2022-06-06T12:16:14,385 [INFO ] W-9000-mnist_1.0-stdout MODEL_LOG - model_name: mnist, batchSize: 2
2022-06-06T12:16:14,931 [INFO ] W-9000-mnist_1.0-stdout MODEL_LOG - generated new fontManager
2022-06-06T12:16:15,558 [INFO ] W-9000-mnist_1.0-stdout MODEL_LOG - Missing the index_to_name.json file. Inference output will not include class name.
2022-06-06T12:16:15,565 [INFO ] W-9000-mnist_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 1180
2022-06-06T12:16:15,566 [DEBUG] W-9000-mnist_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-mnist_1.0 State change WORKER_STARTED -> WORKER_MODEL_LOADED
2022-06-06T12:16:15,566 [INFO ] W-9000-mnist_1.0 TS_METRICS - W-9000-mnist_1.0.ms:3011|#Level:Host|#hostname:torchserve-mnist-v2-predictor-default-00001-deployment-675xf4qd,timestamp:1654517775
2022-06-06T12:16:15,566 [INFO ] W-9000-mnist_1.0 TS_METRICS - WorkerThreadTime.ms:88|#Level:Host|#hostname:torchserve-mnist-v2-predictor-default-00001-deployment-675xf4qd,timestamp:1654517775
2022-06-06T12:17:13,805 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:16.7|#Level:Host|#hostname:torchserve-mnist-v2-predictor-default-00001-deployment-675xf4qd,timestamp:1654517833
2022-06-06T12:17:13,806 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:64.8305778503418|#Level:Host|#hostname:torchserve-mnist-v2-predictor-default-00001-deployment-675xf4qd,timestamp:1654517833
2022-06-06T12:17:13,806 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:15.157684326171875|#Level:Host|#hostname:torchserve-mnist-v2-predictor-default-00001-deployment-675xf4qd,timestamp:1654517833
2022-06-06T12:17:13,806 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:18.9|#Level:Host|#hostname:torchserve-mnist-v2-predictor-default-00001-deployment-675xf4qd,timestamp:1654517833
2022-06-06T12:17:13,806 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:241658.0|#Level:Host|#hostname:torchserve-mnist-v2-predictor-default-00001-deployment-675xf4qd,timestamp:1654517833
2022-06-06T12:17:13,806 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:1993.984375|#Level:Host|#hostname:torchserve-mnist-v2-predictor-default-00001-deployment-675xf4qd,timestamp:1654517833
2022-06-06T12:17:13,807 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:1.6|#Level:Host|#hostname:torchserve-mnist-v2-predictor-default-00001-deployment-675xf4qd,timestamp:1654517833
[I 220606 12:22:19 TorchserveModel:78] PREDICTOR_HOST : 0.0.0.0:8085
2022-06-06T12:22:20,458 [INFO ] W-9000-mnist_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1654518140458
2022-06-06T12:22:20,460 [INFO ] W-9000-mnist_1.0-stdout MODEL_LOG - Backend received inference at: 1654518140
2022-06-06T12:22:20,473 [WARN ] W-9000-mnist_1.0-stderr MODEL_LOG - /usr/local/lib/python3.8/dist-packages/torch/nn/functional.py:1320: UserWarning: dropout2d: Received a 2-D input to dropout2d, which is deprecated and will result in an error in a future release. To retain the behavior and silence this warning, please use dropout instead. Note that dropout2d exists to provide channel-wise dropout on inputs with 2 spatial dimensions, a channel dimension, and an optional batch dimension (i.e. 3D or 4D inputs).
2022-06-06T12:22:20,473 [INFO ] W-9000-mnist_1.0-stdout MODEL_METRICS - HandlerTime.Milliseconds:13.0|#ModelName:mnist,Level:Model|#hostname:torchserve-mnist-v2-predictor-default-00001-deployment-675xf4qd,requestID:4f6a68e2-11d1-4f35-ac40-d46f0c071f0d,timestamp:1654518140
2022-06-06T12:22:20,474 [INFO ] W-9000-mnist_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 15
2022-06-06T12:22:20,474 [WARN ] W-9000-mnist_1.0-stderr MODEL_LOG -   warnings.warn(warn_msg)
2022-06-06T12:22:20,474 [INFO ] W-9000-mnist_1.0-stdout MODEL_METRICS - PredictionTime.Milliseconds:13.18|#ModelName:mnist,Level:Model|#hostname:torchserve-mnist-v2-predictor-default-00001-deployment-675xf4qd,requestID:4f6a68e2-11d1-4f35-ac40-d46f0c071f0d,timestamp:1654518140
2022-06-06T12:22:20,538 [INFO ] W-9000-mnist_1.0 ACCESS_LOG - /127.0.0.1:37930 "POST /v1/models/mnist:predict HTTP/1.1" 200 588
2022-06-06T12:22:20,538 [INFO ] W-9000-mnist_1.0 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:torchserve-mnist-v2-predictor-default-00001-deployment-675xf4qd,timestamp:1654518139
2022-06-06T12:22:20,539 [DEBUG] W-9000-mnist_1.0 org.pytorch.serve.job.Job - Waiting time ns: 500307958, Backend time ns: 80821477
2022-06-06T12:22:20,539 [INFO ] W-9000-mnist_1.0 TS_METRICS - QueueTime.ms:500|#Level:Host|#hostname:torchserve-mnist-v2-predictor-default-00001-deployment-675xf4qd,timestamp:1654518140
2022-06-06T12:22:20,539 [INFO ] W-9000-mnist_1.0 TS_METRICS - WorkerThreadTime.ms:66|#Level:Host|#hostname:torchserve-mnist-v2-predictor-default-00001-deployment-675xf4qd,timestamp:1654518140
[I 220606 12:22:20 web:2239] 200 POST /v2/models/mnist/infer (127.0.0.1) 648.27ms
bamboosdu commented 2 years ago

same to me 2022-06-08 15:26:47,431 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Listening on port: /tmp/.ts.sock.9000 2022-06-08 15:26:47,431 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - [PID]19125 2022-06-08 15:26:47,431 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Torch worker started. 2022-06-08 15:26:47,431 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Python runtime: 3.7.9 2022-06-08 15:26:47,432 [DEBUG] W-9000-maskrcnn_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-maskrcnn_1.0 State change WORKER_STOPPED -> WORKER_STARTED 2022-06-08 15:26:47,432 [INFO ] W-9000-maskrcnn_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.9000 2022-06-08 15:26:47,433 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Connection accepted: /tmp/.ts.sock.9000. 2022-06-08 15:26:47,582 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Backend worker process died. 2022-06-08 15:26:47,582 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Traceback (most recent call last): 2022-06-08 15:26:47,582 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/qianzhi/anaconda3/envs/serve2/lib/python3.7/site-packages/ts/model_loader.py", line 84, in load 2022-06-08 15:26:47,582 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - module = importlib.import_module(module_name) 2022-06-08 15:26:47,582 [INFO ] epollEventLoopGroup-5-9 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED 2022-06-08 15:26:47,582 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/qianzhi/anaconda3/envs/serve2/lib/python3.7/importlib/__init__.py", line 127, in import_module 2022-06-08 15:26:47,582 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - return _bootstrap._gcd_import(name[level:], package, level) 2022-06-08 15:26:47,582 [DEBUG] W-9000-maskrcnn_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED 2022-06-08 15:26:47,582 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "<frozen importlib._bootstrap>", line 1006, in _gcd_import 2022-06-08 15:26:47,582 [DEBUG] W-9000-maskrcnn_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died. java.lang.InterruptedException at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056) at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133) at java.base/java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432) at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:129) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:834) 2022-06-08 15:26:47,582 [WARN ] W-9000-maskrcnn_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: maskrcnn, error: Worker died. 
2022-06-08 15:26:47,583 [DEBUG] W-9000-maskrcnn_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-maskrcnn_1.0 State change WORKER_STARTED -> WORKER_STOPPED 2022-06-08 15:26:47,583 [WARN ] W-9000-maskrcnn_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-maskrcnn_1.0-stderr 2022-06-08 15:26:47,583 [WARN ] W-9000-maskrcnn_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-maskrcnn_1.0-stdout 2022-06-08 15:26:47,582 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "<frozen importlib._bootstrap>", line 983, in _find_and_load 2022-06-08 15:26:47,583 [INFO ] W-9000-maskrcnn_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 34 seconds. 2022-06-08 15:26:47,583 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-maskrcnn_1.0-stdout 2022-06-08 15:26:47,585 [INFO ] W-9000-maskrcnn_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-maskrcnn_1.0-stderr 2022-06-08 15:26:51,906 [INFO ] pool-2-thread-2 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:qianzhi-System-Product-Name,timestamp:1654673211 2022-06-08 15:26:51,907 [INFO ] pool-2-thread-2 TS_METRICS - DiskAvailable.Gigabytes:45.78533935546875|#Level:Host|#hostname:qianzhi-System-Product-Name,timestamp:1654673211 2022-06-08 15:26:51,907 [INFO ] pool-2-thread-2 TS_METRICS - DiskUsage.Gigabytes:42.80013656616211|#Level:Host|#hostname:qianzhi-System-Product-Name,timestamp:1654673211 2022-06-08 15:26:51,907 [INFO ] pool-2-thread-2 TS_METRICS - DiskUtilization.Percent:48.3|#Level:Host|#hostname:qianzhi-System-Product-Name,timestamp:1654673211 2022-06-08 15:26:51,907 [INFO ] pool-2-thread-2 TS_METRICS - MemoryAvailable.Megabytes:29351.25|#Level:Host|#hostname:qianzhi-System-Product-Name,timestamp:1654673211 2022-06-08 15:26:51,907 [INFO ] pool-2-thread-2 TS_METRICS - MemoryUsed.Megabytes:2070.296875|#Level:Host|#hostname:qianzhi-System-Product-Name,timestamp:1654673211 2022-06-08 15:26:51,907 [INFO ] pool-2-thread-2 TS_METRICS - MemoryUtilization.Percent:8.1|#Level:Host|#hostname:qianzhi-System-Product-Name,timestamp:1654673211 2022-06-08 15:27:21,624 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Listening on port: /tmp/.ts.sock.9000 2022-06-08 15:27:21,624 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - [PID]19977 2022-06-08 15:27:21,624 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Torch worker started. 2022-06-08 15:27:21,624 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Python runtime: 3.7.9 2022-06-08 15:27:21,624 [DEBUG] W-9000-maskrcnn_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-maskrcnn_1.0 State change WORKER_STOPPED -> WORKER_STARTED 2022-06-08 15:27:21,624 [INFO ] W-9000-maskrcnn_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.9000 2022-06-08 15:27:21,625 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Connection accepted: /tmp/.ts.sock.9000. 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Backend worker process died. 
2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Traceback (most recent call last): 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/qianzhi/anaconda3/envs/serve2/lib/python3.7/site-packages/ts/model_loader.py", line 84, in load 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - module = importlib.import_module(module_name) 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/qianzhi/anaconda3/envs/serve2/lib/python3.7/importlib/__init__.py", line 127, in import_module 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - return _bootstrap._gcd_import(name[level:], package, level) 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "<frozen importlib._bootstrap>", line 1006, in _gcd_import 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "<frozen importlib._bootstrap>", line 983, in _find_and_load 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "<frozen importlib._bootstrap>", line 965, in _find_and_load_unlocked 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - ModuleNotFoundError: No module named 'object_detector' 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - During handling of the above exception, another exception occurred: 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Traceback (most recent call last): 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/qianzhi/anaconda3/envs/serve2/lib/python3.7/site-packages/ts/model_service_worker.py", line 176, in <module> 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - worker.run_server() 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/qianzhi/anaconda3/envs/serve2/lib/python3.7/site-packages/ts/model_service_worker.py", line 148, in run_server 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - self.handle_connection(cl_socket) 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/qianzhi/anaconda3/envs/serve2/lib/python3.7/site-packages/ts/model_service_worker.py", line 112, in handle_connection 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - service, result, code = self.load_model(msg) 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/qianzhi/anaconda3/envs/serve2/lib/python3.7/site-packages/ts/model_service_worker.py", line 85, in load_model 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - service = model_loader.load(model_name, model_dir, handler, gpu, batch_size) 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout 
org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/qianzhi/anaconda3/envs/serve2/lib/python3.7/site-packages/ts/model_loader.py", line 88, in load 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - module = importlib.import_module(module_name, 'ts.torch_handler') 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/qianzhi/anaconda3/envs/serve2/lib/python3.7/importlib/__init__.py", line 127, in import_module 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - return _bootstrap._gcd_import(name[level:], package, level) 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "<frozen importlib._bootstrap>", line 1006, in _gcd_import 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "<frozen importlib._bootstrap>", line 983, in _find_and_load 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "<frozen importlib._bootstrap>", line 677, in _load_unlocked 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "<frozen importlib._bootstrap_external>", line 728, in exec_module 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/qianzhi/anaconda3/envs/serve2/lib/python3.7/site-packages/ts/torch_handler/object_detector.py", line 5, in <module> 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - from torchvision import transforms 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/qianzhi/anaconda3/envs/serve2/lib/python3.7/site-packages/torchvision/__init__.py", line 6, in <module> 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - from torchvision import models 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/qianzhi/anaconda3/envs/serve2/lib/python3.7/site-packages/torchvision/models/__init__.py", line 5, in <module> 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - from .inception import * 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/qianzhi/anaconda3/envs/serve2/lib/python3.7/site-packages/torchvision/models/inception.py", line 63, in <module> 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - class Inception3(nn.Module): 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/qianzhi/anaconda3/envs/serve2/lib/python3.7/site-packages/torchvision/models/inception.py", line 195, in Inception3 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - @torch.jit.unused 2022-06-08 15:27:21,770 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - AttributeError: 
module 'torch.jit' has no attribute 'unused' 2022-06-08 15:27:21,771 [INFO ] epollEventLoopGroup-5-10 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED 2022-06-08 15:27:21,771 [DEBUG] W-9000-maskrcnn_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED 2022-06-08 15:27:21,771 [DEBUG] W-9000-maskrcnn_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died. java.lang.InterruptedException at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056) at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133) at java.base/java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432) at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:129) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:834) 2022-06-08 15:27:21,771 [WARN ] W-9000-maskrcnn_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: maskrcnn, error: Worker died. 2022-06-08 15:27:21,771 [DEBUG] W-9000-maskrcnn_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-maskrcnn_1.0 State change WORKER_STARTED -> WORKER_STOPPED 2022-06-08 15:27:21,771 [WARN ] W-9000-maskrcnn_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-maskrcnn_1.0-stderr 2022-06-08 15:27:21,771 [WARN ] W-9000-maskrcnn_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-maskrcnn_1.0-stdout 2022-06-08 15:27:21,771 [INFO ] W-9000-maskrcnn_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 55 seconds. 2022-06-08 15:27:21,774 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-maskrcnn_1.0-stdout 2022-06-08 15:27:21,774 [INFO ] W-9000-maskrcnn_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-maskrcnn_1.0-stderr 2022-06-08 15:28:16,823 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Listening on port: /tmp/.ts.sock.9000 2022-06-08 15:28:16,823 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - [PID]21355 2022-06-08 15:28:16,823 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Torch worker started. 2022-06-08 15:28:16,823 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Python runtime: 3.7.9 2022-06-08 15:28:16,823 [DEBUG] W-9000-maskrcnn_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-maskrcnn_1.0 State change WORKER_STOPPED -> WORKER_STARTED 2022-06-08 15:28:16,823 [INFO ] W-9000-maskrcnn_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.9000 2022-06-08 15:28:16,824 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Connection accepted: /tmp/.ts.sock.9000. 2022-06-08 15:28:16,942 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Backend worker process died. 
2022-06-08 15:28:16,942 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Traceback (most recent call last): 2022-06-08 15:28:16,942 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/qianzhi/anaconda3/envs/serve2/lib/python3.7/site-packages/ts/model_loader.py", line 84, in load 2022-06-08 15:28:16,942 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - module = importlib.import_module(module_name) 2022-06-08 15:28:16,942 [INFO ] epollEventLoopGroup-5-11 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED 2022-06-08 15:28:16,942 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/qianzhi/anaconda3/envs/serve2/lib/python3.7/importlib/__init__.py", line 127, in import_module 2022-06-08 15:28:16,942 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - return _bootstrap._gcd_import(name[level:], package, level) 2022-06-08 15:28:16,942 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "<frozen importlib._bootstrap>", line 1006, in _gcd_import 2022-06-08 15:28:16,942 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "<frozen importlib._bootstrap>", line 983, in _find_and_load 2022-06-08 15:28:16,942 [DEBUG] W-9000-maskrcnn_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED 2022-06-08 15:28:16,942 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "<frozen importlib._bootstrap>", line 965, in _find_and_load_unlocked 2022-06-08 15:28:16,942 [DEBUG] W-9000-maskrcnn_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died. java.lang.InterruptedException at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056) at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133) at java.base/java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432) at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:129) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:834) 2022-06-08 15:28:16,942 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - ModuleNotFoundError: No module named 'object_detector' 2022-06-08 15:28:16,943 [WARN ] W-9000-maskrcnn_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: maskrcnn, error: Worker died. 
2022-06-08 15:28:16,943 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - 2022-06-08 15:28:16,943 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - During handling of the above exception, another exception occurred: 2022-06-08 15:28:16,943 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - 2022-06-08 15:28:16,943 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Traceback (most recent call last): 2022-06-08 15:28:16,943 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/qianzhi/anaconda3/envs/serve2/lib/python3.7/site-packages/ts/model_service_worker.py", line 176, in <module> 2022-06-08 15:28:16,944 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - worker.run_server() 2022-06-08 15:28:16,944 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/qianzhi/anaconda3/envs/serve2/lib/python3.7/site-packages/ts/model_service_worker.py", line 148, in run_server 2022-06-08 15:28:16,944 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - self.handle_connection(cl_socket) 2022-06-08 15:28:16,944 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/qianzhi/anaconda3/envs/serve2/lib/python3.7/site-packages/ts/model_service_worker.py", line 112, in handle_connection 2022-06-08 15:28:16,944 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - service, result, code = self.load_model(msg) 2022-06-08 15:28:16,944 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/qianzhi/anaconda3/envs/serve2/lib/python3.7/site-packages/ts/model_service_worker.py", line 85, in load_model 2022-06-08 15:28:16,944 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - service = model_loader.load(model_name, model_dir, handler, gpu, batch_size) 2022-06-08 15:28:16,944 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/qianzhi/anaconda3/envs/serve2/lib/python3.7/site-packages/ts/model_loader.py", line 88, in load 2022-06-08 15:28:16,944 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - module = importlib.import_module(module_name, 'ts.torch_handler') 2022-06-08 15:28:16,944 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/qianzhi/anaconda3/envs/serve2/lib/python3.7/importlib/__init__.py", line 127, in import_module 2022-06-08 15:28:16,944 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - return _bootstrap._gcd_import(name[level:], package, level) 2022-06-08 15:28:16,944 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "<frozen importlib._bootstrap>", line 1006, in _gcd_import 2022-06-08 15:28:16,944 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "<frozen importlib._bootstrap>", line 983, in _find_and_load 2022-06-08 15:28:16,944 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked 2022-06-08 15:28:16,944 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "<frozen importlib._bootstrap>", line 677, in _load_unlocked 2022-06-08 15:28:16,944 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "<frozen importlib._bootstrap_external>", line 728, in exec_module 2022-06-08 
15:28:16,944 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed 2022-06-08 15:28:16,944 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/qianzhi/anaconda3/envs/serve2/lib/python3.7/site-packages/ts/torch_handler/object_detector.py", line 5, in <module> 2022-06-08 15:28:16,944 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - from torchvision import transforms 2022-06-08 15:28:16,944 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/qianzhi/anaconda3/envs/serve2/lib/python3.7/site-packages/torchvision/__init__.py", line 6, in <module> 2022-06-08 15:28:16,944 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - from torchvision import models 2022-06-08 15:28:16,944 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/qianzhi/anaconda3/envs/serve2/lib/python3.7/site-packages/torchvision/models/__init__.py", line 5, in <module> 2022-06-08 15:28:16,944 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - from .inception import * 2022-06-08 15:28:16,944 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/qianzhi/anaconda3/envs/serve2/lib/python3.7/site-packages/torchvision/models/inception.py", line 63, in <module> 2022-06-08 15:28:16,944 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - class Inception3(nn.Module): 2022-06-08 15:28:16,944 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/qianzhi/anaconda3/envs/serve2/lib/python3.7/site-packages/torchvision/models/inception.py", line 195, in Inception3 2022-06-08 15:28:16,944 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - @torch.jit.unused 2022-06-08 15:28:16,944 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - AttributeError: module 'torch.jit' has no attribute 'unused' 2022-06-08 15:28:16,943 [DEBUG] W-9000-maskrcnn_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-maskrcnn_1.0 State change WORKER_STARTED -> WORKER_STOPPED 2022-06-08 15:28:16,945 [WARN ] W-9000-maskrcnn_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-maskrcnn_1.0-stderr 2022-06-08 15:28:16,945 [WARN ] W-9000-maskrcnn_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-maskrcnn_1.0-stdout 2022-06-08 15:28:16,945 [INFO ] W-9000-maskrcnn_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 89 seconds. 2022-06-08 15:28:16,946 [INFO ] W-9000-maskrcnn_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-maskrcnn_1.0-stderr 2022-06-08 15:28:16,946 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-maskrcnn_1.0-stdout 2022-06-08 15:29:45,985 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Listening on port: /tmp/.ts.sock.9000 2022-06-08 15:29:45,985 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - [PID]23552 2022-06-08 15:29:45,985 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Torch worker started. 
2022-06-08 15:29:45,985 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Python runtime: 3.7.9 2022-06-08 15:29:45,985 [DEBUG] W-9000-maskrcnn_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-maskrcnn_1.0 State change WORKER_STOPPED -> WORKER_STARTED 2022-06-08 15:29:45,985 [INFO ] W-9000-maskrcnn_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.9000 2022-06-08 15:29:45,986 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Connection accepted: /tmp/.ts.sock.9000. 2022-06-08 15:29:46,131 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Backend worker process died. 2022-06-08 15:29:46,131 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Traceback (most recent call last): 2022-06-08 15:29:46,131 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/qianzhi/anaconda3/envs/serve2/lib/python3.7/site-packages/ts/model_loader.py", line 84, in load 2022-06-08 15:29:46,131 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - module = importlib.import_module(module_name) 2022-06-08 15:29:46,131 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/qianzhi/anaconda3/envs/serve2/lib/python3.7/importlib/__init__.py", line 127, in import_module 2022-06-08 15:29:46,131 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - return _bootstrap._gcd_import(name[level:], package, level) 2022-06-08 15:29:46,131 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "<frozen importlib._bootstrap>", line 1006, in _gcd_import 2022-06-08 15:29:46,131 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "<frozen importlib._bootstrap>", line 983, in _find_and_load 2022-06-08 15:29:46,131 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "<frozen importlib._bootstrap>", line 965, in _find_and_load_unlocked 2022-06-08 15:29:46,131 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - ModuleNotFoundError: No module named 'object_detector' 2022-06-08 15:29:46,131 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - 2022-06-08 15:29:46,131 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - During handling of the above exception, another exception occurred: 2022-06-08 15:29:46,131 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - 2022-06-08 15:29:46,132 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Traceback (most recent call last): 2022-06-08 15:29:46,132 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/qianzhi/anaconda3/envs/serve2/lib/python3.7/site-packages/ts/model_service_worker.py", line 176, in <module> 2022-06-08 15:29:46,132 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - worker.run_server() 2022-06-08 15:29:46,132 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/qianzhi/anaconda3/envs/serve2/lib/python3.7/site-packages/ts/model_service_worker.py", line 148, in run_server 2022-06-08 15:29:46,132 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - self.handle_connection(cl_socket) 2022-06-08 15:29:46,132 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File 
"/home/qianzhi/anaconda3/envs/serve2/lib/python3.7/site-packages/ts/model_service_worker.py", line 112, in handle_connection 2022-06-08 15:29:46,132 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - service, result, code = self.load_model(msg) 2022-06-08 15:29:46,132 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/qianzhi/anaconda3/envs/serve2/lib/python3.7/site-packages/ts/model_service_worker.py", line 85, in load_model 2022-06-08 15:29:46,132 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - service = model_loader.load(model_name, model_dir, handler, gpu, batch_size) 2022-06-08 15:29:46,132 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/qianzhi/anaconda3/envs/serve2/lib/python3.7/site-packages/ts/model_loader.py", line 88, in load 2022-06-08 15:29:46,132 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - module = importlib.import_module(module_name, 'ts.torch_handler') 2022-06-08 15:29:46,132 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/qianzhi/anaconda3/envs/serve2/lib/python3.7/importlib/__init__.py", line 127, in import_module 2022-06-08 15:29:46,132 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - return _bootstrap._gcd_import(name[level:], package, level) 2022-06-08 15:29:46,132 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "<frozen importlib._bootstrap>", line 1006, in _gcd_import 2022-06-08 15:29:46,132 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "<frozen importlib._bootstrap>", line 983, in _find_and_load 2022-06-08 15:29:46,132 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked 2022-06-08 15:29:46,132 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "<frozen importlib._bootstrap>", line 677, in _load_unlocked 2022-06-08 15:29:46,132 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "<frozen importlib._bootstrap_external>", line 728, in exec_module 2022-06-08 15:29:46,132 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed 2022-06-08 15:29:46,132 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/qianzhi/anaconda3/envs/serve2/lib/python3.7/site-packages/ts/torch_handler/object_detector.py", line 5, in <module> 2022-06-08 15:29:46,132 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - from torchvision import transforms 2022-06-08 15:29:46,132 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/qianzhi/anaconda3/envs/serve2/lib/python3.7/site-packages/torchvision/__init__.py", line 6, in <module> 2022-06-08 15:29:46,132 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - from torchvision import models 2022-06-08 15:29:46,132 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/qianzhi/anaconda3/envs/serve2/lib/python3.7/site-packages/torchvision/models/__init__.py", line 5, in <module> 2022-06-08 15:29:46,132 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - from .inception import * 2022-06-08 15:29:46,132 [INFO ] W-9000-maskrcnn_1.0-stdout 
org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/qianzhi/anaconda3/envs/serve2/lib/python3.7/site-packages/torchvision/models/inception.py", line 63, in <module> 2022-06-08 15:29:46,132 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - class Inception3(nn.Module): 2022-06-08 15:29:46,132 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/qianzhi/anaconda3/envs/serve2/lib/python3.7/site-packages/torchvision/models/inception.py", line 195, in Inception3 2022-06-08 15:29:46,132 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - @torch.jit.unused 2022-06-08 15:29:46,132 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - AttributeError: module 'torch.jit' has no attribute 'unused' 2022-06-08 15:29:46,133 [INFO ] epollEventLoopGroup-5-12 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED 2022-06-08 15:29:46,133 [DEBUG] W-9000-maskrcnn_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED 2022-06-08 15:29:46,133 [DEBUG] W-9000-maskrcnn_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died. java.lang.InterruptedException at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056) at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133) at java.base/java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432) at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:129) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:834) 2022-06-08 15:29:46,133 [WARN ] W-9000-maskrcnn_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: maskrcnn, error: Worker died. 2022-06-08 15:29:46,133 [DEBUG] W-9000-maskrcnn_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-maskrcnn_1.0 State change WORKER_STARTED -> WORKER_STOPPED 2022-06-08 15:29:46,133 [WARN ] W-9000-maskrcnn_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-maskrcnn_1.0-stderr 2022-06-08 15:29:46,133 [WARN ] W-9000-maskrcnn_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-maskrcnn_1.0-stdout 2022-06-08 15:29:46,134 [INFO ] W-9000-maskrcnn_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 144 seconds. 2022-06-08 15:29:46,135 [INFO ] W-9000-maskrcnn_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-maskrcnn_1.0-stderr 2022-06-08 15:29:46,135 [INFO ] W-9000-maskrcnn_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-maskrcnn_1.0-stdout

festinais commented 2 years ago

@bamboosdu The error in your case is happening for a different reason; it's not the std::bad_alloc / memory issue from this thread. Your worker logs show AttributeError: module 'torch.jit' has no attribute 'unused' while importing torchvision's handlers, which points to a torch/torchvision version mismatch in the worker environment rather than an allocation failure.
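A quick way to confirm this in your environment is something like the following (run inside the same env the worker uses; version numbers are only illustrative, and torch.jit.unused only exists in more recent torch releases):

# print the torch / torchvision versions the worker actually imports
python -c "import torch, torchvision; print(torch.__version__, torchvision.__version__)"
# if this prints False, the installed torch is too old for this torchvision
python -c "import torch; print(hasattr(torch.jit, 'unused'))"
# upgrading the pair together usually clears the AttributeError
pip install --upgrade torch torchvision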

festinais commented 2 years ago

@jagadeeshi2i I can't seem to find a KServe source that includes torchserve 0.6.0. One idea I had was to exec into the container and install torchserve 0.6.0 manually; however, I'm not sure how to start torchserve manually inside the KServe environment, because I can't find the MAR file in there....

Do you have any ideas?
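For anyone trying the same thing, the commands I have been using look roughly like this (the pod name is a placeholder; /mnt/models is where the KServe storage initializer is supposed to mount the archive and config):

# open a shell in the TorchServe container of the predictor pod
kubectl exec -it <predictor-pod-name> -n kserve-test -c kserve-container -- /bin/bash

# look for the mounted MAR file and config
ls /mnt/models/model-store
cat /mnt/models/config/config.properties

# upgrading torchserve inside the running container (resets on pod restart)
pip install torchserve==0.6.0

# start TorchServe by hand with the same arguments the runtime passes
torchserve --start --model-store=/mnt/models/model-store --ts-config=/mnt/models/config/config.properties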

festinais commented 2 years ago

Alternatively, I will try just changing the version in the kserve.yaml file used when installing KServe:

pytorch": {
          "v1" : {
            "image": "kserve/torchserve-kfs",
            "defaultImageVersion": "0.5.3",
            "defaultGpuImageVersion": "0.5.3-gpu",
            "supportedFrameworks": [
              "pytorch"
            ],
            "multiModelServer": false
          },
          "v2" : {
            "image": "kserve/torchserve-kfs",
            "defaultImageVersion": "0.5.3",
            "defaultGpuImageVersion": "0.5.3-gpu",
            "supportedFrameworks": [
              "pytorch"
            ],
            "multiModelServer": false
          }
        }
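For completeness, on a cluster where KServe is already installed, the same predictors block can apparently be edited in place instead of re-installing. This is only a sketch and assumes the default config map name and namespace of a standard KServe install; adjust to your setup:

# find the config map that holds the predictors JSON shown above
kubectl get configmaps -n kserve
kubectl edit configmap inferenceservice-config -n kserve
# then bump "defaultImageVersion" (and the "image" repository) under "pytorch"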
festinais commented 2 years ago

Also, in kserve-runtimes.yaml there is:

spec:
  containers:
  - args:
    - torchserve
    - --start
    - --model-store=/mnt/models/model-store
    - --ts-config=/mnt/models/config/config.properties
    env:
    - name: TS_SERVICE_ENVELOPE
      value: '{{.Labels.serviceEnvelope}}'
    image: kserve/torchserve-kfs:0.5.3
    name: kserve-container

I'm not sure whether changing it to kserve/torchserve-kfs:0.6.0 would work in this case.

jagadeeshi2i commented 2 years ago

@festinais Please try changing the images to pytorch/torchserve-kfs:0.6.0 in both kserve-runtimes.yaml and kserve.yaml.
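For example, something along these lines before re-applying the manifests (file names are the ones from your setup; adjust the sed patterns to whatever image and tag your files currently pin):

# swap the image repository and tag in both manifests, then re-apply
sed -i 's|kserve/torchserve-kfs:0.5.3|pytorch/torchserve-kfs:0.6.0|g' kserve-runtimes.yaml
sed -i 's|"image": "kserve/torchserve-kfs"|"image": "pytorch/torchserve-kfs"|g' kserve.yaml
sed -i 's|"defaultImageVersion": "0.5.3"|"defaultImageVersion": "0.6.0"|g' kserve.yaml
kubectl apply -f kserve.yaml
kubectl apply -f kserve-runtimes.yaml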

festinais commented 2 years ago

@jagadeeshi2i I just updated both yaml files, kserve.yaml and kserve-runtimes.yaml, to torchserve 0.6.0. With KServe installed from these files I couldn't create the InferenceService: its status was always FALSE, and no pods were created for me to inspect the logs.

Maybe it cannot find the image kserve/torchserve-kfs:0.6.0?
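Since there are no pods to read logs from, the status conditions and namespace events are the next place to look (replace the name and namespace with yours):

# show why the InferenceService is not becoming ready
kubectl get inferenceservice <name> -n <namespace> -o yaml
kubectl describe inferenceservice <name> -n <namespace>
# events usually surface image pull errors, e.g. a tag that does not exist
kubectl get events -n <namespace> --sort-by=.lastTimestamp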

jagadeeshi2i commented 2 years ago

@festinais You have to use this image, as I mentioned in the comment above: https://hub.docker.com/layers/torchserve-kfs/pytorch/torchserve-kfs/0.6.0/images/sha256-43c1c14b2768c0cd73ce40e23da0a56176532b3c0fcb8b2217c087da3054c372?context=explore

festinais commented 2 years ago

Thank you! It works now :) It was my mistake; I was referring to kserve/ instead of pytorch/.

jack-gits commented 2 years ago

@jagadeeshi2i I ran into the same error when using FROM pytorch/torchserve:0.6.0-gpu. Do I need to change to pytorch/torchserve-kfs:0.6.0? What is the difference between them?