pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

Torchserve: Listening on port: None - Backend worker monitoring thread interrupted or backend worker process died #1266

Closed: benjtlv closed this issue 3 years ago

benjtlv commented 3 years ago

Hello.

I have been using TorchServe for my team for 2 months now. I am working on Windows 10 with a Python 3.9.6 virtual env, and I run torch-model-archiver and torchserve inside this environment.

Things worked really well for a while, but I recently came back to working on TorchServe and, to my surprise, I now hit the following issue:

(venv) C:\Users\Benjamin Elkrieff\Workspace\hexa_algo>
2021-10-05 11:03:55,785 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2021-10-05 11:03:56,030 [INFO ] main org.pytorch.serve.ModelServer - Torchserve version: 0.4.2
TS Home: C:\Users\Benjamin Elkrieff\Workspace\hexa_algo\venv\lib\site-packages
Current directory: C:\Users\Benjamin Elkrieff\Workspace\hexa_algo
Temp directory: C:\Users\BENJAM~1\AppData\Local\Temp
Number of GPUs: 1
Number of CPUs: 16
Max heap size: 8044 M
Python executable: c:\users\benjamin elkrieff\workspace\hexa_algo\venv\scripts\python.exe
Config file: .\model_serving\product_image_embedder\config\config.properties
Inference address: http://0.0.0.0:8080
Management address: http://0.0.0.0:8081
Metrics address: http://0.0.0.0:8082
Model Store: C:\Users\Benjamin Elkrieff\Workspace\hexa_algo\model_serving\product_image_embedder\model-store
Initial Models: image_embedder.mar
Log dir: C:\Users\Benjamin Elkrieff\Workspace\hexa_algo\logs
Metrics dir: C:\Users\Benjamin Elkrieff\Workspace\hexa_algo\logs
Netty threads: 32
Netty client threads: 0
Default workers per model: 1
Blacklist Regex: N/A
Maximum Response Size: 65535000
Maximum Request Size: 6553500
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: true
Metrics report format: prometheus
Enable metrics API: true
Workflow Store: C:\Users\Benjamin Elkrieff\Workspace\hexa_algo\model_serving\product_image_embedder\model-store
Model config: N/A
2021-10-05 11:03:56,038 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Loading snapshot serializer plugin...
2021-10-05 11:03:56,060 [INFO ] main org.pytorch.serve.ModelServer - Loading initial models: image_embedder.mar
2021-10-05 11:03:57,466 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 0.0 for model image_embedder
2021-10-05 11:03:57,467 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 0.0 for model image_embedder
2021-10-05 11:03:59,665 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model image_embedder loaded.
2021-10-05 11:03:59,665 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: image_embedder, count: 1
2021-10-05 11:03:59,677 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: NioServerSocketChannel.
2021-10-05 11:03:59,788 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://0.0.0.0:8080
2021-10-05 11:03:59,788 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: NioServerSocketChannel.
2021-10-05 11:03:59,792 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://0.0.0.0:8081
2021-10-05 11:03:59,793 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: NioServerSocketChannel.
2021-10-05 11:03:59,794 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://0.0.0.0:8082
Model server started.
2021-10-05 11:04:00,012 [WARN ] pool-2-thread-1 org.pytorch.serve.metrics.MetricCollector - worker pid is not available yet.
2021-10-05 11:04:00,134 [INFO ] pool-2-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:LAPTOP-5NOJI4TU,timestamp:1633428240
2021-10-05 11:04:00,134 [INFO ] pool-2-thread-1 TS_METRICS - DiskAvailable.Gigabytes:503.54068756103516|#Level:Host|#hostname:LAPTOP-5NOJI4TU,timestamp:1633428240
2021-10-05 11:04:00,136 [INFO ] pool-2-thread-1 TS_METRICS - DiskUsage.Gigabytes:429.67837142944336|#Level:Host|#hostname:LAPTOP-5NOJI4TU,timestamp:1633428240
2021-10-05 11:04:00,138 [INFO ] pool-2-thread-1 TS_METRICS - DiskUtilization.Percent:46.0|#Level:Host|#hostname:LAPTOP-5NOJI4TU,timestamp:1633428240
2021-10-05 11:04:00,139 [INFO ] pool-2-thread-1 TS_METRICS - MemoryAvailable.Megabytes:8344.05078125|#Level:Host|#hostname:LAPTOP-5NOJI4TU,timestamp:1633428240
2021-10-05 11:04:00,140 [INFO ] pool-2-thread-1 TS_METRICS - MemoryUsed.Megabytes:23829.578125|#Level:Host|#hostname:LAPTOP-5NOJI4TU,timestamp:1633428240
2021-10-05 11:04:00,141 [INFO ] pool-2-thread-1 TS_METRICS - MemoryUtilization.Percent:74.1|#Level:Host|#hostname:LAPTOP-5NOJI4TU,timestamp:1633428240
2021-10-05 11:04:00,418 [INFO ] W-9000-image_embedder_0.0-stdout MODEL_LOG - Listening on port: None
2021-10-05 11:04:00,421 [INFO ] W-9000-image_embedder_0.0-stdout MODEL_LOG - [PID]31548
2021-10-05 11:04:00,421 [INFO ] W-9000-image_embedder_0.0-stdout MODEL_LOG - Torch worker started.
2021-10-05 11:04:00,422 [DEBUG] W-9000-image_embedder_0.0 org.pytorch.serve.wlm.WorkerThread - W-9000-image_embedder_0.0 State change null -> WORKER_STARTED
2021-10-05 11:04:00,423 [INFO ] W-9000-image_embedder_0.0-stdout MODEL_LOG - Python runtime: 3.9.6
2021-10-05 11:04:00,427 [INFO ] W-9000-image_embedder_0.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /127.0.0.1:9000
2021-10-05 11:04:00,439 [INFO ] W-9000-image_embedder_0.0-stdout MODEL_LOG - Connection accepted: ('127.0.0.1', 9000).
2021-10-05 11:04:00,467 [INFO ] W-9000-image_embedder_0.0-stdout MODEL_LOG - model_name: image_embedder, batchSize: 1
2021-10-05 11:04:00,674 [INFO ] W-9000-image_embedder_0.0-stdout MODEL_LOG - Backend worker process died.
2021-10-05 11:04:00,675 [INFO ] W-9000-image_embedder_0.0-stdout MODEL_LOG - Traceback (most recent call last):
2021-10-05 11:04:00,676 [INFO ] nioEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2021-10-05 11:04:00,677 [INFO ] W-9000-image_embedder_0.0-stdout MODEL_LOG - File "C:\Users\Benjamin Elkrieff\Workspace\hexa_algo\venv\lib\site-packages\ts\model_service_worker.py", line 183, in <module>
2021-10-05 11:04:00,679 [DEBUG] W-9000-image_embedder_0.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2021-10-05 11:04:00,680 [INFO ] W-9000-image_embedder_0.0-stdout MODEL_LOG - worker.run_server()
2021-10-05 11:04:00,681 [INFO ] W-9000-image_embedder_0.0-stdout MODEL_LOG - File "C:\Users\Benjamin Elkrieff\Workspace\hexa_algo\venv\lib\site-packages\ts\model_service_worker.py", line 155, in run_server
2021-10-05 11:04:00,682 [INFO ] W-9000-image_embedder_0.0-stdout MODEL_LOG - self.handle_connection(cl_socket)
2021-10-05 11:04:00,681 [DEBUG] W-9000-image_embedder_0.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException
	at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1668)
	at java.base/java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:435)
	at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:188)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
	at java.base/java.lang.Thread.run(Thread.java:831)
2021-10-05 11:04:00,683 [INFO ] W-9000-image_embedder_0.0-stdout MODEL_LOG - File "C:\Users\Benjamin Elkrieff\Workspace\hexa_algo\venv\lib\site-packages\ts\model_service_worker.py", line 117, in handle_connection
2021-10-05 11:04:00,684 [WARN ] W-9000-image_embedder_0.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: image_embedder, error: Worker died.
2021-10-05 11:04:00,685 [INFO ] W-9000-image_embedder_0.0-stdout MODEL_LOG - service, result, code = self.load_model(msg)
2021-10-05 11:04:00,685 [DEBUG] W-9000-image_embedder_0.0 org.pytorch.serve.wlm.WorkerThread - W-9000-image_embedder_0.0 State change WORKER_STARTED -> WORKER_STOPPED
2021-10-05 11:04:00,686 [INFO ] W-9000-image_embedder_0.0-stdout MODEL_LOG - File "C:\Users\Benjamin Elkrieff\Workspace\hexa_algo\venv\lib\site-packages\ts\model_service_worker.py", line 90, in load_model
2021-10-05 11:04:00,687 [WARN ] W-9000-image_embedder_0.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-image_embedder_0.0-stderr
2021-10-05 11:04:00,687 [INFO ] W-9000-image_embedder_0.0-stdout MODEL_LOG - service = model_loader.load(model_name, model_dir, handler, gpu, batch_size, envelope)
2021-10-05 11:04:00,688 [WARN ] W-9000-image_embedder_0.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-image_embedder_0.0-stdout
2021-10-05 11:04:00,689 [INFO ] W-9000-image_embedder_0.0-stdout MODEL_LOG - File "c:\users\benjamin elkrieff\workspace\hexa_algo\venv\lib\site-packages\ts\model_loader.py", line 81, in load
2021-10-05 11:04:00,690 [INFO ] W-9000-image_embedder_0.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.
2021-10-05 11:04:00,690 [INFO ] W-9000-image_embedder_0.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-image_embedder_0.0-stderr
2021-10-05 11:04:00,692 [INFO ] W-9000-image_embedder_0.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-image_embedder_0.0-stdout
2021-10-05 11:04:02,402 [INFO ] W-9000-image_embedder_0.0-stdout MODEL_LOG - Listening on port: None
2021-10-05 11:04:02,404 [INFO ] W-9000-image_embedder_0.0-stdout MODEL_LOG - [PID]31556
2021-10-05 11:04:02,404 [INFO ] W-9000-image_embedder_0.0-stdout MODEL_LOG - Torch worker started.
2021-10-05 11:04:02,404 [DEBUG] W-9000-image_embedder_0.0 org.pytorch.serve.wlm.WorkerThread - W-9000-image_embedder_0.0 State change WORKER_STOPPED -> WORKER_STARTED
2021-10-05 11:04:02,407 [INFO ] W-9000-image_embedder_0.0-stdout MODEL_LOG - Python runtime: 3.9.6
2021-10-05 11:04:02,408 [INFO ] W-9000-image_embedder_0.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /127.0.0.1:9000
2021-10-05 11:04:02,411 [INFO ] W-9000-image_embedder_0.0-stdout MODEL_LOG - Connection accepted: ('127.0.0.1', 9000).
2021-10-05 11:04:02,421 [INFO ] W-9000-image_embedder_0.0-stdout MODEL_LOG - model_name: image_embedder, batchSize: 1
2021-10-05 11:04:02,609 [INFO ] W-9000-image_embedder_0.0-stdout MODEL_LOG - Backend worker process died.
2021-10-05 11:04:02,609 [INFO ] W-9000-image_embedder_0.0-stdout MODEL_LOG - Traceback (most recent call last):
2021-10-05 11:04:02,610 [INFO ] nioEventLoopGroup-5-2 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2021-10-05 11:04:02,612 [INFO ] W-9000-image_embedder_0.0-stdout MODEL_LOG - File "C:\Users\Benjamin Elkrieff\Workspace\hexa_algo\venv\lib\site-packages\ts\model_service_worker.py", line 183, in <module>
2021-10-05 11:04:02,613 [DEBUG] W-9000-image_embedder_0.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2021-10-05 11:04:02,614 [INFO ] W-9000-image_embedder_0.0-stdout MODEL_LOG - worker.run_server()
2021-10-05 11:04:02,615 [DEBUG] W-9000-image_embedder_0.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException
	at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1668)
	at java.base/java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:435)
	at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:188)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
	at java.base/java.lang.Thread.run(Thread.java:831)
2021-10-05 11:04:02,615 [INFO ] W-9000-image_embedder_0.0-stdout MODEL_LOG - File "C:\Users\Benjamin Elkrieff\Workspace\hexa_algo\venv\lib\site-packages\ts\model_service_worker.py", line 155, in run_server
2021-10-05 11:04:02,616 [WARN ] W-9000-image_embedder_0.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: image_embedder, error: Worker died.
2021-10-05 11:04:02,617 [INFO ] W-9000-image_embedder_0.0-stdout MODEL_LOG - self.handle_connection(cl_socket)
2021-10-05 11:04:02,618 [DEBUG] W-9000-image_embedder_0.0 org.pytorch.serve.wlm.WorkerThread - W-9000-image_embedder_0.0 State change WORKER_STARTED -> WORKER_STOPPED
2021-10-05 11:04:02,618 [INFO ] W-9000-image_embedder_0.0-stdout MODEL_LOG - File "C:\Users\Benjamin Elkrieff\Workspace\hexa_algo\venv\lib\site-packages\ts\model_service_worker.py", line 117, in handle_connection
2021-10-05 11:04:02,619 [WARN ] W-9000-image_embedder_0.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-image_embedder_0.0-stderr
2021-10-05 11:04:02,620 [INFO ] W-9000-image_embedder_0.0-stdout MODEL_LOG - service, result, code = self.load_model(msg)
2021-10-05 11:04:02,621 [WARN ] W-9000-image_embedder_0.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-image_embedder_0.0-stdout
2021-10-05 11:04:02,621 [INFO ] W-9000-image_embedder_0.0-stdout MODEL_LOG - File "C:\Users\Benjamin Elkrieff\Workspace\hexa_algo\venv\lib\site-packages\ts\model_service_worker.py", line 90, in load_model
2021-10-05 11:04:02,622 [INFO ] W-9000-image_embedder_0.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-image_embedder_0.0-stderr
2021-10-05 11:04:02,622 [INFO ] W-9000-image_embedder_0.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.
2021-10-05 11:04:02,624 [INFO ] W-9000-image_embedder_0.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-image_embedder_0.0-stdout

I have to say that I deployed the same model in production and it works. I also tried going back to an old commit from a month ago, and it still doesn't work on my laptop. I also tried recreating the venv with all the dependencies, and the same problem arises. Even crazier, I asked my colleague to run the same thing on his laptop, and it also fails. But it works in production (k8s).

The commands I am running:

torch-model-archiver -f --model-name image_embedder --version 0.0 --serialized-file .\model_serving\product_image_embedder\models\inc3_embedder\inc3_ms.pt --handler .\model_serving\product_image_embedder\handler\model_handler.py --extra-files .\utils --export-path .\model_serving\product_image_embedder\model-store -r requirements-torch.txt

torchserve --start --ncs --model-store .\model_serving\product_image_embedder\model-store --models image_embedder.mar --ts-config .\model_serving\product_image_embedder\config\config.properties

It seems like there is a problem with the port being None instead of something like /tmp/.ts.sock.9000, so I assume I have a problem in some hidden configuration.
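For reference, my config.properties would look roughly like this, reconstructed from the values in the startup log above (this is not the actual file, so the real one may differ):

inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
model_store=model_serving/product_image_embedder/model-store
load_models=image_embedder.mar
default_workers_per_model=1
number_of_netty_threads=32
netty_client_threads=0
install_py_dep_per_model=true
max_response_size=65535000
max_request_size=6553500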

I think helping me here would also help tons of people.

msaroufim commented 3 years ago

It seems like you're seeing some issue with the batch aggregator, which is weird because, according to your config.properties, your batch size is equal to 1.

You're also seeing some sort of concurrency issue with the batch aggregator, which is an issue we've seen and are close to merging a fix for in this PR: https://github.com/pytorch/serve/pull/1244. If you have any feedback on whether it fixes your problem, please let me know.

Is there any chance you could share a container or a OneDrive file so I can run the example end to end and validate the issue?

In your config.properties, one suspicious thing is that netty_client_threads = 0, where it should be the default number of logical processors available to the JVM. Also, just double-checking that you have Java installed via python ts_scripts/install_dependencies.py, which should include Java 11.

benjtlv commented 3 years ago

Hey Mark!

Thank you for the fast answer! I actually resolved my problem, and it's so stupid you could just bang your head. I use a custom handler that reads some env vars, and the thing is that the env vars weren't set in my terminal.

So after I set the env vars, everything was okay.

But not being able to see this in the logs was very confusing.

So maybe that's something that should be taken care of.

Maybe we can see if I can contribute to this.

By the way, I wanted to open another issue for a second matter: it's about log4j with an external logging service (Coralogix).

So I'll just open another issue.

Thank you for your time

msaroufim commented 3 years ago

That's good to know. My suggestion would be to have a line somewhere in your handler that says

if "FOO" in os.environ:
    print("FOO not found in env variables")

Which will show up in your model_log.log
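Or, as a slightly fuller sketch (MyHandler and MY_ENV_VAR are just placeholders, and this assumes your handler subclasses BaseHandler), you could fail loudly in initialize so the reason lands in model_log.log instead of the worker just dying:

import os

from ts.torch_handler.base_handler import BaseHandler

REQUIRED_ENV_VARS = ["MY_ENV_VAR"]  # placeholder: whatever variables your handler actually reads


class MyHandler(BaseHandler):
    def initialize(self, context):
        # Check env vars before loading the model so a clear message shows up in model_log.log
        missing = [name for name in REQUIRED_ENV_VARS if name not in os.environ]
        if missing:
            raise RuntimeError(f"Missing required environment variables: {missing}")
        super().initialize(context)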

jessdaubner commented 2 years ago

That's good to know. My suggestion would be to have a line somewhere in your handler that says

if "FOO" in os.environ:
    print("FOO not found in env variables")

Which will show up in your model_log.log

I found this issue helpful for debugging issues with a custom handler. Thank you! Other than using print() in the custom handler, is it possible to pass logging statements from a custom handler into model_log.log?
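From what I can tell, the standard logging module also works here, since TorchServe forwards the backend worker's log output into model_log.log. A minimal sketch, assuming a handler that subclasses BaseHandler (MyHandler is a placeholder):

import logging

from ts.torch_handler.base_handler import BaseHandler

# Records emitted in the worker process end up in model_log.log, just like print()
logger = logging.getLogger(__name__)


class MyHandler(BaseHandler):
    def preprocess(self, data):
        logger.info("Received %d request(s)", len(data))
        return super().preprocess(data)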