pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

Worker unload/start loop #2428

Open · sreeprasannar opened this issue 1 year ago

sreeprasannar commented 1 year ago

Why would TorchServe workers restart over and over? I'm using the TorchServe Docker image, version 0.8.0.

2023-06-20T23:15:44,435 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - [PID]2899
2023-06-20T23:15:44,436 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Torch worker started.
2023-06-20T23:15:44,436 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Python runtime: 3.9.16
2023-06-20T23:15:44,438 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2023-06-20T23:15:46,800 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=2906
2023-06-20T23:15:46,801 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2023-06-20T23:15:46,810 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Successfully loaded /app/config/metrics.yaml.
2023-06-20T23:15:46,810 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - [PID]2906
2023-06-20T23:15:46,810 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Torch worker started.
2023-06-20T23:15:46,810 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Python runtime: 3.9.16
2023-06-20T23:15:46,811 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2023-06-20T23:15:50,202 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=2912
2023-06-20T23:15:50,203 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2023-06-20T23:15:50,211 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Successfully loaded /app/config/metrics.yaml.
2023-06-20T23:15:50,212 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - [PID]2912
2023-06-20T23:15:50,212 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Torch worker started.
2023-06-20T23:15:50,212 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Python runtime: 3.9.16
2023-06-20T23:15:50,212 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2023-06-20T23:15:54,592 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=2918
2023-06-20T23:15:54,593 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2023-06-20T23:15:54,604 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Successfully loaded /app/config/metrics.yaml.
2023-06-20T23:15:54,604 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - [PID]2918
2023-06-20T23:15:54,604 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Torch worker started.
2023-06-20T23:15:54,604 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Python runtime: 3.9.16
2023-06-20T23:15:54,605 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2023-06-20T23:16:00,982 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=2924
2023-06-20T23:16:00,983 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2023-06-20T23:16:00,992 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Successfully loaded /app/config/metrics.yaml.
2023-06-20T23:16:00,992 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - [PID]2924
2023-06-20T23:16:00,992 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Torch worker started.
2023-06-20T23:16:00,992 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Python runtime: 3.9.16
2023-06-20T23:16:00,993 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2023-06-20T23:16:10,402 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=2930
2023-06-20T23:16:10,403 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2023-06-20T23:16:10,412 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Successfully loaded /app/config/metrics.yaml.
2023-06-20T23:16:10,412 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - [PID]2930
2023-06-20T23:16:10,412 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Torch worker started.
2023-06-20T23:16:10,412 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Python runtime: 3.9.16
2023-06-20T23:16:10,413 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2023-06-20T23:16:24,789 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=2936
2023-06-20T23:16:24,790 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2023-06-20T23:16:24,799 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Successfully loaded /app/config/metrics.yaml.
2023-06-20T23:16:24,799 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - [PID]2936
2023-06-20T23:16:24,799 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Torch worker started.
2023-06-20T23:16:24,799 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Python runtime: 3.9.16
2023-06-20T23:16:24,799 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2023-06-20T23:16:47,198 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=2942
2023-06-20T23:16:47,199 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2023-06-20T23:16:47,208 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Successfully loaded /app/config/metrics.yaml.
2023-06-20T23:16:47,208 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - [PID]2942
2023-06-20T23:16:47,208 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Torch worker started.
2023-06-20T23:16:47,208 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Python runtime: 3.9.16
2023-06-20T23:16:47,209 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2023-06-20T23:17:22,596 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=2948
2023-06-20T23:17:22,597 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2023-06-20T23:17:22,606 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Successfully loaded /app/config/metrics.yaml.
2023-06-20T23:17:22,606 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - [PID]2948
2023-06-20T23:17:22,606 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Torch worker started.
2023-06-20T23:17:22,606 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Python runtime: 3.9.16
2023-06-20T23:17:22,607 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2023-06-20T23:18:18,973 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=2954
2023-06-20T23:18:18,973 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2023-06-20T23:18:18,982 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Successfully loaded /app/config/metrics.yaml.
2023-06-20T23:18:18,983 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - [PID]2954
2023-06-20T23:18:18,983 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Torch worker started.
2023-06-20T23:18:18,983 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Python runtime: 3.9.16
2023-06-20T23:18:18,983 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2023-06-20T23:19:49,389 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=2961
2023-06-20T23:19:49,389 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2023-06-20T23:19:49,398 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Successfully loaded /app/config/metrics.yaml.
2023-06-20T23:19:49,398 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - [PID]2961
2023-06-20T23:19:49,398 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Torch worker started.
2023-06-20T23:19:49,398 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Python runtime: 3.9.16
2023-06-20T23:19:49,399 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2023-06-20T23:22:14,772 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=2993
2023-06-20T23:22:14,773 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2023-06-20T23:22:14,782 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Successfully loaded /app/config/metrics.yaml.
2023-06-20T23:22:14,782 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - [PID]2993
2023-06-20T23:22:14,782 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Torch worker started.
2023-06-20T23:22:14,782 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Python runtime: 3.9.16
2023-06-20T23:22:14,783 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
msaroufim commented 1 year ago

It's usually one of:

  1. Model failure: does anything interesting show up in your logs/model_log.log?
  2. Config problem: what does your config.properties look like?
  3. Resource constraints: did you give Docker enough CPU and RAM? (See the sketch after this list.)
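As a point of reference on item 3 (a sketch, not from the thread; the flag values and image tag are assumptions), this is roughly what starting the container with explicit resource limits could look like:

# give the TorchServe container explicit CPU, RAM, and shared-memory limits (values illustrative)
docker run --rm -it \
  --cpus=4 --memory=8g --shm-size=1g \
  -p 8080:8080 -p 8081:8081 \
  pytorch/torchserve:0.8.0-cpu

If memory is the constraint, watching docker stats while the model loads can confirm whether the container hits its limit.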
sreeprasannar commented 1 year ago

The interesting thing is that I do see this in ts_log.log at the same times the workers in model_log.log get unloaded:

2023-06-20T23:22:14,783 [DEBUG] W-9000-auto_categorization_0.1 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: DefaultChannelPromise@7698e29a(failure: io.netty.channel.StacklessClosedChannelException)
    at io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:243) ~[model-server.jar:?]
    at io.netty.channel.DefaultChannelPromise.await(DefaultChannelPromise.java:131) ~[model-server.jar:?]
    at io.netty.channel.DefaultChannelPromise.await(DefaultChannelPromise.java:30) ~[model-server.jar:?]
    at io.netty.util.concurrent.DefaultPromise.sync(DefaultPromise.java:403) ~[model-server.jar:?]
    at io.netty.channel.DefaultChannelPromise.sync(DefaultChannelPromise.java:119) ~[model-server.jar:?]
    at io.netty.channel.DefaultChannelPromise.sync(DefaultChannelPromise.java:30) ~[model-server.jar:?]
    at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:201) [model-server.jar:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
    at java.lang.Thread.run(Thread.java:833) [?:?]
2023-06-20T23:22:14,783 [DEBUG] W-9000-auto_categorization_0.1 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: DefaultChannelPromise@7698e29a(failure: io.netty.channel.StacklessClosedChannelException)
    at io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:243) ~[model-server.jar:?]
    at io.netty.channel.DefaultChannelPromise.await(DefaultChannelPromise.java:131) ~[model-server.jar:?]
    at io.netty.channel.DefaultChannelPromise.await(DefaultChannelPromise.java:30) ~[model-server.jar:?]
    at io.netty.util.concurrent.DefaultPromise.sync(DefaultPromise.java:403) ~[model-server.jar:?]
    at io.netty.channel.DefaultChannelPromise.sync(DefaultChannelPromise.java:119) ~[model-server.jar:?]
    at io.netty.channel.DefaultChannelPromise.sync(DefaultChannelPromise.java:30) ~[model-server.jar:?]
    at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:201) [model-server.jar:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
    at java.lang.Thread.run(Thread.java:833) [?:?]
sreeprasannar commented 1 year ago

It seems related to https://github.com/pytorch/serve/issues/2357

sreeprasannar commented 1 year ago

TS uses a back_off to schedule worker recovery, and you can see the recovery schedule in the log. When a worker dies, TS tries to recover the backend worker immediately for the first 5 rounds, and the worker tries to load the model into memory right away, which is why the memory never seems to be released. So my guess is that TorchServe waits about 2 minutes for the worker to start, decides it's unresponsive, kills it, and this happens again and again.

  • I believe my inference code does have a slow startup time, so I could try optimizing that.
  • I could also increase the timeout. It's currently about 120 seconds, so maybe I could raise that (see the sketch after this list).
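A minimal sketch of what raising the timeout could look like, assuming the 120 seconds comes from TorchServe's default response timeout; the 300-second value and the .mar file name are illustrative:

# config.properties: raise the backend response timeout for all models (default is 120 seconds)
default_response_timeout=300

# or per model, at registration time through the management API (file name assumed)
curl -X POST "http://localhost:8081/models?url=auto_categorization.mar&initial_workers=1&response_timeout=300"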
sreeprasannar commented 1 year ago

The main issue is that I have install_py_dep_per_model=True because the model has requirements that need to be installed. That seems to be the main reason for the slow startup time.

msaroufim commented 1 year ago

Yeah, that flag makes everything slow. I'd suggest either consolidating your dependencies across all models or passing in your venv via extra files (see the sketch below).
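For example, one way to consolidate dependencies (a sketch, not from the thread; the base-image tag, user name, and file paths are assumptions) is to bake them into the serving image at build time and drop install_py_dep_per_model from config.properties:

# Dockerfile: install the consolidated model dependencies once, at image build time
FROM pytorch/torchserve:0.8.0-cpu
# the stock image runs as the model-server user; switching to root to install is an assumption
USER root
COPY requirements.txt /home/model-server/requirements.txt
RUN pip install --no-cache-dir -r /home/model-server/requirements.txt
USER model-server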

thirtyninetythree commented 1 year ago

I have this same error, but on a VM (not using Docker), with torchserve==0.8.1.

# imports implied by the snippet
from abc import ABC

import torch
from diffusers import ControlNetModel, StableDiffusionControlNetImg2ImgPipeline
from ts.torch_handler.base_handler import BaseHandler


class DiffusersHandler(BaseHandler, ABC):
    """
    Diffusers handler class for text-to-image generation.
    """
    def __init__(self):
        super(DiffusersHandler, self).__init__()
        self.initialized = False

    def initialize(self, ctx):
        """Load and initialize the Stable Diffusion ControlNet pipeline.
        Args:
            ctx (context): a JSON object containing information
            pertaining to the model artifact parameters.
        """
        # load the ControlNet weights in fp16
        controlnet = ControlNetModel.from_pretrained(
            "./stable-diffusion/controlnet", torch_dtype=torch.float16
        )
        # build the img2img pipeline around the base Stable Diffusion weights
        pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
            "./stable-diffusion",
            controlnet=controlnet,
            safety_checker=None,
            torch_dtype=torch.float16,
        )
        pipe.to("cuda")
        # keep a handle for inference (assumed; the posted snippet may be truncated)
        self.pipe = pipe

        self.initialized = True

Logs show the worker start/stop loop:

2023-07-02T09:24:04,960 [INFO ] epollEventLoopGroup-5-3 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2023-07-02T09:24:04,960 [INFO ] epollEventLoopGroup-5-3 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2023-07-02T09:24:04,960 [DEBUG] W-9000-stable-diffusion_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2023-07-02T09:24:04,960 [DEBUG] W-9000-stable-diffusion_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2023-07-02T09:24:04,960 [DEBUG] W-9000-stable-diffusion_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: null
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1081) ~[?:?]
    at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:276) ~[?:?]
    at org.pytorch.serve.wlm.WorkerThread.connect(WorkerThread.java:415) ~[model-server.jar:?]
    at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:183) [model-server.jar:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
    at java.lang.Thread.run(Thread.java:833) [?:?]
2023-07-02T09:24:04,960 [DEBUG] W-9000-stable-diffusion_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: null
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1081) ~[?:?]
    at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:276) ~[?:?]
    at org.pytorch.serve.wlm.WorkerThread.connect(WorkerThread.java:415) ~[model-server.jar:?]
    at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:183) [model-server.jar:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
    at java.lang.Thread.run(Thread.java:833) [?:?]
2023-07-02T09:24:04,962 [DEBUG] W-9000-stable-diffusion_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-stable-diffusion_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2023-07-02T09:24:04,962 [DEBUG] W-9000-stable-diffusion_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-stable-diffusion_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2023-07-02T09:24:04,962 [WARN ] W-9000-stable-diffusion_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery failed again
2023-07-02T09:24:04,962 [WARN ] W-9000-stable-diffusion_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery failed again

This follows the TorchServe diffusers tutorial here.
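If the worker is being killed because loading the pipeline takes longer than the default 120-second response timeout (an assumption; GPU memory exhaustion would produce a similar restart loop), one option is a per-model timeout in the archive's YAML config. A sketch, with illustrative values and an assumed handler file name:

# model-config.yaml
minWorkers: 1
maxWorkers: 1
# seconds before an unresponsive worker is rebooted; default is 120
responseTimeout: 600

# packaged into the archive via torch-model-archiver
torch-model-archiver --model-name stable-diffusion --version 1.0 \
  --handler diffusers_handler.py --config-file model-config.yaml --force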