sreeprasannar opened this issue 1 year ago
It's usually either a logs (model_log.log) or config problem: what does your config.properties look like?

The interesting thing is that I do see this in ts_log.log at the same times that workers get unloaded in model_log.log:
2023-06-20T23:22:14,783 [DEBUG] W-9000-auto_categorization_0.1 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: DefaultChannelPromise@7698e29a(failure: io.netty.channel.StacklessClosedChannelException)
at io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:243) ~[model-server.jar:?]
at io.netty.channel.DefaultChannelPromise.await(DefaultChannelPromise.java:131) ~[model-server.jar:?]
at io.netty.channel.DefaultChannelPromise.await(DefaultChannelPromise.java:30) ~[model-server.jar:?]
at io.netty.util.concurrent.DefaultPromise.sync(DefaultPromise.java:403) ~[model-server.jar:?]
at io.netty.channel.DefaultChannelPromise.sync(DefaultChannelPromise.java:119) ~[model-server.jar:?]
at io.netty.channel.DefaultChannelPromise.sync(DefaultChannelPromise.java:30) ~[model-server.jar:?]
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:201) [model-server.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
It seems related to https://github.com/pytorch/serve/issues/2357
TS uses a back-off to schedule worker recovery; you can see the recovery schedule in the log. When a worker dies, TS tries to recover a backend worker immediately for the first 5 rounds, and each recovered worker tries to load the model into memory right away. That's why the memory does not appear to be released.

So it seems TorchServe waits about 2 minutes for the worker to start, then decides it's unresponsive and kills it, and my guess is this happens again and again.
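For intuition only, here is a minimal sketch of the kind of back-off recovery schedule described above. It is not TorchServe's actual implementation, and the retry counts and delays are assumptions chosen for illustration.

def recovery_delays(max_retries=10, immediate_rounds=5, base_seconds=1, cap_seconds=60):
    """Illustrative back-off schedule: retry immediately for the first few
    rounds, then back off exponentially (constants are assumptions, not
    TorchServe's real values)."""
    for attempt in range(max_retries):
        if attempt < immediate_rounds:
            yield 0  # immediate retry: the worker reloads the model right away
        else:
            # exponential back-off after the immediate rounds, capped
            yield min(base_seconds * 2 ** (attempt - immediate_rounds), cap_seconds)

# Example: print the schedule a dead worker would follow
for i, delay in enumerate(recovery_delays(), start=1):
    print(f"recovery attempt {i}: wait {delay}s, then restart the worker and reload the model")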
- I believe my inference code indeed has a slow startup time, so I could try optimizing that.
- I could also increase the timeout: it is currently about 120 seconds, and maybe I could raise it (a config sketch follows this list).
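For reference, a minimal config.properties sketch along those lines; the value below is only an example, not a recommendation:

# config.properties sketch (values are examples)
# Give slow-loading workers more time before they are declared unresponsive; the default is 120 seconds.
default_response_timeout=600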
The main issue is that I have install_py_dep_per_model=true set, because the model has requirements that need to be installed. That seems to be the main reason for the slow startup time.
Yeah, that flag makes everything so slow. I'd suggest either consolidating your dependencies for all models or passing your venv in via extra files; a packaging sketch is below.
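To illustrate the first option, here is a hedged sketch: install one consolidated requirements file into the serving environment ahead of time, so workers no longer pip-install anything at startup, then archive the model without a per-model requirements file. File and path names are placeholders.

# Install shared dependencies once, up front (filename is a placeholder)
pip install -r consolidated-requirements.txt

# Package the model without a per-model requirements file
# (handler name and extra-files path are placeholders for this sketch)
torch-model-archiver \
    --model-name stable-diffusion \
    --version 1.0 \
    --handler stable_diffusion_handler.py \
    --extra-files ./stable-diffusion \
    --export-path model_store

With dependencies consolidated this way, install_py_dep_per_model can stay off, so worker startup no longer includes a pip install step.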
I have this same error, but on a VM (not using Docker), with torchserve==0.8.1.
from abc import ABC

import torch
from diffusers import ControlNetModel, StableDiffusionControlNetImg2ImgPipeline
from ts.torch_handler.base_handler import BaseHandler


class DiffusersHandler(BaseHandler, ABC):
    """
    Diffusers handler class for text-to-image generation.
    """

    def __init__(self):
        super(DiffusersHandler, self).__init__()
        self.initialized = False

    def initialize(self, ctx):
        """Load and initialize the Stable Diffusion ControlNet pipeline.

        Args:
            ctx (context): Context object containing information about the
                model artifacts and parameters.
        """
        controlnet = ControlNetModel.from_pretrained(
            "./stable-diffusion/controlnet", torch_dtype=torch.float16
        )
        # Keep the pipeline on the handler so later inference calls can reach it.
        self.pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
            "./stable-diffusion",
            controlnet=controlnet,
            safety_checker=None,
            torch_dtype=torch.float16,
        )
        self.pipe.to("cuda")
        self.initialized = True
Logs show the worker start/stop:
2023-07-02T09:24:04,960 [INFO ] epollEventLoopGroup-5-3 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2023-07-02T09:24:04,960 [DEBUG] W-9000-stable-diffusion_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2023-07-02T09:24:04,960 [DEBUG] W-9000-stable-diffusion_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: null
at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1081) ~[?:?]
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:276) ~[?:?]
at org.pytorch.serve.wlm.WorkerThread.connect(WorkerThread.java:415) ~[model-server.jar:?]
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:183) [model-server.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
2023-07-02T09:24:04,962 [DEBUG] W-9000-stable-diffusion_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-stable-diffusion_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2023-07-02T09:24:04,962 [WARN ] W-9000-stable-diffusion_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery failed again
This is from following the TorchServe diffusers tutorial here.
Why would TorchServe workers restart over and over? I'm using the torchserve Docker image, version 0.8.0.
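Given the discussion above, a plausible cause is that the Stable Diffusion pipeline takes longer to load than the default 120-second response timeout, so each new worker is declared unresponsive and killed mid-load. As a hedged sketch (archive name, port, and timeout value are assumptions), the timeout can also be raised for just this model when registering it through the management API:

# Sketch: register the model with a longer response timeout (600s is only an example);
# assumes the management API on its default port 8081 and stable-diffusion.mar in the model store
curl -X POST "http://localhost:8081/models?url=stable-diffusion.mar&initial_workers=1&synchronous=true&response_timeout=600"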