pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

On stopping TorchServe, a number of errors are logged and the program does not exit #208

Closed: chauhang closed this issue 4 years ago

chauhang commented 4 years ago

On stopping TorchServe with torchserve --stop, a number of errors are logged to the console and the program does not exit; you have to hit Ctrl-C to kill the process. The errors are thrown on stop even when there are no active inference requests and the server has been idle for a few hours.

Errors:

torchserve --stop
TorchServe has stopped.
2020-04-17 05:34:17,093 [INFO ] epollEventLoopGroup-2-2 org.pytorch.serve.ModelServer - Management model server stopped.
2020-04-17 05:34:17,093 [INFO ] epollEventLoopGroup-2-1 org.pytorch.serve.ModelServer - Inference model server stopped.
(serve) ubuntu@ip-172-31-45-171:~/work$ 2020-04-17 05:34:17,326 [INFO ] epollEventLoopGroup-4-2 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_MODEL_LOADED
2020-04-17 05:34:17,327 [WARN ] W-9000-my_text_classifier_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker thread exception.
java.lang.IllegalMonitorStateException
    at java.base/java.util.concurrent.locks.ReentrantLock$Sync.tryRelease(ReentrantLock.java:149)
    at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.release(AbstractQueuedSynchronizer.java:1302)
    at java.base/java.util.concurrent.locks.ReentrantLock.unlock(ReentrantLock.java:439)
    at org.pytorch.serve.wlm.Model.pollBatch(Model.java:175)
    at org.pytorch.serve.wlm.BatchAggregator.getRequest(BatchAggregator.java:33)
    at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:123)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
2020-04-17 05:34:17,329 [DEBUG] W-9000-my_text_classifier_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my_text_classifier_1.0 State change WORKER_MODEL_LOADED -> WORKER_STOPPED
2020-04-17 05:34:17,330 [INFO ] W-9000-my_text_classifier_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.
2020-04-17 05:34:17,518 [INFO ] epollEventLoopGroup-4-1 org.pytorch.serve.wlm.WorkerThread - 9001 Worker disconnected. WORKER_MODEL_LOADED
2020-04-17 05:34:17,518 [DEBUG] W-9001-my_text_classifier_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException
    at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056)
    at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133)
    at java.base/java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:513)
    at java.base/java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:675)
    at org.pytorch.serve.wlm.Model.pollBatch(Model.java:155)
    at org.pytorch.serve.wlm.BatchAggregator.getRequest(BatchAggregator.java:33)
    at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:123)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
2020-04-17 05:34:17,520 [DEBUG] W-9001-my_text_classifier_1.0 org.pytorch.serve.wlm.WorkerThread - W-9001-my_text_classifier_1.0 State change WORKER_MODEL_LOADED -> WORKER_STOPPED
2020-04-17 05:34:17,520 [INFO ] W-9001-my_text_classifier_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9001 in 1 seconds.
2020-04-17 05:34:18,398 [INFO ] W-9000-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Listening on port: /tmp/.ts.sock.9000
2020-04-17 05:34:18,398 [INFO ] W-9000-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - [PID]8693
2020-04-17 05:34:18,398 [INFO ] W-9000-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Torch worker started.
2020-04-17 05:34:18,398 [DEBUG] W-9000-my_text_classifier_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my_text_classifier_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2020-04-17 05:34:18,398 [INFO ] W-9000-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Python runtime: 3.8.2
2020-04-17 05:34:18,398 [INFO ] W-9000-my_text_classifier_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.9000
2020-04-17 05:34:18,400 [INFO ] W-9000-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Connection accepted: /tmp/.ts.sock.9000.
2020-04-17 05:34:18,586 [INFO ] W-9001-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Listening on port: /tmp/.ts.sock.9001
2020-04-17 05:34:18,586 [INFO ] W-9001-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - [PID]8698
2020-04-17 05:34:18,586 [INFO ] W-9001-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Torch worker started.
2020-04-17 05:34:18,586 [INFO ] W-9001-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Python runtime: 3.8.2
2020-04-17 05:34:18,586 [DEBUG] W-9001-my_text_classifier_1.0 org.pytorch.serve.wlm.WorkerThread - W-9001-my_text_classifier_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2020-04-17 05:34:18,586 [INFO ] W-9001-my_text_classifier_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.9001
2020-04-17 05:34:18,588 [INFO ] W-9001-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Connection accepted: /tmp/.ts.sock.9001.
2020-04-17 05:34:19,229 [INFO ] W-9000-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Model loaded on device cuda:1 
2020-04-17 05:34:19,253 [INFO ] W-9001-my_text_classifier_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Model loaded on device cuda:2 
2020-04-17 05:34:19,308 [INFO ] main org.pytorch.serve.ModelServer - Torchserve stopped.
aaronmarkham commented 4 years ago

Similar experience here:

(test_torchserve) ubuntu@ip-172-31-44-208:~/serve$ torchserve --stop
TorchServe has stopped.
2020-04-17 11:51:09,674 [INFO ] epollEventLoopGroup-4-2 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_MODEL_LOADED
2020-04-17 11:51:09,676 [DEBUG] W-9000-densenet161_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException
    at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056)
    at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133)
    at java.base/java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:513)
    at java.base/java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:675)
    at org.pytorch.serve.wlm.Model.pollBatch(Model.java:155)
    at org.pytorch.serve.wlm.BatchAggregator.getRequest(BatchAggregator.java:33)
    at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:123)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
2020-04-17 11:51:09,683 [DEBUG] W-9000-densenet161_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-densenet161_1.0 State change WORKER_MODEL_LOADED -> WORKER_STOPPED
2020-04-17 11:51:09,679 [INFO ] epollEventLoopGroup-2-2 org.pytorch.serve.ModelServer - Management model server stopped.
2020-04-17 11:51:09,679 [INFO ] epollEventLoopGroup-2-1 org.pytorch.serve.ModelServer - Inference model server stopped.
2020-04-17 11:51:09,688 [INFO ] W-9000-densenet161_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.
(test_torchserve) ubuntu@ip-172-31-44-208:~/serve$ 2020-04-17 11:51:09,704 [INFO ] epollEventLoopGroup-4-1 org.pytorch.serve.wlm.WorkerThread - 9001 Worker disconnected. WORKER_MODEL_LOADED
2020-04-17 11:51:09,705 [DEBUG] W-9001-densenet161_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException
    at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056)
    at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133)
    at java.base/java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:513)
    at java.base/java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:675)
    at org.pytorch.serve.wlm.Model.pollBatch(Model.java:155)
    at org.pytorch.serve.wlm.BatchAggregator.getRequest(BatchAggregator.java:33)
    at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:123)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
2020-04-17 11:51:09,705 [DEBUG] W-9001-densenet161_1.0 org.pytorch.serve.wlm.WorkerThread - W-9001-densenet161_1.0 State change WORKER_MODEL_LOADED -> WORKER_STOPPED
2020-04-17 11:51:09,706 [INFO ] W-9001-densenet161_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9001 in 1 seconds.
2020-04-17 11:51:09,852 [INFO ] pool-2-thread-2 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:ip-172-31-44-208,timestamp:1587124269
2020-04-17 11:51:09,853 [INFO ] pool-2-thread-2 TS_METRICS - DiskAvailable.Gigabytes:16.475448608398438|#Level:Host|#hostname:ip-172-31-44-208,timestamp:1587124269
2020-04-17 11:51:09,853 [INFO ] pool-2-thread-2 TS_METRICS - DiskUsage.Gigabytes:70.69266128540039|#Level:Host|#hostname:ip-172-31-44-208,timestamp:1587124269
2020-04-17 11:51:09,854 [INFO ] pool-2-thread-2 TS_METRICS - DiskUtilization.Percent:81.1|#Level:Host|#hostname:ip-172-31-44-208,timestamp:1587124269
2020-04-17 11:51:09,855 [INFO ] pool-2-thread-2 TS_METRICS - MemoryAvailable.Megabytes:3382.79296875|#Level:Host|#hostname:ip-172-31-44-208,timestamp:1587124269
2020-04-17 11:51:09,855 [INFO ] pool-2-thread-2 TS_METRICS - MemoryUsed.Megabytes:300.98046875|#Level:Host|#hostname:ip-172-31-44-208,timestamp:1587124269
2020-04-17 11:51:09,856 [INFO ] pool-2-thread-2 TS_METRICS - MemoryUtilization.Percent:14.2|#Level:Host|#hostname:ip-172-31-44-208,timestamp:1587124269
2020-04-17 11:51:09,872 [ERROR] Thread-4 org.pytorch.serve.metrics.MetricCollector - --- Logging error ---
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/test_torchserve/lib/python3.6/site-packages/psutil/_common.py", line 449, in wrapper
    ret = self._cache[fun]
AttributeError: _cache

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/test_torchserve/lib/python3.6/site-packages/psutil/_pslinux.py", line 1516, in wrapper
    return fun(self, *args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/test_torchserve/lib/python3.6/site-packages/psutil/_common.py", line 452, in wrapper
    return fun(self)
  File "/home/ubuntu/anaconda3/envs/test_torchserve/lib/python3.6/site-packages/psutil/_pslinux.py", line 1558, in _parse_stat_file
    with open_binary("%s/%s/stat" % (self._procfs_path, self.pid)) as f:
  File "/home/ubuntu/anaconda3/envs/test_torchserve/lib/python3.6/site-packages/psutil/_common.py", line 713, in open_binary
    return open(fname, "rb", **kwargs)
FileNotFoundError: [Errno 2] No such file or directory: '/proc/21880/stat'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/test_torchserve/lib/python3.6/site-packages/psutil/__init__.py", line 373, in _init
    self.create_time()
  File "/home/ubuntu/anaconda3/envs/test_torchserve/lib/python3.6/site-packages/psutil/__init__.py", line 723, in create_time
    self._create_time = self._proc.create_time()
  File "/home/ubuntu/anaconda3/envs/test_torchserve/lib/python3.6/site-packages/psutil/_pslinux.py", line 1516, in wrapper
    return fun(self, *args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/test_torchserve/lib/python3.6/site-packages/psutil/_pslinux.py", line 1728, in create_time
    ctime = float(self._parse_stat_file()['create_time'])
  File "/home/ubuntu/anaconda3/envs/test_torchserve/lib/python3.6/site-packages/psutil/_pslinux.py", line 1523, in wrapper
    raise NoSuchProcess(self.pid, self._name)
psutil.NoSuchProcess: psutil.NoSuchProcess process no longer exists (pid=21880)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/test_torchserve/lib/python3.6/site-packages/ts/metrics/process_memory_metric.py", line 20, in get_cpu_usage
    process = psutil.Process(int(pid))
  File "/home/ubuntu/anaconda3/envs/test_torchserve/lib/python3.6/site-packages/psutil/__init__.py", line 346, in __init__
    self._init(pid)
  File "/home/ubuntu/anaconda3/envs/test_torchserve/lib/python3.6/site-packages/psutil/__init__.py", line 386, in _init
    raise NoSuchProcess(pid, None, msg)
psutil.NoSuchProcess: psutil.NoSuchProcess no process found with pid 21880

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/test_torchserve/lib/python3.6/logging/__init__.py", line 998, in emit
    self.flush()
  File "/home/ubuntu/anaconda3/envs/test_torchserve/lib/python3.6/logging/__init__.py", line 978, in flush
    self.stream.flush()
BrokenPipeError: [Errno 32] Broken pipe
Call stack:
  File "ts/metrics/metric_collector.py", line 18, in <module>
    check_process_mem_usage(sys.stdin)
  File "/home/ubuntu/anaconda3/envs/test_torchserve/lib/python3.6/site-packages/ts/metrics/process_memory_metric.py", line 40, in check_process_mem_usage
    logging.info("%s:%d", process, get_cpu_usage(process))
  File "/home/ubuntu/anaconda3/envs/test_torchserve/lib/python3.6/site-packages/ts/metrics/process_memory_metric.py", line 22, in get_cpu_usage
    logging.error("Failed get process for pid: %s", pid, exc_info=True)
Message: 'Failed get process for pid: %s'
Arguments: ('21880',)
--- Logging error ---
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/test_torchserve/lib/python3.6/logging/__init__.py", line 998, in emit
    self.flush()
  File "/home/ubuntu/anaconda3/envs/test_torchserve/lib/python3.6/logging/__init__.py", line 978, in flush
    self.stream.flush()
BrokenPipeError: [Errno 32] Broken pipe
Call stack:
  File "ts/metrics/metric_collector.py", line 18, in <module>
    check_process_mem_usage(sys.stdin)
  File "/home/ubuntu/anaconda3/envs/test_torchserve/lib/python3.6/site-packages/ts/metrics/process_memory_metric.py", line 40, in check_process_mem_usage
    logging.info("%s:%d", process, get_cpu_usage(process))
Message: '%s:%d'
Arguments: ('21880', 0)
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='ANSI_X3.4-1968'>
BrokenPipeError: [Errno 32] Broken pipe

2020-04-17 11:51:10,794 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Listening on port: /tmp/.ts.sock.9000
2020-04-17 11:51:10,795 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - [PID]22040
2020-04-17 11:51:10,795 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Torch worker started.
2020-04-17 11:51:10,795 [DEBUG] W-9000-densenet161_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-densenet161_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2020-04-17 11:51:10,797 [INFO ] W-9000-densenet161_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.9000
2020-04-17 11:51:10,797 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Python runtime: 3.6.10
2020-04-17 11:51:10,800 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Connection accepted: /tmp/.ts.sock.9000.
2020-04-17 11:51:10,805 [INFO ] W-9001-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Listening on port: /tmp/.ts.sock.9001
2020-04-17 11:51:10,805 [INFO ] W-9001-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - [PID]22043
2020-04-17 11:51:10,806 [INFO ] W-9001-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Torch worker started.
2020-04-17 11:51:10,806 [DEBUG] W-9001-densenet161_1.0 org.pytorch.serve.wlm.WorkerThread - W-9001-densenet161_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2020-04-17 11:51:10,806 [INFO ] W-9001-densenet161_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.9001
2020-04-17 11:51:10,806 [INFO ] W-9001-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Python runtime: 3.6.10
2020-04-17 11:51:10,808 [INFO ] W-9001-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Connection accepted: /tmp/.ts.sock.9001.
2020-04-17 11:51:11,913 [INFO ] main org.pytorch.serve.ModelServer - Torchserve stopped.

(test_torchserve) ubuntu@ip-172-31-44-208:~/serve$ 
harshbafna commented 4 years ago

This is a duplicate of #155.

fbbradheintz commented 4 years ago

Yes, this is the same thing I noticed in #155. I did note that it did not seem to stop TorchServe from emitting a correct configuration snapshot, nor did it prevent it from restarting and operating correctly afterward.

@chauhang - you might want to re-check your ^C experience. The TorchServe log tail you see in the terminal does not interrupt normal operation of the command line. As far as your bash session is concerned, it is still waiting for input at the prompt it put up after you ran torchserve --start. You can actually issue shell commands from that same session, even while the TorchServe log tail is scrolling by. (Try an ls to verify.) This means that when you get the final main org.pytorch.serve.ModelServer - Torchserve stopped. message, TorchServe really is stopped; you don't need to hit ^C. You're not seeing a command prompt because bash is still waiting for input from the last one. Hit Return and you should get a prompt.
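
If you want to double-check that the frontend really exited once that final message appears, here is a minimal sketch of a probe against the management API (this assumes the default management port 8081 on localhost; adjust if you changed it in config.properties):

import urllib.request

# Probe the management API; a refused connection or timeout means the frontend JVM is gone.
try:
    with urllib.request.urlopen("http://localhost:8081/models", timeout=2) as resp:
        print("TorchServe is still running, HTTP status:", resp.status)
except OSError:
    print("TorchServe has stopped")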

CerebralSeed commented 2 years ago

I'm not sure if this is the OP's exact issue, but going through the model_log.log file in the /logs folder I found that TorchServe doesn't like *args, **kwargs in the model's constructor parameters. After deleting these, the error went away.
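
For illustration, a minimal sketch of the change I mean; the model name and layers here are made up, only the constructor signature matters:

import torch.nn as nn

# Constructor signature that gave me trouble when the archived model was loaded:
class MyModel(nn.Module):
    def __init__(self, *args, **kwargs):
        super().__init__()
        self.fc = nn.Linear(128, 2)

    def forward(self, x):
        return self.fc(x)

# Constructor with explicit parameters, after which the error in model_log.log went away:
class MyModelFixed(nn.Module):
    def __init__(self, in_features=128, num_classes=2):
        super().__init__()
        self.fc = nn.Linear(in_features, num_classes)

    def forward(self, x):
        return self.fc(x)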