chauhang closed this issue 4 years ago.
Similar experience here:
(test_torchserve) ubuntu@ip-172-31-44-208:~/serve$ torchserve --stop
TorchServe has stopped.
2020-04-17 11:51:09,674 [INFO ] epollEventLoopGroup-4-2 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_MODEL_LOADED
2020-04-17 11:51:09,676 [DEBUG] W-9000-densenet161_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException
at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056)
at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133)
at java.base/java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:513)
at java.base/java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:675)
at org.pytorch.serve.wlm.Model.pollBatch(Model.java:155)
at org.pytorch.serve.wlm.BatchAggregator.getRequest(BatchAggregator.java:33)
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:123)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
2020-04-17 11:51:09,683 [DEBUG] W-9000-densenet161_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-densenet161_1.0 State change WORKER_MODEL_LOADED -> WORKER_STOPPED
2020-04-17 11:51:09,679 [INFO ] epollEventLoopGroup-2-2 org.pytorch.serve.ModelServer - Management model server stopped.
2020-04-17 11:51:09,679 [INFO ] epollEventLoopGroup-2-1 org.pytorch.serve.ModelServer - Inference model server stopped.
2020-04-17 11:51:09,688 [INFO ] W-9000-densenet161_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.
(test_torchserve) ubuntu@ip-172-31-44-208:~/serve$ 2020-04-17 11:51:09,704 [INFO ] epollEventLoopGroup-4-1 org.pytorch.serve.wlm.WorkerThread - 9001 Worker disconnected. WORKER_MODEL_LOADED
2020-04-17 11:51:09,705 [DEBUG] W-9001-densenet161_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException
at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056)
at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133)
at java.base/java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:513)
at java.base/java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:675)
at org.pytorch.serve.wlm.Model.pollBatch(Model.java:155)
at org.pytorch.serve.wlm.BatchAggregator.getRequest(BatchAggregator.java:33)
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:123)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
2020-04-17 11:51:09,705 [DEBUG] W-9001-densenet161_1.0 org.pytorch.serve.wlm.WorkerThread - W-9001-densenet161_1.0 State change WORKER_MODEL_LOADED -> WORKER_STOPPED
2020-04-17 11:51:09,706 [INFO ] W-9001-densenet161_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9001 in 1 seconds.
2020-04-17 11:51:09,852 [INFO ] pool-2-thread-2 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:ip-172-31-44-208,timestamp:1587124269
2020-04-17 11:51:09,853 [INFO ] pool-2-thread-2 TS_METRICS - DiskAvailable.Gigabytes:16.475448608398438|#Level:Host|#hostname:ip-172-31-44-208,timestamp:1587124269
2020-04-17 11:51:09,853 [INFO ] pool-2-thread-2 TS_METRICS - DiskUsage.Gigabytes:70.69266128540039|#Level:Host|#hostname:ip-172-31-44-208,timestamp:1587124269
2020-04-17 11:51:09,854 [INFO ] pool-2-thread-2 TS_METRICS - DiskUtilization.Percent:81.1|#Level:Host|#hostname:ip-172-31-44-208,timestamp:1587124269
2020-04-17 11:51:09,855 [INFO ] pool-2-thread-2 TS_METRICS - MemoryAvailable.Megabytes:3382.79296875|#Level:Host|#hostname:ip-172-31-44-208,timestamp:1587124269
2020-04-17 11:51:09,855 [INFO ] pool-2-thread-2 TS_METRICS - MemoryUsed.Megabytes:300.98046875|#Level:Host|#hostname:ip-172-31-44-208,timestamp:1587124269
2020-04-17 11:51:09,856 [INFO ] pool-2-thread-2 TS_METRICS - MemoryUtilization.Percent:14.2|#Level:Host|#hostname:ip-172-31-44-208,timestamp:1587124269
2020-04-17 11:51:09,872 [ERROR] Thread-4 org.pytorch.serve.metrics.MetricCollector - --- Logging error ---
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/test_torchserve/lib/python3.6/site-packages/psutil/_common.py", line 449, in wrapper
ret = self._cache[fun]
AttributeError: _cache
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/test_torchserve/lib/python3.6/site-packages/psutil/_pslinux.py", line 1516, in wrapper
return fun(self, *args, **kwargs)
File "/home/ubuntu/anaconda3/envs/test_torchserve/lib/python3.6/site-packages/psutil/_common.py", line 452, in wrapper
return fun(self)
File "/home/ubuntu/anaconda3/envs/test_torchserve/lib/python3.6/site-packages/psutil/_pslinux.py", line 1558, in _parse_stat_file
with open_binary("%s/%s/stat" % (self._procfs_path, self.pid)) as f:
File "/home/ubuntu/anaconda3/envs/test_torchserve/lib/python3.6/site-packages/psutil/_common.py", line 713, in open_binary
return open(fname, "rb", **kwargs)
FileNotFoundError: [Errno 2] No such file or directory: '/proc/21880/stat'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/test_torchserve/lib/python3.6/site-packages/psutil/__init__.py", line 373, in _init
self.create_time()
File "/home/ubuntu/anaconda3/envs/test_torchserve/lib/python3.6/site-packages/psutil/__init__.py", line 723, in create_time
self._create_time = self._proc.create_time()
File "/home/ubuntu/anaconda3/envs/test_torchserve/lib/python3.6/site-packages/psutil/_pslinux.py", line 1516, in wrapper
return fun(self, *args, **kwargs)
File "/home/ubuntu/anaconda3/envs/test_torchserve/lib/python3.6/site-packages/psutil/_pslinux.py", line 1728, in create_time
ctime = float(self._parse_stat_file()['create_time'])
File "/home/ubuntu/anaconda3/envs/test_torchserve/lib/python3.6/site-packages/psutil/_pslinux.py", line 1523, in wrapper
raise NoSuchProcess(self.pid, self._name)
psutil.NoSuchProcess: psutil.NoSuchProcess process no longer exists (pid=21880)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/test_torchserve/lib/python3.6/site-packages/ts/metrics/process_memory_metric.py", line 20, in get_cpu_usage
process = psutil.Process(int(pid))
File "/home/ubuntu/anaconda3/envs/test_torchserve/lib/python3.6/site-packages/psutil/__init__.py", line 346, in __init__
self._init(pid)
File "/home/ubuntu/anaconda3/envs/test_torchserve/lib/python3.6/site-packages/psutil/__init__.py", line 386, in _init
raise NoSuchProcess(pid, None, msg)
psutil.NoSuchProcess: psutil.NoSuchProcess no process found with pid 21880
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/test_torchserve/lib/python3.6/logging/__init__.py", line 998, in emit
self.flush()
File "/home/ubuntu/anaconda3/envs/test_torchserve/lib/python3.6/logging/__init__.py", line 978, in flush
self.stream.flush()
BrokenPipeError: [Errno 32] Broken pipe
Call stack:
File "ts/metrics/metric_collector.py", line 18, in <module>
check_process_mem_usage(sys.stdin)
File "/home/ubuntu/anaconda3/envs/test_torchserve/lib/python3.6/site-packages/ts/metrics/process_memory_metric.py", line 40, in check_process_mem_usage
logging.info("%s:%d", process, get_cpu_usage(process))
File "/home/ubuntu/anaconda3/envs/test_torchserve/lib/python3.6/site-packages/ts/metrics/process_memory_metric.py", line 22, in get_cpu_usage
logging.error("Failed get process for pid: %s", pid, exc_info=True)
Message: 'Failed get process for pid: %s'
Arguments: ('21880',)
--- Logging error ---
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/test_torchserve/lib/python3.6/logging/__init__.py", line 998, in emit
self.flush()
File "/home/ubuntu/anaconda3/envs/test_torchserve/lib/python3.6/logging/__init__.py", line 978, in flush
self.stream.flush()
BrokenPipeError: [Errno 32] Broken pipe
Call stack:
File "ts/metrics/metric_collector.py", line 18, in <module>
check_process_mem_usage(sys.stdin)
File "/home/ubuntu/anaconda3/envs/test_torchserve/lib/python3.6/site-packages/ts/metrics/process_memory_metric.py", line 40, in check_process_mem_usage
logging.info("%s:%d", process, get_cpu_usage(process))
Message: '%s:%d'
Arguments: ('21880', 0)
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='ANSI_X3.4-1968'>
BrokenPipeError: [Errno 32] Broken pipe
2020-04-17 11:51:10,794 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Listening on port: /tmp/.ts.sock.9000
2020-04-17 11:51:10,795 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - [PID]22040
2020-04-17 11:51:10,795 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Torch worker started.
2020-04-17 11:51:10,795 [DEBUG] W-9000-densenet161_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-densenet161_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2020-04-17 11:51:10,797 [INFO ] W-9000-densenet161_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.9000
2020-04-17 11:51:10,797 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Python runtime: 3.6.10
2020-04-17 11:51:10,800 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Connection accepted: /tmp/.ts.sock.9000.
2020-04-17 11:51:10,805 [INFO ] W-9001-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Listening on port: /tmp/.ts.sock.9001
2020-04-17 11:51:10,805 [INFO ] W-9001-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - [PID]22043
2020-04-17 11:51:10,806 [INFO ] W-9001-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Torch worker started.
2020-04-17 11:51:10,806 [DEBUG] W-9001-densenet161_1.0 org.pytorch.serve.wlm.WorkerThread - W-9001-densenet161_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2020-04-17 11:51:10,806 [INFO ] W-9001-densenet161_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.9001
2020-04-17 11:51:10,806 [INFO ] W-9001-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Python runtime: 3.6.10
2020-04-17 11:51:10,808 [INFO ] W-9001-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Connection accepted: /tmp/.ts.sock.9001.
2020-04-17 11:51:11,913 [INFO ] main org.pytorch.serve.ModelServer - Torchserve stopped.
(test_torchserve) ubuntu@ip-172-31-44-208:~/serve$
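The psutil traceback in the log comes from the metric collector querying PID 21880 after that worker had already exited, so `/proc/21880/stat` was gone and `psutil.Process` raised `NoSuchProcess`. A minimal sketch, not TorchServe's actual code, of guarding such a lookup using only the stdlib (`os.kill` with signal 0; psutil's `NoSuchProcess` corresponds to `ProcessLookupError` here):

```python
import os

def process_alive(pid: int) -> bool:
    """Check whether a PID still exists, without sending a real signal."""
    try:
        os.kill(pid, 0)  # signal 0: existence/permission check only
        return True
    except ProcessLookupError:
        # The process exited between enumeration and lookup;
        # a metrics pass should skip it rather than crash.
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
```

A collector loop could call this before constructing the `psutil.Process` object and simply skip PIDs that return `False`.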
This is a duplicate of #155.
Yes, this is the same thing I noticed in #155. I did note that it did not stop TorchServe from emitting a correct configuration snapshot, nor did it stop it from restarting and operating correctly afterward.

@chauhang, you might want to re-check your ^C experience. The TorchServe log tail you see in the terminal does not interrupt normal operation of the command line. As far as your bash session is concerned, it is still waiting for input at the prompt it printed after you ran torchserve --start. You can issue shell commands from that same session even while the TorchServe log tail is scrolling by (try an ls to verify). This means that when you see the final main org.pytorch.serve.ModelServer - Torchserve stopped. message, TorchServe really is stopped; you don't need to hit ^C. You're not seeing a command prompt only because bash is still waiting for input at the last one. Hit Return and you should get a prompt.
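The point about the prompt staying live can be demonstrated with any background job that writes to the terminal, the way the torchserve log tail does:

```shell
# A background job that prints after a short delay, like the log tail.
( sleep 1; echo "background log line" ) &
# The foreground shell is still interactive; this runs immediately,
# before the background job's output appears.
echo "foreground still works"
wait   # reap the background job before exiting
```

The foreground `echo` executes right away even though output from the background job will later scroll over the same terminal, which is exactly why commands still work while TorchServe's log lines are printing.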
I'm not sure if this is the OP's exact issue, but going through the model_log.log file in the /logs folder I found that TorchServe doesn't like `*args, **kwargs` in the model's defining parameters. After deleting these, the error went away.
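A hypothetical illustration of the signature change that comment describes, using plain Python classes with invented names in place of a real torch.nn.Module:

```python
# Before: variadic parameters, which the commenter reports TorchServe's
# model loading did not handle.
class FlexibleModel:
    def __init__(self, *args, **kwargs):
        self.hidden_size = kwargs.get("hidden_size", 128)

# After: explicit, introspectable parameters with defaults.
class ExplicitModel:
    def __init__(self, hidden_size: int = 128):
        self.hidden_size = hidden_size
```

An explicit signature also lets tooling inspect the constructor's parameters by name, which `*args, **kwargs` hides.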
On stopping TorchServe with torchserve --stop, a number of errors get logged to the console and the program does not exit; one has to hit Ctrl-C to kill the process. The errors are thrown on stopping even if there are no active inference requests and the server has been idle for a few hours. Errors: