You make subprocess calls that block the execution. With an async server like Sanic you should await calls that may take time to finish, and this you cannot do with subprocess.call.
You will need an alternative way of running your commands using async/await.
Do you have any suggestions on how to make these calls?
Also, on my other server running Sanic that freezes, there are no calls to subprocess.call, but it's more complex (websockets, calls to the database, Redis and the like).
One alternative: https://docs.python.org/3/library/asyncio-subprocess.html
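For reference, a minimal sketch of what that looks like in a handler (the route and command are purely illustrative):

import asyncio

from sanic import Request, Sanic, json

app = Sanic("AsyncSubprocessExample")


@app.get("/run")
async def run_command(request: Request):
    # create_subprocess_shell yields control back to the event loop while
    # the external command runs, so other requests keep being served.
    proc = await asyncio.create_subprocess_shell(
        "ls -l",
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, stderr = await proc.communicate()
    return json({"returncode": proc.returncode, "stdout": stdout.decode()})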
Similarly with Redis and any other database, make sure that all your calls are asynchronous. Some examples with aioredis: https://aioredis.readthedocs.io/en/latest/getting-started/#response-decoding
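And a minimal aioredis 2.x sketch of the same idea for Redis (the URL and key are made up):

import aioredis


async def cache_roundtrip():
    redis = aioredis.from_url("redis://localhost")
    # Both calls are awaited, so the event loop is never blocked.
    await redis.set("greeting", "hello")
    value = await redis.get("greeting")
    await redis.close()
    return value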
Btw, I used Bing Chat to find these answers quickly. I suggest you get ChatGPT, Bing Chat or some other chatbot for such quick answers to difficult problems. Ask them for an async/await alternative to what you are currently using. ChatGPT might not be ideal because it doesn't know of any newer libraries (its knowledge is cut off in 2021), so you.com, GitHub Copilot or Bing Chat might be preferable.
Thank you, I'll try. But I'm not sure this will resolve the issue (I've been experiencing this "hanging" issue across many Sanic instances on various servers).
It will hang if you do anything that takes time without await on the blocking call (database queries, making HTTP requests, file I/O and running external commands being prime examples). If you are doing a lengthy calculation, or don't have a library available that uses async/await, you can use a thread to do this:
await asyncio.get_event_loop().run_in_executor(None, your_function, args...)
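A slightly fuller sketch of that pattern inside a Sanic handler (the route and expensive_sync_work are hypothetical):

import asyncio
import hashlib

from sanic import Request, Sanic, json

app = Sanic("ExecutorExample")


def expensive_sync_work(data: bytes) -> str:
    # Stand-in for CPU-heavy or blocking work with no async equivalent.
    return hashlib.sha256(data).hexdigest()


@app.post("/digest")
async def digest(request: Request):
    loop = asyncio.get_event_loop()
    # The blocking function runs in the default thread pool executor,
    # leaving the event loop free to serve other requests.
    result = await loop.run_in_executor(None, expensive_sync_work, request.body)
    return json({"sha256": result})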
Yes, using asyncio subprocess is better because it will yield back to the loop. But that alone is not necessarily enough. If you have nothing else to yield to and you're awaiting that response, it is still blocking. This is why IO is best in loops with timeouts, etc.
If you are making lengthy subprocess calls, I suggest another pattern of pushing off background work that can be checked in a loop, or that can be sent to another process.
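Something along these lines, as a rough sketch (the routes, job bookkeeping and command are all made up for illustration):

import asyncio

from sanic import Request, Sanic, json

app = Sanic("BackgroundWorkExample")
app.ctx.jobs = {}


async def run_deploy(job_id: str):
    # The slow work happens outside the request/response cycle.
    proc = await asyncio.create_subprocess_shell("sleep 5")
    await proc.communicate()
    app.ctx.jobs[job_id] = "done" if proc.returncode == 0 else "failed"


@app.post("/deploy")
async def deploy(request: Request):
    job_id = "deploy-1"  # illustrative; use a real unique id
    app.ctx.jobs[job_id] = "running"
    app.add_task(run_deploy(job_id))
    return json({"job": job_id, "status": "accepted"}, status=202)


@app.get("/deploy/<job_id>")
async def deploy_status(request: Request, job_id: str):
    return json({"job": job_id, "status": app.ctx.jobs.get(job_id, "unknown")})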
One thing that might be worth sharing: the script above runs well and the server responds properly.
Then, a few weeks later (yes, weeks), that script stops responding.
I can imagine that calling a subprocess and then trying to stop the code right after might cause it to not respond, but if the command has finished, the server should behave correctly, no? So trying to send another POST request two weeks later should work?
(Or does calling the subprocess modify the state in a way that makes subsequent requests fail, no matter when?)
Wait, are you trying to live update a running instance? Or, is that meant to be updating a different script/env/application?
So it sounds like what you are saying is not that the subprocess call is blocking other requests from coming in, but that the application is just dead?
There is no inherent problem in using subprocess.call (whether the regular or async version). To prove the point:
import subprocess
import sys

from sanic import Request, Sanic, json

app = Sanic("TestApp")


@app.get("/")
async def handler(request: Request):
    assert (
        subprocess.call(
            "ls -l",
            shell=True,
            cwd="./",
            stderr=sys.stdout,
            stdout=sys.stdout,
        )
        == 0
    ), "Unable to show the directory"
    return json({"foo": "bar"})
Yes, the application is dead, but not immediately, more like after a few weeks (whether it is being used or not).
I'll try to be as verbose as possible to help.
I'm using Sanic on two projects, GetFernand.com and PDFShift.io. On Fernand, I also use Websockets. Both are using SQLAlchemy in async mode, and Redis (in case that helps).
For both of these services, I've also created the "server.py" script above, which is completely independent (but uses the same virtual environment). The purpose of this server.py script is to update the packages, update the database, update the code and trigger a reload whenever necessary. (If "requirements.txt" has changed, we trigger a package update. If there are new files in the migrations folder, we trigger a database update, etc.)
The code works fine via Github, and I can trigger a few push notifications without any problems. But after a few weeks (generally around 2 weeks), any Sanic instance stops responding.
I've shared the server.py script because I have the same issue there, and it's a pretty basic one. I have the same issues with the main applications at Fernand and PDFShift, but that code is way more complex, so it would be harder to debug.
PDFShift & Fernand are on different servers (all Debian, mostly 11). Fernand is on v22.12 while PDFShift is on v22.9.
I recently created a copy of the Fernand code to run on another server for our demo. In this case, the Redis and MySQL servers are both local. Every once in a while, the server stops accepting requests and, again, Sanic is frozen.
I have a Caddy server in front that dispatches to Sanic. Caddy works, and its logs show a 502 when trying to reach the Sanic server.
I hope this brings more details. As I've said, I unfortunately have no output I can recover from Sanic that shows a specific error that could explain it. I've been digging through the logs with no luck.
Do you have anything that is checking for health? Pinging the application every X seconds for a response.
I wonder if this would be helpful: https://sanic.dev/en/plugins/sanic-ext/health-monitor.html
Also, when the app is dead, does that just mean it is no longer accepting external requests? What if you SSH into the machine and try pinging it directly?
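Even a tiny external pinger can surface the dead state. For example, something like this (assuming a /ping endpoint on the local port; adjust the URL and what you do on failure), run from cron or a systemd timer:

import sys
import urllib.request

try:
    # Exits non-zero if /ping hangs or fails, so the caller can alert or restart.
    with urllib.request.urlopen("http://127.0.0.1:5000/ping", timeout=5) as resp:
        sys.exit(0 if resp.status == 200 else 1)
except Exception:
    sys.exit(1)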
I just enabled access_log to see if Caddy is pinging the app, and that's not the case (the access log is working properly: if I load a page manually, it shows the request). So I'd say there is no health check from Caddy.
(There is no health check at all for the above app, not even an external one or a script.)
When the app is dead, the server itself still responds and I can SSH into it, but I once tried a cURL command and it failed. (I can't recall if it failed after a timeout or immediately, but I recall no errors were written and no content was returned.)
Also, one of my big applications runs websockets, and I noticed that, somehow, one of the dev team members had kept their tab open, and a request was made every ~100 seconds to the wss endpoint. The response returned 0 bytes, but at some point it failed and returned a 502 (from Caddy).
Again, looking at the logs shows nothing, but it seems that the server stops accepting requests at some point.
Also, after restarting the server, the websocket endpoint doesn't respond for a while. From what I was able to find, it was still processing previous pending requests.
I gotta say this is kind of blurry to me and I don't have a clear pattern I can give.
One of my servers just went down. I'll update here with some findings, but if you want me to run some commands, let me know!
Ok, first, when trying to load my /ping endpoint, it takes around 3 seconds and then fails. Here's the log from Caddy:
{
"level": "error",
"ts": 1678454660.1380699,
"logger": "http.log.access.log0",
"msg": "handled request",
"request": {
"remote_ip": "my_ip_v6",
"remote_port": "56283",
"proto": "HTTP/3.0",
"method": "GET",
"host": "my.server.dev",
"uri": "/ping",
"headers": {
"Dnt": ["1"],
"Sec-Fetch-Mode": ["navigate"],
"Sec-Fetch-Site": ["none"],
"Sec-Fetch-User": ["?1"],
"Accept-Encoding": ["gzip, deflate, br"],
"Sec-Ch-Ua": ["\"Chromium\";v=\"110\", \"Not A(Brand\";v=\"24\", \"Brave\";v=\"110\""],
"Upgrade-Insecure-Requests": ["1"],
"User-Agent": ["Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"],
"Sec-Gpc": ["1"],
"Sec-Fetch-Dest": ["document"],
"Sec-Ch-Ua-Mobile": ["?0"],
"Sec-Ch-Ua-Platform": ["\"Linux\""],
"Accept": ["text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8"],
"Accept-Language": ["fr-FR,fr;q=0.9"]
},
"tls": {
"resumed": false,
"version": 0,
"cipher_suite": 0,
"proto": "",
"server_name": ""
}
},
"user_id": "",
"duration": 3.001652863,
"size": 0,
"status": 502,
"resp_headers": {
"Server": ["Caddy"]
}
}
We can notice the last parts: "duration":3.001652863,"size":0,"status":502
The local server is running on port 5000. On the server, I'm running curl --trace - "http://127.0.0.1:5000/ping". Here's the output:
== Info: Trying 127.0.0.1:5000...
That's all. I'm guessing the 3s from Caddy is its timeout for connecting to the upstream server.
Good news: the Sanic Inspector is running! (I had enabled it the last time it failed.)
#> curl http://localhost:6457
{
"meta": {
"action": "info"
},
"result": {
"info": {
"mode": "production, goin' fast single worker",
"server": "unknown",
"python": "3.9.2",
"platform": "Linux-5.10.0-19-cloud-amd64-x86_64-with-glibc2.31",
"packages": ["sanic==22.12.0", "sanic-routing==22.8.0"],
"extra": {}
},
"workers": {
"Sanic-Main": {
"pid": 754265
},
"Sanic-Server-0-0": {
"server": true,
"state": "ACKED",
"pid": 756031,
"start_at": "2023-03-09T17:09:17.486716+00:00",
"starts": 6,
"restart_at": "2023-03-09T19:48:11.687590+00:00"
},
"Sanic-Inspector-0": {
"server": false,
"state": "STARTED",
"pid": 754279,
"start_at": "2023-03-09T17:09:17.503364+00:00",
"starts": 1
}
}
}
}
(I don't know if that helps)
I could try a POST /reload, but I first want to wait in case you want to try some things.
I'll continue to update this ticket along with my findings.
Here are some errors I got in /var/log/supervisor/fernand-stderr-... :
[2023-03-09 17:09:08 +0000] [658080] [ERROR] Experienced exception while trying to serve
Traceback (most recent call last):
File "/var/www/fernand/env/lib/python3.9/site-packages/sanic/mixins/startup.py", line 862, in serve
manager.run()
File "/var/www/fernand/env/lib/python3.9/site-packages/sanic/worker/manager.py", line 95, in run
self.monitor()
File "/var/www/fernand/env/lib/python3.9/site-packages/sanic/worker/manager.py", line 197, in monitor
self._sync_states()
File "/var/www/fernand/env/lib/python3.9/site-packages/sanic/worker/manager.py", line 288, in _sync_states
state = self.worker_state[process.name].get("state")
File "<string>", line 2, in __getitem__
File "/usr/lib/python3.9/multiprocessing/managers.py", line 824, in _callmethod
raise convert_to_error(kind, result)
KeyError: 'Sanic-Server-0-0'
Traceback (most recent call last):
File "/var/www/fernand/www/server.py", line 47, in <module>
app.run(**params)
File "/var/www/fernand/env/lib/python3.9/site-packages/sanic/mixins/startup.py", line 209, in run
serve(primary=self) # type: ignore
File "/var/www/fernand/env/lib/python3.9/site-packages/sanic/mixins/startup.py", line 862, in serve
manager.run()
File "/var/www/fernand/env/lib/python3.9/site-packages/sanic/worker/manager.py", line 95, in run
self.monitor()
File "/var/www/fernand/env/lib/python3.9/site-packages/sanic/worker/manager.py", line 197, in monitor
self._sync_states()
File "/var/www/fernand/env/lib/python3.9/site-packages/sanic/worker/manager.py", line 288, in _sync_states
state = self.worker_state[process.name].get("state")
File "<string>", line 2, in __getitem__
File "/usr/lib/python3.9/multiprocessing/managers.py", line 824, in _callmethod
raise convert_to_error(kind, result)
KeyError: 'Sanic-Server-0-0'
Task was destroyed but it is pending!
task: <Task pending name='Task-1965' coro=<WebsocketFrameAssembler.get() done, defined at /var/www/fernand/env/lib/python3.9/site-packages/sanic/server/websockets/frame.py:91> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f661a20f400>()]>>
Task was destroyed but it is pending!
task: <Task pending name='Task-1966' coro=<Event.wait() done, defined at /usr/lib/python3.9/asyncio/locks.py:213> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f661bd93940>()]> cb=[_release_waiter(<Future pendi...61a20f400>()]>)() at /usr/lib/python3.9/asyncio/tasks.py:416]>
Task was destroyed but it is pending!
task: <Task pending name='Task-112' coro=<WebsocketFrameAssembler.get() done, defined at /var/www/fernand/env/lib/python3.9/site-packages/sanic/server/websockets/frame.py:91> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7fe9fa7c2d30>()]>>
Task was destroyed but it is pending!
task: <Task pending name='Task-113' coro=<Event.wait() done, defined at /usr/lib/python3.9/asyncio/locks.py:213> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7fe9f8876b80>()]> cb=[_release_waiter(<Future pendi...9fa7c2d30>()]>)() at /usr/lib/python3.9/asyncio/tasks.py:416]>
[2023-03-09 20:45:48 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-09 20:45:48 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-09 20:47:40 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-09 20:47:40 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-09 20:52:52 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-09 21:44:45 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-09 22:00:19 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-09 22:49:26 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-09 23:05:51 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-09 23:55:19 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-10 00:13:10 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-10 01:02:38 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-10 01:18:42 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-10 02:10:15 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-10 02:59:53 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-10 03:40:14 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-10 04:30:29 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-10 05:19:06 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-10 05:36:01 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-10 06:00:36 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-10 06:17:04 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-10 07:06:58 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-10 07:24:38 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-10 07:55:12 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-10 08:05:58 +0000] [756031] [WARNING] Websocket timed out waiting for pong
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python3.9/threading.py", line 954, in _bootstrap_inner
self.run()
File "/var/www/fernand/env/lib/python3.9/site-packages/watchdog/observers/api.py", line 205, in run
self.dispatch_events(self.event_queue)
File "/var/www/fernand/env/lib/python3.9/site-packages/watchdog/observers/api.py", line 381, in dispatch_events
handler.dispatch(event)
File "/var/www/fernand/env/lib/python3.9/site-packages/watchdog/events.py", line 272, in dispatch
{
File "/var/www/fernand/www/surge/server/auto_reload.py", line 14, in on_modified
self.app.manager.restart()
File "/var/www/fernand/env/lib/python3.9/site-packages/sanic/worker/manager.py", line 131, in restart
process.restart(restart_order=restart_order, **kwargs)
File "/var/www/fernand/env/lib/python3.9/site-packages/sanic/worker/process.py", line 100, in restart
self.spawn()
File "/var/www/fernand/env/lib/python3.9/site-packages/sanic/worker/process.py", line 123, in spawn
raise Exception("Cannot spawn a worker process until it is idle.")
Exception: Cannot spawn a worker process until it is idle.
In the stdout log file from Supervisor, I have the following:
[2023-03-09 19:48:11 +0000] [755976] [INFO] Stopping worker [755976]
[2023-03-09 19:48:15 +0000] [756031] [INFO] connection open
[2023-03-09 19:48:15 +0000] [756031] [INFO] Starting worker [756031]
[2023-03-09 20:46:55 +0000] [756031] [INFO] connection open
[2023-03-09 20:48:10 +0000] [756031] [INFO] connection open
[2023-03-09 20:52:27 +0000] [756031] [INFO] connection open
[2023-03-09 21:44:20 +0000] [756031] [INFO] connection open
[2023-03-09 21:59:54 +0000] [756031] [INFO] connection open
[2023-03-09 22:49:01 +0000] [756031] [INFO] connection open
[2023-03-09 23:05:26 +0000] [756031] [INFO] connection open
[2023-03-09 23:54:54 +0000] [756031] [INFO] connection open
[2023-03-10 00:12:45 +0000] [756031] [INFO] connection open
[2023-03-10 01:02:13 +0000] [756031] [INFO] connection open
[2023-03-10 01:18:17 +0000] [756031] [INFO] connection open
[2023-03-10 02:09:45 +0000] [756031] [INFO] connection open
[2023-03-10 02:09:50 +0000] [756031] [INFO] connection open
[2023-03-10 02:59:26 +0000] [756031] [INFO] connection open
[2023-03-10 02:59:28 +0000] [756031] [INFO] connection open
[2023-03-10 03:39:44 +0000] [756031] [INFO] connection open
[2023-03-10 03:39:49 +0000] [756031] [INFO] connection open
[2023-03-10 04:30:04 +0000] [756031] [INFO] connection open
[2023-03-10 05:18:41 +0000] [756031] [INFO] connection open
[2023-03-10 05:35:36 +0000] [756031] [INFO] connection open
[2023-03-10 06:00:11 +0000] [756031] [INFO] connection open
[2023-03-10 06:16:39 +0000] [756031] [INFO] connection open
[2023-03-10 07:06:33 +0000] [756031] [INFO] connection open
[2023-03-10 07:24:13 +0000] [756031] [INFO] connection open
[2023-03-10 07:42:58 +0000] [756031] [INFO] connection open
[2023-03-10 07:43:04 +0000] [756031] [INFO] connection open
[2023-03-10 07:54:41 +0000] [756031] [INFO] connection open
[2023-03-10 07:54:47 +0000] [756031] [INFO] connection open
[2023-03-10 07:55:22 +0000] [756031] [INFO] connection open
[2023-03-10 08:05:33 +0000] [756031] [INFO] connection open
[2023-03-10 08:15:30 +0000] [756031] [INFO] Stopping worker [756031]
These are the last lines. So worker 756031 was stopped and, contrary to before, wasn't restarted (somehow).
I'm starting to think the issue is related to this:
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python3.9/threading.py", line 954, in _bootstrap_inner
self.run()
File "/var/www/fernand/env/lib/python3.9/site-packages/watchdog/observers/api.py", line 205, in run
self.dispatch_events(self.event_queue)
File "/var/www/fernand/env/lib/python3.9/site-packages/watchdog/observers/api.py", line 381, in dispatch_events
handler.dispatch(event)
File "/var/www/fernand/env/lib/python3.9/site-packages/watchdog/events.py", line 272, in dispatch
{
File "/var/www/fernand/www/surge/server/auto_reload.py", line 14, in on_modified
self.app.manager.restart()
File "/var/www/fernand/env/lib/python3.9/site-packages/sanic/worker/manager.py", line 131, in restart
process.restart(restart_order=restart_order, **kwargs)
File "/var/www/fernand/env/lib/python3.9/site-packages/sanic/worker/process.py", line 100, in restart
self.spawn()
File "/var/www/fernand/env/lib/python3.9/site-packages/sanic/worker/process.py", line 123, in spawn
raise Exception("Cannot spawn a worker process until it is idle.")
Exception: Cannot spawn a worker process until it is idle.
The process was restarted, but it failed because of the exception above, which put my server in a non-responding state.
To clarify before going into the next steps: the last four comments are related to the main code I'm running (in dev in this case).
At around the same time as the "[2023-03-10 08:15:30 +0000] [756031] [INFO] Stopping worker [756031]" line, I triggered a push on Github, which called the webhook to my other "deploy" code I shared initially.
As you can see in that code, it touches the "reload" file. When the "reload" file is touched, the following code is triggered:
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer
import os


class ModificationEventLoader(FileSystemEventHandler):
    def __init__(self, app):
        self.app = app

    def on_modified(self, event):
        if event.is_directory:
            return
        self.app.manager.restart()


class AutoReload:
    """Watches the "reload" file for any changes and triggers a reload of Sanic"""

    def __init__(self, app):
        self.app = None
        self.observer = None
        self.reload_file_path = os.path.join(app.ctx.root_path, 'reload')

        if not os.path.isfile(self.reload_file_path):
            # We print here because the app was not instantiated!
            print('Missing reload file. Auto reload will be disabled')
            self.reload_file_path = None
            return

        app.register_listener(self.listen, 'main_process_ready')
        app.register_listener(self.stop, 'main_process_stop')

    def listen(self, app):
        if self.reload_file_path:
            self.observer = Observer()
            self.observer.schedule(ModificationEventLoader(app), self.reload_file_path, recursive=False)
            self.observer.start()

    def stop(self, app, loop):
        """Stops the observer, if initiated"""
        if self.observer:
            self.observer.stop()
            self.observer.join()
The code was discussed here: https://github.com/sanic-org/sanic/issues/2557
I have 89 TCP connections to Sanic in CLOSE_WAIT:
$> netstat -ntap | grep 5000
tcp 101 0 127.0.0.1:5000 0.0.0.0:* LISTEN 754265/python
tcp 757 0 127.0.0.1:5000 127.0.0.1:39514 CLOSE_WAIT -
tcp 3834 0 127.0.0.1:5000 127.0.0.1:37230 CLOSE_WAIT -
tcp 20653 0 127.0.0.1:5000 127.0.0.1:49348 CLOSE_WAIT -
tcp 3692 0 127.0.0.1:5000 127.0.0.1:43332 CLOSE_WAIT -
tcp 7990 0 127.0.0.1:5000 127.0.0.1:53790 CLOSE_WAIT -
tcp 4651 0 127.0.0.1:5000 127.0.0.1:46982 CLOSE_WAIT -
tcp 3827 0 127.0.0.1:5000 127.0.0.1:48450 CLOSE_WAIT -
tcp 3659 0 127.0.0.1:5000 127.0.0.1:37000 CLOSE_WAIT -
tcp 4637 0 127.0.0.1:5000 127.0.0.1:38572 CLOSE_WAIT -
tcp 20653 0 127.0.0.1:5000 127.0.0.1:41794 CLOSE_WAIT -
tcp 3705 0 127.0.0.1:5000 127.0.0.1:35350 CLOSE_WAIT -
tcp 757 0 127.0.0.1:5000 127.0.0.1:54464 CLOSE_WAIT -
tcp 4641 0 127.0.0.1:5000 127.0.0.1:34316 CLOSE_WAIT -
tcp 5651 0 127.0.0.1:5000 127.0.0.1:41714 CLOSE_WAIT -
tcp 9009 0 127.0.0.1:5000 127.0.0.1:33474 CLOSE_WAIT -
tcp 3666 0 127.0.0.1:5000 127.0.0.1:60606 CLOSE_WAIT -
tcp 757 0 127.0.0.1:5000 127.0.0.1:36306 CLOSE_WAIT -
tcp 5755 0 127.0.0.1:5000 127.0.0.1:54454 CLOSE_WAIT -
tcp 3738 0 127.0.0.1:5000 127.0.0.1:48412 CLOSE_WAIT -
tcp 8008 0 127.0.0.1:5000 127.0.0.1:35736 CLOSE_WAIT -
tcp 5458 0 127.0.0.1:5000 127.0.0.1:49398 CLOSE_WAIT -
tcp 3710 0 127.0.0.1:5000 127.0.0.1:51816 CLOSE_WAIT -
tcp 4796 0 127.0.0.1:5000 127.0.0.1:50146 CLOSE_WAIT -
tcp 4391 0 127.0.0.1:5000 127.0.0.1:50022 CLOSE_WAIT -
tcp 3738 0 127.0.0.1:5000 127.0.0.1:57804 CLOSE_WAIT -
tcp 757 0 127.0.0.1:5000 127.0.0.1:35564 CLOSE_WAIT -
tcp 3709 0 127.0.0.1:5000 127.0.0.1:36954 CLOSE_WAIT -
tcp 4641 0 127.0.0.1:5000 127.0.0.1:51820 CLOSE_WAIT -
tcp 3830 0 127.0.0.1:5000 127.0.0.1:33918 CLOSE_WAIT -
tcp 4634 0 127.0.0.1:5000 127.0.0.1:53772 CLOSE_WAIT -
tcp 757 0 127.0.0.1:5000 127.0.0.1:43962 CLOSE_WAIT -
tcp 9009 0 127.0.0.1:5000 127.0.0.1:54768 CLOSE_WAIT -
tcp 9009 0 127.0.0.1:5000 127.0.0.1:48922 CLOSE_WAIT -
tcp 3806 0 127.0.0.1:5000 127.0.0.1:39024 CLOSE_WAIT -
tcp 3658 0 127.0.0.1:5000 127.0.0.1:39088 CLOSE_WAIT -
tcp 4637 0 127.0.0.1:5000 127.0.0.1:45232 CLOSE_WAIT -
tcp 3833 0 127.0.0.1:5000 127.0.0.1:44096 CLOSE_WAIT -
tcp 5617 0 127.0.0.1:5000 127.0.0.1:44116 CLOSE_WAIT -
tcp 9009 0 127.0.0.1:5000 127.0.0.1:49438 CLOSE_WAIT -
tcp 3956 0 127.0.0.1:5000 127.0.0.1:35344 CLOSE_WAIT -
tcp 3720 0 127.0.0.1:5000 127.0.0.1:46970 CLOSE_WAIT -
tcp 757 0 127.0.0.1:5000 127.0.0.1:51446 CLOSE_WAIT -
tcp 5688 0 127.0.0.1:5000 127.0.0.1:45252 CLOSE_WAIT -
tcp 9009 0 127.0.0.1:5000 127.0.0.1:55490 CLOSE_WAIT -
tcp 3666 0 127.0.0.1:5000 127.0.0.1:60590 CLOSE_WAIT -
tcp 9009 0 127.0.0.1:5000 127.0.0.1:34804 CLOSE_WAIT -
tcp 3719 0 127.0.0.1:5000 127.0.0.1:46728 CLOSE_WAIT -
tcp 5378 0 127.0.0.1:5000 127.0.0.1:57188 CLOSE_WAIT -
tcp 3717 0 127.0.0.1:5000 127.0.0.1:41702 CLOSE_WAIT -
tcp 3658 0 127.0.0.1:5000 127.0.0.1:60870 CLOSE_WAIT -
tcp 4426 0 127.0.0.1:5000 127.0.0.1:46600 CLOSE_WAIT -
tcp 5473 0 127.0.0.1:5000 127.0.0.1:51112 CLOSE_WAIT -
tcp 3626 0 127.0.0.1:5000 127.0.0.1:38656 CLOSE_WAIT -
tcp 3666 0 127.0.0.1:5000 127.0.0.1:60600 CLOSE_WAIT -
tcp 5689 0 127.0.0.1:5000 127.0.0.1:60662 CLOSE_WAIT -
tcp 7991 0 127.0.0.1:5000 127.0.0.1:51826 CLOSE_WAIT -
tcp 4650 0 127.0.0.1:5000 127.0.0.1:55476 CLOSE_WAIT -
tcp 7991 0 127.0.0.1:5000 127.0.0.1:55264 CLOSE_WAIT -
tcp 3710 0 127.0.0.1:5000 127.0.0.1:34312 CLOSE_WAIT -
tcp 6176 0 127.0.0.1:5000 127.0.0.1:51116 CLOSE_WAIT -
tcp 3659 0 127.0.0.1:5000 127.0.0.1:53654 CLOSE_WAIT -
tcp 4640 0 127.0.0.1:5000 127.0.0.1:57776 CLOSE_WAIT -
tcp 4426 0 127.0.0.1:5000 127.0.0.1:52996 CLOSE_WAIT -
tcp 20653 0 127.0.0.1:5000 127.0.0.1:43072 CLOSE_WAIT -
tcp 3703 0 127.0.0.1:5000 127.0.0.1:53770 CLOSE_WAIT -
tcp 7957 0 127.0.0.1:5000 127.0.0.1:35334 CLOSE_WAIT -
tcp 757 0 127.0.0.1:5000 127.0.0.1:56958 CLOSE_WAIT -
tcp 3659 0 127.0.0.1:5000 127.0.0.1:37838 CLOSE_WAIT -
tcp 4484 0 127.0.0.1:5000 127.0.0.1:57854 CLOSE_WAIT -
tcp 5618 0 127.0.0.1:5000 127.0.0.1:47650 CLOSE_WAIT -
tcp 4483 0 127.0.0.1:5000 127.0.0.1:48052 CLOSE_WAIT -
tcp 3672 0 127.0.0.1:5000 127.0.0.1:45472 CLOSE_WAIT -
tcp 20653 0 127.0.0.1:5000 127.0.0.1:55744 CLOSE_WAIT -
tcp 6206 0 127.0.0.1:5000 127.0.0.1:59702 CLOSE_WAIT -
tcp 3705 0 127.0.0.1:5000 127.0.0.1:45218 CLOSE_WAIT -
tcp 3627 0 127.0.0.1:5000 127.0.0.1:38640 CLOSE_WAIT -
tcp 6350 0 127.0.0.1:5000 127.0.0.1:47006 CLOSE_WAIT -
tcp 757 0 127.0.0.1:5000 127.0.0.1:46942 CLOSE_WAIT -
tcp 3659 0 127.0.0.1:5000 127.0.0.1:33784 CLOSE_WAIT -
tcp 3738 0 127.0.0.1:5000 127.0.0.1:43836 CLOSE_WAIT -
tcp 757 0 127.0.0.1:5000 127.0.0.1:35404 CLOSE_WAIT -
tcp 5689 0 127.0.0.1:5000 127.0.0.1:45352 CLOSE_WAIT -
tcp 3837 0 127.0.0.1:5000 127.0.0.1:47204 CLOSE_WAIT -
tcp 9009 0 127.0.0.1:5000 127.0.0.1:33794 CLOSE_WAIT -
tcp 3659 0 127.0.0.1:5000 127.0.0.1:41272 CLOSE_WAIT -
tcp 6111 0 127.0.0.1:5000 127.0.0.1:46084 CLOSE_WAIT -
The "stopped process (756031) doesn't seems to have any socket opened:
$> lsof -i -P |grep 756031 # returns nothing
$> netstat --all --program | grep '756031' # returns nothing
Starting from the Supervisor $PID, here's the process tree (with useless parts removed):
$> ps --forest -o pid,tty,stat,time,cmd -g 23953
PID TT STAT TIME CMD
23953 ? Ss 00:27:16 /usr/bin/python3 /usr/bin/supervisord -n -c /etc/supervisor/supervisord.conf
658082 ? S 01:14:52 \_ /var/www/fernand/env/bin/python /var/www/fernand/www/scripts/server/deploy.py
658084 ? Sl 00:52:59 | \_ /var/www/fernand/env/bin/python /var/www/fernand/www/scripts/server/deploy.py
658090 ? S 00:00:00 | \_ /var/www/fernand/env/bin/python -c from multiprocessing.resource_tracker import main;main(15)
658091 ? Z 00:00:03 | \_ [python] <defunct>
658092 ? S 01:50:09 | \_ /var/www/fernand/env/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=16, pipe_handle=20) --multiprocessing-fork
761850 ? S 00:00:00 | \_ /var/www/fernand/env/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=16, pipe_handle=22) --multiprocessing-fork
754265 ? Sl 00:07:14 \_ /var/www/fernand/env/bin/python /var/www/fernand/www/server.py --prod
754268 ? Sl 00:05:08 | \_ /var/www/fernand/env/bin/python /var/www/fernand/www/server.py --prod
754277 ? S 00:00:00 | \_ /var/www/fernand/env/bin/python -c from multiprocessing.resource_tracker import main;main(16)
754279 ? S 00:00:03 | \_ /var/www/fernand/env/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=17, pipe_handle=21) --multiprocessing-fork
755976 ? Z 00:00:02 | \_ [python] <defunct>
756031 ? Z 00:00:18 | \_ [python] <defunct>
The last line is interesting! The Python script is in <defunct> state for the process that failed to stop properly (error above). We can also see that processes 755976 and 658091 are also defunct.
Triggering a reload via the inspector seems to do the job:
curl -X POST http://127.0.0.1:6457/reload
And in the logs:
[2023-03-10 14:07:08 +0000] [754279] [INFO] Incoming inspector action: reload
[2023-03-10 14:07:09 +0000] [765435] [INFO] connection open
[2023-03-10 14:07:10 +0000] [765435] [INFO] connection open
[2023-03-10 14:07:11 +0000] [765435] [INFO] connection open
[2023-03-10 14:07:11 +0000] [765435] [INFO] connection open
And the new state from the inspector:
{
"meta": {
"action": "info"
},
"result": {
"info": {
"mode": "production, goin' fast single worker",
"server": "unknown",
"python": "3.9.2",
"platform": "Linux-5.10.0-19-cloud-amd64-x86_64-with-glibc2.31",
"packages": ["sanic==22.12.0", "sanic-routing==22.8.0"],
"extra": {}
},
"workers": {
"Sanic-Main": {
"pid": 754265
},
"Sanic-Server-0-0": {
"server": true,
"state": "ACKED",
"pid": 765435,
"start_at": "2023-03-09T17:09:17.486716+00:00",
"starts": 7,
"restart_at": "2023-03-10T14:07:08.334144+00:00"
},
"Sanic-Inspector-0": {
"server": false,
"state": "STARTED",
"pid": 754279,
"start_at": "2023-03-09T17:09:17.503364+00:00",
"starts": 1
}
}
}
}
The server has restarted properly like this.
Wow, lots of good info to dig into. I won't have a chance until Sunday, but I'll take a look then.
If that can help, I'm glad it's helpful.
On my side, what I did was remove the watchdog waiting on the "reload" file and trigger a reload directly via the INSPECTOR. The "deploy" script now sends a POST request to http://localhost:6457/reload instead of touching the reload file.
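For reference, the reload step now looks roughly like this (a sketch, not the exact code; error handling omitted):

import urllib.request


def trigger_reload() -> bool:
    # Ask the Sanic Inspector (listening locally on port 6457) to restart the workers.
    req = urllib.request.Request("http://localhost:6457/reload", method="POST")
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status == 200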
I would be curious if you continue to see the problem after changing to the INSPECTOR. Just a quick word of caution since it sounds like you are in a production environment: if it is a publicly accessible URL, make sure to protect it.
From my understanding, the default host for the inspector is set to 127.0.0.1. I think it shouldn't be available from outside if that's the case.
Correct.
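For completeness, a minimal sketch of keeping the inspector on loopback; the INSPECTOR_HOST/INSPECTOR_PORT names are assumptions, so double-check them against the docs for your Sanic version:

from sanic import Sanic

app = Sanic("TestApp")
app.config.INSPECTOR = True
# Assumed config names; the defaults should already be localhost:6457.
app.config.INSPECTOR_HOST = "127.0.0.1"
app.config.INSPECTOR_PORT = 6457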
Is there an existing issue for this?
Describe the bug
I made a basic Sanic instance that only listens to Github webhooks to trigger an update. Here's the code:
https://gist.github.com/cnicodeme/37093f069822bef0dd1883c25fb54372
I'm running this code with supervisor with the following configuration:
With both Sanic v22.9.1 and v22.12.0, the server stops responding at some point. Sending a request doesn't return any response.
Unfortunately, I cannot give many more details about this because, since the server hangs, there are no proper errors returned or anything.
I'm happy to test anything suggested when the server hangs again. What I'm hoping here is to discover that this is somehow a known issue (I haven't found anything like it in the existing tickets) and that an easy fix is available.
The above code is a basic server having this issue, but I'm facing the same problem with my real server too.
Code snippet
No response
Expected Behavior
No response
How do you run Sanic?
As a script (app.run or Sanic.serve)
Operating System
Linux Debian 11
Sanic Version
22.12.0
Additional context
No response