sanic-org / sanic


Sanic instances seem to hang forever when run for too long #2708

Closed cnicodeme closed 1 year ago

cnicodeme commented 1 year ago

Is there an existing issue for this?

Describe the bug

I made a basic Sanic instance that only listens for GitHub webhooks to trigger an update. Here's the code:

https://gist.github.com/cnicodeme/37093f069822bef0dd1883c25fb54372

I'm running this code with supervisor with the following configuration:

[program:deploy]
user = me
command=/var/www/project/env/bin/python /var/www/project/www/deploy.py
autostart=true
autorestart=true
stopsignal=INT
startsecs=2
stopwaitsecs=5

With both Sanic v22.9.1 and v22.12.0, the server stops responding at some point. Sending a request doesn't return any response.

Unfortunately, I can't give many more details, because once the server hangs there are no proper errors returned or anything.

I'm happy to test anything suggested when the server hangs again. What I'm hoping is that this turns out to be a known issue (I haven't found anything like it in the existing tickets) and that an easy fix is available.

The code above is a basic server that exhibits the issue, but I'm facing the same problem with my real server too.

Code snippet

No response

Expected Behavior

No response

How do you run Sanic?

As a script (app.run or Sanic.serve)

Operating System

Linux Debian 11

Sanic Version

22.12.0

Additional context

No response

Tronic commented 1 year ago

You are making subprocess calls that block execution. With an async server like Sanic you should await any call that may take time to finish, and you cannot do that with subprocess.call.

You will need an alternative way of running your commands using async/await.

cnicodeme commented 1 year ago

Do you have any suggestions on how to make these calls?

cnicodeme commented 1 year ago

Also, on my other server running Sanic that freezes, there are no calls to subprocess.call, but it's more complex (websockets, database calls, Redis and the like).

Tronic commented 1 year ago

One alternative: https://docs.python.org/3/library/asyncio-subprocess.html

Similarly with Redis and any other database: make sure that all your calls are asynchronous. Some examples with aioredis: https://aioredis.readthedocs.io/en/latest/getting-started/#response-decoding
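
For illustration, a minimal sketch of the asyncio subprocess alternative could look like the following (the command here is only a placeholder, not your actual deploy step):

import asyncio

async def run_command():
    # Placeholder command; stdout/stderr are captured so they can be logged.
    proc = await asyncio.create_subprocess_shell(
        "git pull",
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    # communicate() is awaited, so the event loop keeps serving other
    # requests while the command runs.
    stdout, stderr = await proc.communicate()
    return proc.returncode, stdout, stderr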

Tronic commented 1 year ago

Btw, I used Bing Chat to find these answers quickly. I suggest you get ChatGPT, Bing Chat, or some other chatbot for quick answers to difficult problems like this. Ask them for an async/await alternative to what you are currently using. ChatGPT might not be ideal because it doesn't know about newer libraries (its knowledge is cut off at 2021), so you.com, GitHub Copilot, or Bing Chat might be preferable.

cnicodeme commented 1 year ago

Thank you, I'll try. But I'm not sure this will resolve the issue (I've been experiencing this "hanging" behavior across many Sanic instances on various servers).

Tronic commented 1 year ago

It will hang if you do anything that takes time without awaiting the blocking call (database queries, HTTP requests, file I/O, and running external commands are prime examples). If you have a lengthy calculation, or no library that supports async/await is available, you can use a thread for it:

await asyncio.get_event_loop().run_in_executor(None, your_function, args...)
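
A more complete, self-contained sketch of that pattern inside a Sanic handler might look like this (the route and blocking_work function are made up for illustration):

import asyncio

from sanic import Request, Sanic, json

app = Sanic("ExecutorExample")

def blocking_work() -> int:
    # Stand-in for a lengthy calculation or a blocking library call.
    return 42

@app.get("/compute")
async def compute(request: Request):
    loop = asyncio.get_event_loop()
    # Run the blocking function in the default thread pool executor so the
    # event loop stays free to serve other requests.
    result = await loop.run_in_executor(None, blocking_work)
    return json({"result": result})
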
ahopkins commented 1 year ago

Yes, using asyncio subprocess is better because it will yield back to the loop. But that alone is not necessarily enough. If you have nothing else to yield to and you're awaiting that response, it is still effectively blocking. This is why I/O is best handled in loops with timeouts, etc.

If you are making lengthy subprocess calls, I suggest another pattern: push the work off into the background where it can be checked in a loop, or send it to another process.
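
One way to sketch that "push it to the background" pattern is with Sanic's add_task, so the handler responds immediately and the slow work continues in the background (the /deploy route and long_running_job coroutine below are hypothetical):

import asyncio

from sanic import Request, Sanic, json

app = Sanic("BackgroundExample")

async def long_running_job():
    # Stand-in for the slow work, e.g. an awaited asyncio subprocess call.
    await asyncio.sleep(30)

@app.post("/deploy")
async def deploy(request: Request):
    # Queue the slow work as a background task and answer right away
    # instead of awaiting it inside the request handler.
    request.app.add_task(long_running_job())
    return json({"status": "accepted"}, status=202)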

cnicodeme commented 1 year ago

One thing that might be worth sharing: the script above runs well and responds to requests properly.

Then, a few weeks later (yes, weeks), the script stops responding.

I can imagine that calling a subprocess and then trying to stop the code right after might cause it to not respond, but once the command has finished, the server should behave correctly, no? So sending another POST request two weeks later should work?

(Or does calling the subprocess modify the state in a way that makes subsequent requests fail, no matter when?)

ahopkins commented 1 year ago

Wait, are you trying to live update a running instance? Or, is that meant to be updating a different script/env/application?

ahopkins commented 1 year ago

So it sounds like you are not saying that the subprocess call is blocking other requests from coming in, but rather that the application is just dead?

ahopkins commented 1 year ago

There is no inherent problem in using subprocess.call (whether the regular or async version). To prove the point:

import subprocess
import sys

from sanic import Request, Sanic, json

app = Sanic("TestApp")

@app.get("/")
async def handler(request: Request):
    assert (
        subprocess.call(
            "ls -l",
            shell=True,
            cwd="./",
            stderr=sys.stdout,
            stdout=sys.stdout,
        )
        == 0
    ), "Unable to show the directory"
    return json({"foo": "bar"})
cnicodeme commented 1 year ago

Yes, the application is dead, but not immediately; rather after a few weeks (whether or not it is being used).

I'll try to be as verbose as possible to help.

I'm using Sanic on two projects, GetFernand.com and PDFShift.io. On Fernand, I also use Websockets. Both are using SQLAlchemy in async mode, and Redis (in case that helps).

For both of these services, I've also created the "server.py" script above, which is completely independent (but uses the same virtual environment). The purpose of this server.py script is to update the packages, update the database, update the code and trigger a reload whenever necessary. (If "requirements.txt" has changed, we trigger a package update; if there are new files in the migrations folder, we trigger a database update; etc.)

The code works fine via GitHub, and I can trigger a few push notifications without any problem. But after a few weeks (generally around two weeks), every Sanic instance stops responding.

I've shared the server.py script because I have the same issue with it, and it's a pretty basic one. I do have the same issue with the main applications at Fernand and PDFShift, but their code is way more complex, so it would be more problematic to debug.

PDFShift & Fernand are on different servers (all Debian, mostly 11). Fernand is on v22.12 while PDFShift is on v22.9.

I recently created a copy of the Fernand code to run on another server for our demo. In that case, the Redis and MySQL servers are both local. Every once in a while, the server stops accepting requests and, again, Sanic is frozen.

I have a Caddy server in front that dispatches to Sanic. Caddy works, and its logs show a 502 when trying to reach the Sanic server.

I hope this adds some useful detail. As I've said, I unfortunately have no output from Sanic showing a specific error that could explain it. I've been digging through the logs with no luck.

ahopkins commented 1 year ago

Do you have anything that is checking for health? Pinging the application every X seconds for a response.
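
(For reference, a health check can be as simple as an external loop that hits an endpoint and logs failures; a minimal sketch, where the /ping URL and 30-second interval are assumptions rather than part of the existing setup:)

import time
import urllib.request

# Minimal external health-check loop (sketch only). URL and interval are
# placeholders; adjust to the actual deployment.
while True:
    try:
        with urllib.request.urlopen("http://127.0.0.1:5000/ping", timeout=5) as resp:
            print("alive:", resp.status)
    except Exception as exc:
        print("unreachable:", exc)
    time.sleep(30)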

ahopkins commented 1 year ago

I wonder if this would be helpful: https://sanic.dev/en/plugins/sanic-ext/health-monitor.html

ahopkins commented 1 year ago

Also, when the app is dead, does that just mean it is no longer accepting external requests? What if you SSH into the machine and try pinging it directly?

cnicodeme commented 1 year ago

I just enabled access_log to see if Caddy is pinging the app, and it's not the case (the access log is properly updated: if I load a page manually, it shows the request). So I'd say there is no health check from Caddy.

(There is no health check at all for the above app, not even an external one or a script.)

When the app is dead, the server itself still responds and I can SSH into it, but I once tried a cURL command and it failed. (I can't recall whether it failed after a timeout or immediately, but I do recall there were no errors written and no content returned.)


Also, one of my big applications runs websockets, and I noticed that one of the dev team had somehow kept their tab open, so a request was made roughly every 100 seconds to the wss endpoint. The response returned 0 bytes, but at some point it started failing and returned a 502 (from Caddy).

Again, looking at the logs shows nothing, but it seems that the server stops accepting requests at some point.

Also, after restarting the server, the websocket endpoint doesn't respond for a while. From what I was able to find, it was still processing previous pending requests.


I gotta say this is kind of blurry to me and I don't have a clear pattern I can give.

cnicodeme commented 1 year ago

One of my servers just went down. I'll update here with my findings, but if you want me to run some commands, let me know!

cnicodeme commented 1 year ago

OK, first: when trying to load my /ping endpoint, it takes around 3 seconds and then fails. Here's the log from Caddy:

{
    "level": "error",
    "ts": 1678454660.1380699,
    "logger": "http.log.access.log0",
    "msg": "handled request",
    "request": {
        "remote_ip": "my_ip_v6",
        "remote_port": "56283",
        "proto": "HTTP/3.0",
        "method": "GET",
        "host": "my.server.dev",
        "uri": "/ping",
        "headers": {
            "Dnt": ["1"],
            "Sec-Fetch-Mode": ["navigate"],
            "Sec-Fetch-Site": ["none"],
            "Sec-Fetch-User": ["?1"],
            "Accept-Encoding": ["gzip, deflate, br"],
            "Sec-Ch-Ua": ["\"Chromium\";v=\"110\", \"Not A(Brand\";v=\"24\", \"Brave\";v=\"110\""],
            "Upgrade-Insecure-Requests": ["1"],
            "User-Agent": ["Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"],
            "Sec-Gpc": ["1"],
            "Sec-Fetch-Dest": ["document"],
            "Sec-Ch-Ua-Mobile": ["?0"],
            "Sec-Ch-Ua-Platform": ["\"Linux\""],
            "Accept": ["text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8"],
            "Accept-Language": ["fr-FR,fr;q=0.9"]
        },
        "tls": {
            "resumed": false,
            "version": 0,
            "cipher_suite": 0,
            "proto": "",
            "server_name": ""
        }
    },
    "user_id": "",
    "duration": 3.001652863,
    "size": 0,
    "status": 502,
    "resp_headers": {
        "Server": ["Caddy"]
    }
}

Note the relevant parts at the end: "duration": 3.001652863, "size": 0, "status": 502.

The local server is running on port 5000. On the server, I'm running curl --trace - "http://127.0.0.1:5000/ping". Here's the output:

== Info:   Trying 127.0.0.1:5000...

That's all. I'm guessing the 3s from Caddy is its timeout for connecting to the server.

Good news: the Sanic Inspector is running! (I had enabled it the last time this happened.)

#> curl http://localhost:6457
{
    "meta": {
        "action": "info"
    },
    "result": {
        "info": {
            "mode": "production, goin' fast single worker",
            "server": "unknown",
            "python": "3.9.2",
            "platform": "Linux-5.10.0-19-cloud-amd64-x86_64-with-glibc2.31",
            "packages": ["sanic==22.12.0", "sanic-routing==22.8.0"],
            "extra": {}
        },
        "workers": {
            "Sanic-Main": {
                "pid": 754265
            },
            "Sanic-Server-0-0": {
                "server": true,
                "state": "ACKED",
                "pid": 756031,
                "start_at": "2023-03-09T17:09:17.486716+00:00",
                "starts": 6,
                "restart_at": "2023-03-09T19:48:11.687590+00:00"
            },
            "Sanic-Inspector-0": {
                "server": false,
                "state": "STARTED",
                "pid": 754279,
                "start_at": "2023-03-09T17:09:17.503364+00:00",
                "starts": 1
            }
        }
    }
}

(I don't know if that helps)

I could try a POST /reload, but first I want to wait in case you want me to try some things.

I'll continue to update this ticket along with my findings.

cnicodeme commented 1 year ago

Here are some errors I got in /var/log/supervisor/fernand-stderr-... :

[2023-03-09 17:09:08 +0000] [658080] [ERROR] Experienced exception while trying to serve
Traceback (most recent call last):
  File "/var/www/fernand/env/lib/python3.9/site-packages/sanic/mixins/startup.py", line 862, in serve
    manager.run()
  File "/var/www/fernand/env/lib/python3.9/site-packages/sanic/worker/manager.py", line 95, in run
    self.monitor()
  File "/var/www/fernand/env/lib/python3.9/site-packages/sanic/worker/manager.py", line 197, in monitor
    self._sync_states()
  File "/var/www/fernand/env/lib/python3.9/site-packages/sanic/worker/manager.py", line 288, in _sync_states
    state = self.worker_state[process.name].get("state")
  File "<string>", line 2, in __getitem__
  File "/usr/lib/python3.9/multiprocessing/managers.py", line 824, in _callmethod
    raise convert_to_error(kind, result)
KeyError: 'Sanic-Server-0-0'
Traceback (most recent call last):
  File "/var/www/fernand/www/server.py", line 47, in <module>
    app.run(**params)
  File "/var/www/fernand/env/lib/python3.9/site-packages/sanic/mixins/startup.py", line 209, in run
    serve(primary=self)  # type: ignore
  File "/var/www/fernand/env/lib/python3.9/site-packages/sanic/mixins/startup.py", line 862, in serve
    manager.run()
  File "/var/www/fernand/env/lib/python3.9/site-packages/sanic/worker/manager.py", line 95, in run
    self.monitor()
  File "/var/www/fernand/env/lib/python3.9/site-packages/sanic/worker/manager.py", line 197, in monitor
    self._sync_states()
  File "/var/www/fernand/env/lib/python3.9/site-packages/sanic/worker/manager.py", line 288, in _sync_states
    state = self.worker_state[process.name].get("state")
  File "<string>", line 2, in __getitem__
  File "/usr/lib/python3.9/multiprocessing/managers.py", line 824, in _callmethod
    raise convert_to_error(kind, result)
KeyError: 'Sanic-Server-0-0'
Task was destroyed but it is pending!
task: <Task pending name='Task-1965' coro=<WebsocketFrameAssembler.get() done, defined at /var/www/fernand/env/lib/python3.9/site-packages/sanic/server/websockets/frame.py:91> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f661a20f400>()]>>
Task was destroyed but it is pending!
task: <Task pending name='Task-1966' coro=<Event.wait() done, defined at /usr/lib/python3.9/asyncio/locks.py:213> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f661bd93940>()]> cb=[_release_waiter(<Future pendi...61a20f400>()]>)() at /usr/lib/python3.9/asyncio/tasks.py:416]>
Task was destroyed but it is pending!
task: <Task pending name='Task-112' coro=<WebsocketFrameAssembler.get() done, defined at /var/www/fernand/env/lib/python3.9/site-packages/sanic/server/websockets/frame.py:91> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7fe9fa7c2d30>()]>>
Task was destroyed but it is pending!
task: <Task pending name='Task-113' coro=<Event.wait() done, defined at /usr/lib/python3.9/asyncio/locks.py:213> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7fe9f8876b80>()]> cb=[_release_waiter(<Future pendi...9fa7c2d30>()]>)() at /usr/lib/python3.9/asyncio/tasks.py:416]>
[2023-03-09 20:45:48 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-09 20:45:48 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-09 20:47:40 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-09 20:47:40 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-09 20:52:52 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-09 21:44:45 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-09 22:00:19 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-09 22:49:26 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-09 23:05:51 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-09 23:55:19 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-10 00:13:10 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-10 01:02:38 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-10 01:18:42 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-10 02:10:15 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-10 02:59:53 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-10 03:40:14 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-10 04:30:29 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-10 05:19:06 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-10 05:36:01 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-10 06:00:36 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-10 06:17:04 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-10 07:06:58 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-10 07:24:38 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-10 07:55:12 +0000] [756031] [WARNING] Websocket timed out waiting for pong
[2023-03-10 08:05:58 +0000] [756031] [WARNING] Websocket timed out waiting for pong
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.9/threading.py", line 954, in _bootstrap_inner
    self.run()
  File "/var/www/fernand/env/lib/python3.9/site-packages/watchdog/observers/api.py", line 205, in run
    self.dispatch_events(self.event_queue)
  File "/var/www/fernand/env/lib/python3.9/site-packages/watchdog/observers/api.py", line 381, in dispatch_events
    handler.dispatch(event)
  File "/var/www/fernand/env/lib/python3.9/site-packages/watchdog/events.py", line 272, in dispatch
    {
  File "/var/www/fernand/www/surge/server/auto_reload.py", line 14, in on_modified
    self.app.manager.restart()
  File "/var/www/fernand/env/lib/python3.9/site-packages/sanic/worker/manager.py", line 131, in restart
    process.restart(restart_order=restart_order, **kwargs)
  File "/var/www/fernand/env/lib/python3.9/site-packages/sanic/worker/process.py", line 100, in restart
    self.spawn()
  File "/var/www/fernand/env/lib/python3.9/site-packages/sanic/worker/process.py", line 123, in spawn
    raise Exception("Cannot spawn a worker process until it is idle.")
Exception: Cannot spawn a worker process until it is idle.

In the stdout log file from Supervisor, there is this:

[2023-03-09 19:48:11 +0000] [755976] [INFO] Stopping worker [755976]
[2023-03-09 19:48:15 +0000] [756031] [INFO] connection open
[2023-03-09 19:48:15 +0000] [756031] [INFO] Starting worker [756031]
[2023-03-09 20:46:55 +0000] [756031] [INFO] connection open
[2023-03-09 20:48:10 +0000] [756031] [INFO] connection open
[2023-03-09 20:52:27 +0000] [756031] [INFO] connection open
[2023-03-09 21:44:20 +0000] [756031] [INFO] connection open
[2023-03-09 21:59:54 +0000] [756031] [INFO] connection open
[2023-03-09 22:49:01 +0000] [756031] [INFO] connection open
[2023-03-09 23:05:26 +0000] [756031] [INFO] connection open
[2023-03-09 23:54:54 +0000] [756031] [INFO] connection open
[2023-03-10 00:12:45 +0000] [756031] [INFO] connection open
[2023-03-10 01:02:13 +0000] [756031] [INFO] connection open
[2023-03-10 01:18:17 +0000] [756031] [INFO] connection open
[2023-03-10 02:09:45 +0000] [756031] [INFO] connection open
[2023-03-10 02:09:50 +0000] [756031] [INFO] connection open
[2023-03-10 02:59:26 +0000] [756031] [INFO] connection open
[2023-03-10 02:59:28 +0000] [756031] [INFO] connection open
[2023-03-10 03:39:44 +0000] [756031] [INFO] connection open
[2023-03-10 03:39:49 +0000] [756031] [INFO] connection open
[2023-03-10 04:30:04 +0000] [756031] [INFO] connection open
[2023-03-10 05:18:41 +0000] [756031] [INFO] connection open
[2023-03-10 05:35:36 +0000] [756031] [INFO] connection open
[2023-03-10 06:00:11 +0000] [756031] [INFO] connection open
[2023-03-10 06:16:39 +0000] [756031] [INFO] connection open
[2023-03-10 07:06:33 +0000] [756031] [INFO] connection open
[2023-03-10 07:24:13 +0000] [756031] [INFO] connection open
[2023-03-10 07:42:58 +0000] [756031] [INFO] connection open
[2023-03-10 07:43:04 +0000] [756031] [INFO] connection open
[2023-03-10 07:54:41 +0000] [756031] [INFO] connection open
[2023-03-10 07:54:47 +0000] [756031] [INFO] connection open
[2023-03-10 07:55:22 +0000] [756031] [INFO] connection open
[2023-03-10 08:05:33 +0000] [756031] [INFO] connection open
[2023-03-10 08:15:30 +0000] [756031] [INFO] Stopping worker [756031]

These are the last lines. So worker 756031 was stopped and, contrary to before, was somehow never restarted.

cnicodeme commented 1 year ago

I'm starting to think the issue is related to this:

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.9/threading.py", line 954, in _bootstrap_inner
    self.run()
  File "/var/www/fernand/env/lib/python3.9/site-packages/watchdog/observers/api.py", line 205, in run
    self.dispatch_events(self.event_queue)
  File "/var/www/fernand/env/lib/python3.9/site-packages/watchdog/observers/api.py", line 381, in dispatch_events
    handler.dispatch(event)
  File "/var/www/fernand/env/lib/python3.9/site-packages/watchdog/events.py", line 272, in dispatch
    {
  File "/var/www/fernand/www/surge/server/auto_reload.py", line 14, in on_modified
    self.app.manager.restart()
  File "/var/www/fernand/env/lib/python3.9/site-packages/sanic/worker/manager.py", line 131, in restart
    process.restart(restart_order=restart_order, **kwargs)
  File "/var/www/fernand/env/lib/python3.9/site-packages/sanic/worker/process.py", line 100, in restart
    self.spawn()
  File "/var/www/fernand/env/lib/python3.9/site-packages/sanic/worker/process.py", line 123, in spawn
    raise Exception("Cannot spawn a worker process until it is idle.")
Exception: Cannot spawn a worker process until it is idle.

The process was restarted, but the restart failed because of the exception above, and that put my server in a non-responding state.

cnicodeme commented 1 year ago

To clarify before going into the next steps: the last four comments relate to the main code I'm running (in dev, in this case).

At around the same time as the "[2023-03-10 08:15:30 +0000] [756031] [INFO] Stopping worker [756031]" line, I triggered a push on GitHub, which called the webhook on the other "deploy" code I shared initially.

As you can see in that code, it touches the "reload" file. When the "reload" file is touched, it triggers the following code:

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer
import os

class ModificationEventLoader(FileSystemEventHandler):
    def __init__(self, app):
        self.app = app

    def on_modified(self, event):
        if event.is_directory:
            return

        self.app.manager.restart()

class AutoReload:
    """Watches the "reload" file for any changes and triggers a reload of Sanic"""

    def __init__(self, app):
        self.app = None
        self.observer = None

        self.reload_file_path = os.path.join(app.ctx.root_path, 'reload')
        if not os.path.isfile(self.reload_file_path):
            # We print here because the app was not instantiated!
            print('Missing reload file. Auto reload will be disabled')
            self.reload_file_path = None
            return

        app.register_listener(self.listen, 'main_process_ready')
        app.register_listener(self.stop, 'main_process_stop')

    def listen(self, app):
        if self.reload_file_path:
            self.observer = Observer()
            self.observer.schedule(ModificationEventLoader(app), self.reload_file_path, recursive=False)
            self.observer.start()

    def stop(self, app, loop):
        """Stops the observer, if initiated"""
        if self.observer:
            self.observer.stop()
            self.observer.join()

The code was discussed here: https://github.com/sanic-org/sanic/issues/2557

cnicodeme commented 1 year ago

I have 89 TCP connections to Sanic stuck in CLOSE_WAIT:

$> netstat -ntap | grep 5000
tcp      101      0 127.0.0.1:5000          0.0.0.0:*               LISTEN      754265/python       
tcp      757      0 127.0.0.1:5000          127.0.0.1:39514         CLOSE_WAIT  -                   
tcp     3834      0 127.0.0.1:5000          127.0.0.1:37230         CLOSE_WAIT  -                   
tcp    20653      0 127.0.0.1:5000          127.0.0.1:49348         CLOSE_WAIT  -                   
tcp     3692      0 127.0.0.1:5000          127.0.0.1:43332         CLOSE_WAIT  -                   
tcp     7990      0 127.0.0.1:5000          127.0.0.1:53790         CLOSE_WAIT  -                   
tcp     4651      0 127.0.0.1:5000          127.0.0.1:46982         CLOSE_WAIT  -                   
tcp     3827      0 127.0.0.1:5000          127.0.0.1:48450         CLOSE_WAIT  -                   
tcp     3659      0 127.0.0.1:5000          127.0.0.1:37000         CLOSE_WAIT  -                   
tcp     4637      0 127.0.0.1:5000          127.0.0.1:38572         CLOSE_WAIT  -                   
tcp    20653      0 127.0.0.1:5000          127.0.0.1:41794         CLOSE_WAIT  -                   
tcp     3705      0 127.0.0.1:5000          127.0.0.1:35350         CLOSE_WAIT  -                   
tcp      757      0 127.0.0.1:5000          127.0.0.1:54464         CLOSE_WAIT  -                   
tcp     4641      0 127.0.0.1:5000          127.0.0.1:34316         CLOSE_WAIT  -                   
tcp     5651      0 127.0.0.1:5000          127.0.0.1:41714         CLOSE_WAIT  -                   
tcp     9009      0 127.0.0.1:5000          127.0.0.1:33474         CLOSE_WAIT  -                   
tcp     3666      0 127.0.0.1:5000          127.0.0.1:60606         CLOSE_WAIT  -                   
tcp      757      0 127.0.0.1:5000          127.0.0.1:36306         CLOSE_WAIT  -                   
tcp     5755      0 127.0.0.1:5000          127.0.0.1:54454         CLOSE_WAIT  -                   
tcp     3738      0 127.0.0.1:5000          127.0.0.1:48412         CLOSE_WAIT  -                   
tcp     8008      0 127.0.0.1:5000          127.0.0.1:35736         CLOSE_WAIT  -                   
tcp     5458      0 127.0.0.1:5000          127.0.0.1:49398         CLOSE_WAIT  -                   
tcp     3710      0 127.0.0.1:5000          127.0.0.1:51816         CLOSE_WAIT  -                   
tcp     4796      0 127.0.0.1:5000          127.0.0.1:50146         CLOSE_WAIT  -                   
tcp     4391      0 127.0.0.1:5000          127.0.0.1:50022         CLOSE_WAIT  -                   
tcp     3738      0 127.0.0.1:5000          127.0.0.1:57804         CLOSE_WAIT  -                   
tcp      757      0 127.0.0.1:5000          127.0.0.1:35564         CLOSE_WAIT  -                   
tcp     3709      0 127.0.0.1:5000          127.0.0.1:36954         CLOSE_WAIT  -                   
tcp     4641      0 127.0.0.1:5000          127.0.0.1:51820         CLOSE_WAIT  -                   
tcp     3830      0 127.0.0.1:5000          127.0.0.1:33918         CLOSE_WAIT  -                   
tcp     4634      0 127.0.0.1:5000          127.0.0.1:53772         CLOSE_WAIT  -                   
tcp      757      0 127.0.0.1:5000          127.0.0.1:43962         CLOSE_WAIT  -                   
tcp     9009      0 127.0.0.1:5000          127.0.0.1:54768         CLOSE_WAIT  -                   
tcp     9009      0 127.0.0.1:5000          127.0.0.1:48922         CLOSE_WAIT  -                   
tcp     3806      0 127.0.0.1:5000          127.0.0.1:39024         CLOSE_WAIT  -                   
tcp     3658      0 127.0.0.1:5000          127.0.0.1:39088         CLOSE_WAIT  -                   
tcp     4637      0 127.0.0.1:5000          127.0.0.1:45232         CLOSE_WAIT  -                   
tcp     3833      0 127.0.0.1:5000          127.0.0.1:44096         CLOSE_WAIT  -                   
tcp     5617      0 127.0.0.1:5000          127.0.0.1:44116         CLOSE_WAIT  -                   
tcp     9009      0 127.0.0.1:5000          127.0.0.1:49438         CLOSE_WAIT  -                   
tcp     3956      0 127.0.0.1:5000          127.0.0.1:35344         CLOSE_WAIT  -                   
tcp     3720      0 127.0.0.1:5000          127.0.0.1:46970         CLOSE_WAIT  -                   
tcp      757      0 127.0.0.1:5000          127.0.0.1:51446         CLOSE_WAIT  -                   
tcp     5688      0 127.0.0.1:5000          127.0.0.1:45252         CLOSE_WAIT  -                   
tcp     9009      0 127.0.0.1:5000          127.0.0.1:55490         CLOSE_WAIT  -                   
tcp     3666      0 127.0.0.1:5000          127.0.0.1:60590         CLOSE_WAIT  -                   
tcp     9009      0 127.0.0.1:5000          127.0.0.1:34804         CLOSE_WAIT  -                   
tcp     3719      0 127.0.0.1:5000          127.0.0.1:46728         CLOSE_WAIT  -                   
tcp     5378      0 127.0.0.1:5000          127.0.0.1:57188         CLOSE_WAIT  -                   
tcp     3717      0 127.0.0.1:5000          127.0.0.1:41702         CLOSE_WAIT  -                   
tcp     3658      0 127.0.0.1:5000          127.0.0.1:60870         CLOSE_WAIT  -                   
tcp     4426      0 127.0.0.1:5000          127.0.0.1:46600         CLOSE_WAIT  -                   
tcp     5473      0 127.0.0.1:5000          127.0.0.1:51112         CLOSE_WAIT  -                   
tcp     3626      0 127.0.0.1:5000          127.0.0.1:38656         CLOSE_WAIT  -                   
tcp     3666      0 127.0.0.1:5000          127.0.0.1:60600         CLOSE_WAIT  -                   
tcp     5689      0 127.0.0.1:5000          127.0.0.1:60662         CLOSE_WAIT  -                   
tcp     7991      0 127.0.0.1:5000          127.0.0.1:51826         CLOSE_WAIT  -                   
tcp     4650      0 127.0.0.1:5000          127.0.0.1:55476         CLOSE_WAIT  -                   
tcp     7991      0 127.0.0.1:5000          127.0.0.1:55264         CLOSE_WAIT  -                   
tcp     3710      0 127.0.0.1:5000          127.0.0.1:34312         CLOSE_WAIT  -                   
tcp     6176      0 127.0.0.1:5000          127.0.0.1:51116         CLOSE_WAIT  -                   
tcp     3659      0 127.0.0.1:5000          127.0.0.1:53654         CLOSE_WAIT  -                   
tcp     4640      0 127.0.0.1:5000          127.0.0.1:57776         CLOSE_WAIT  -                   
tcp     4426      0 127.0.0.1:5000          127.0.0.1:52996         CLOSE_WAIT  -                   
tcp    20653      0 127.0.0.1:5000          127.0.0.1:43072         CLOSE_WAIT  -                   
tcp     3703      0 127.0.0.1:5000          127.0.0.1:53770         CLOSE_WAIT  -                   
tcp     7957      0 127.0.0.1:5000          127.0.0.1:35334         CLOSE_WAIT  -                   
tcp      757      0 127.0.0.1:5000          127.0.0.1:56958         CLOSE_WAIT  -                   
tcp     3659      0 127.0.0.1:5000          127.0.0.1:37838         CLOSE_WAIT  -                   
tcp     4484      0 127.0.0.1:5000          127.0.0.1:57854         CLOSE_WAIT  -                   
tcp     5618      0 127.0.0.1:5000          127.0.0.1:47650         CLOSE_WAIT  -                   
tcp     4483      0 127.0.0.1:5000          127.0.0.1:48052         CLOSE_WAIT  -                   
tcp     3672      0 127.0.0.1:5000          127.0.0.1:45472         CLOSE_WAIT  -                   
tcp    20653      0 127.0.0.1:5000          127.0.0.1:55744         CLOSE_WAIT  -                   
tcp     6206      0 127.0.0.1:5000          127.0.0.1:59702         CLOSE_WAIT  -                   
tcp     3705      0 127.0.0.1:5000          127.0.0.1:45218         CLOSE_WAIT  -                   
tcp     3627      0 127.0.0.1:5000          127.0.0.1:38640         CLOSE_WAIT  -                   
tcp     6350      0 127.0.0.1:5000          127.0.0.1:47006         CLOSE_WAIT  -                   
tcp      757      0 127.0.0.1:5000          127.0.0.1:46942         CLOSE_WAIT  -                   
tcp     3659      0 127.0.0.1:5000          127.0.0.1:33784         CLOSE_WAIT  -                   
tcp     3738      0 127.0.0.1:5000          127.0.0.1:43836         CLOSE_WAIT  -                   
tcp      757      0 127.0.0.1:5000          127.0.0.1:35404         CLOSE_WAIT  -                   
tcp     5689      0 127.0.0.1:5000          127.0.0.1:45352         CLOSE_WAIT  -                   
tcp     3837      0 127.0.0.1:5000          127.0.0.1:47204         CLOSE_WAIT  -                   
tcp     9009      0 127.0.0.1:5000          127.0.0.1:33794         CLOSE_WAIT  -                   
tcp     3659      0 127.0.0.1:5000          127.0.0.1:41272         CLOSE_WAIT  -                   
tcp     6111      0 127.0.0.1:5000          127.0.0.1:46084         CLOSE_WAIT  -  

The "stopped process (756031) doesn't seems to have any socket opened:

$> lsof -i -P |grep 756031 # returns nothing
$> netstat --all --program | grep '756031' # returns nothing

Starting from the Supervisor PID, here's the process tree (with irrelevant parts removed):

$> ps --forest -o pid,tty,stat,time,cmd -g 23953
    PID TT       STAT     TIME CMD
  23953 ?        Ss   00:27:16 /usr/bin/python3 /usr/bin/supervisord -n -c /etc/supervisor/supervisord.conf
 658082 ?        S    01:14:52  \_ /var/www/fernand/env/bin/python /var/www/fernand/www/scripts/server/deploy.py
 658084 ?        Sl   00:52:59  |   \_ /var/www/fernand/env/bin/python /var/www/fernand/www/scripts/server/deploy.py
 658090 ?        S    00:00:00  |   \_ /var/www/fernand/env/bin/python -c from multiprocessing.resource_tracker import main;main(15)
 658091 ?        Z    00:00:03  |   \_ [python] <defunct>
 658092 ?        S    01:50:09  |   \_ /var/www/fernand/env/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=16, pipe_handle=20) --multiprocessing-fork
 761850 ?        S    00:00:00  |   \_ /var/www/fernand/env/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=16, pipe_handle=22) --multiprocessing-fork
 754265 ?        Sl   00:07:14  \_ /var/www/fernand/env/bin/python /var/www/fernand/www/server.py --prod
 754268 ?        Sl   00:05:08  |   \_ /var/www/fernand/env/bin/python /var/www/fernand/www/server.py --prod
 754277 ?        S    00:00:00  |   \_ /var/www/fernand/env/bin/python -c from multiprocessing.resource_tracker import main;main(16)
 754279 ?        S    00:00:03  |   \_ /var/www/fernand/env/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=17, pipe_handle=21) --multiprocessing-fork
 755976 ?        Z    00:00:02  |   \_ [python] <defunct>
 756031 ?        Z    00:00:18  |   \_ [python] <defunct>

The last line is interesting! The Python process that failed to stop properly (the error above) is in <defunct> state. We can also see that processes 755976 and 658091 are defunct as well.

cnicodeme commented 1 year ago

Triggering a reload via the inspector seems to do the job:

curl -X POST http://127.0.0.1:6457/reload

And in the logs:

[2023-03-10 14:07:08 +0000] [754279] [INFO] Incoming inspector action: reload
[2023-03-10 14:07:09 +0000] [765435] [INFO] connection open
[2023-03-10 14:07:10 +0000] [765435] [INFO] connection open
[2023-03-10 14:07:11 +0000] [765435] [INFO] connection open
[2023-03-10 14:07:11 +0000] [765435] [INFO] connection open

And the new state from the inspector:

{
    "meta": {
        "action": "info"
    },
    "result": {
        "info": {
            "mode": "production, goin' fast single worker",
            "server": "unknown",
            "python": "3.9.2",
            "platform": "Linux-5.10.0-19-cloud-amd64-x86_64-with-glibc2.31",
            "packages": ["sanic==22.12.0", "sanic-routing==22.8.0"],
            "extra": {}
        },
        "workers": {
            "Sanic-Main": {
                "pid": 754265
            },
            "Sanic-Server-0-0": {
                "server": true,
                "state": "ACKED",
                "pid": 765435,
                "start_at": "2023-03-09T17:09:17.486716+00:00",
                "starts": 7,
                "restart_at": "2023-03-10T14:07:08.334144+00:00"
            },
            "Sanic-Inspector-0": {
                "server": false,
                "state": "STARTED",
                "pid": 754279,
                "start_at": "2023-03-09T17:09:17.503364+00:00",
                "starts": 1
            }
        }
    }
}

The server restarted properly this way.

ahopkins commented 1 year ago

Wow, lots of good info to dig into. I won't have a chance until Sunday, but I'll take a look then.

cnicodeme commented 1 year ago

If that helps, I'm glad.

On my side, what I did was remove the watchdog watching the "reload" file and trigger a reload directly via the Inspector. The "deploy" script now sends a POST request to http://localhost:6457/reload instead of touching the reload file.
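
For what it's worth, the change in the deploy script boils down to something like this sketch (urllib is used here just for illustration; the real script may do it differently):

import urllib.request

# Ask the locally-bound Sanic Inspector to reload the workers by POSTing
# an empty body to its /reload endpoint (sketch only).
request = urllib.request.Request(
    "http://127.0.0.1:6457/reload", data=b"", method="POST"
)
with urllib.request.urlopen(request, timeout=5) as response:
    print(response.status)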

ahopkins commented 1 year ago

I would be curious whether you continue to see the problem after changing to the Inspector. Just a quick word of caution, since it sounds like you are in a production environment: if it is a publicly accessible URL, make sure to protect it.

cnicodeme commented 1 year ago

From my understanding, the default host for the inspector is set to 127.0.0.1. I think it shouldn't be available from outside if that's the case.

ahopkins commented 1 year ago

From my understanding, the default host for the inspector is set to 127.0.0.1. I think it shouldn't be available from outside if that's the case.

Correct.