unbit / uwsgi

uWSGI application server container
http://projects.unbit.it/uwsgi

uWSGI thinks worker is running but it is not #25

Closed: prymitive closed this issue 11 years ago

prymitive commented 11 years ago
{
    "cwd": "/home/app", 
    "daemons": [
        {
            "cmd": "/usr/sbin/dropbear -b /etc/dropbear/banner -w -s -j -k -p 5054 -E -F -I 300", 
            "pid": 19855, 
            "respawns": 447
        }
    ], 
    "gid": 33, 
    "listen_queue": 5, 
    "listen_queue_errors": 0, 
    "load": 5, 
    "locks": [
        {
            "user 0": 0
        }, 
        {
            "signal": 0
        }, 
        {
            "filemon": 0
        }, 
        {
            "timer": 0
        }, 
        {
            "probe": 0
        }, 
        {
            "rbtimer": 0
        }, 
        {
            "cron": 0
        }, 
        {
            "rpc": 0
        }
    ], 
    "pid": 1, 
    "uid": 33, 
    "version": "1.3.1-dev-a3e9a28", 
    "workers": [
        {
            "apps": [
                {
                    "chdir": "", 
                    "exceptions": 0, 
                    "id": 0, 
                    "modifier1": 7, 
                    "mountpoint": "", 
                    "requests": 5, 
                    "startup_time": 12
                }
            ], 
            "avg_rt": 39, 
            "cores": [
                {
                    "id": 0, 
                    "in_request": 0, 
                    "requests": 294
                }
            ], 
            "delta_requests": 17, 
            "exceptions": 0, 
            "harakiri_count": 0, 
            "id": 1, 
            "last_spawn": 1350706679, 
            "pid": 11360, 
            "requests": 294, 
            "respawn_count": 38, 
            "rss": 226373632, 
            "running_time": 628263, 
            "signals": 0, 
            "status": "cheap", 
            "tx": 5178, 
            "vsz": 380694528
        }, 
        {
            "apps": [
                {
                    "chdir": "", 
                    "exceptions": 0, 
                    "id": 0, 
                    "modifier1": 7, 
                    "mountpoint": "", 
                    "requests": 0, 
                    "startup_time": 12
                }
            ], 
            "avg_rt": 46, 
            "cores": [
                {
                    "id": 0, 
                    "in_request": 0, 
                    "requests": 244
                }
            ], 
            "delta_requests": 7, 
            "exceptions": 0, 
            "harakiri_count": 0, 
            "id": 2, 
            "last_spawn": 1350752068, 
            "pid": 18535, 
            "requests": 244, 
            "respawn_count": 20, 
            "rss": 226373632, 
            "running_time": 9405, 
            "signals": 0, 
            "status": "idle", 
            "tx": 0, 
            "vsz": 380694528
        }
    ]
}

So according to the stats I should have a worker process with pid 18535, but there is no such process. My requests are queuing up in the backlog, and the cheaper busyness algorithm will eventually spawn a new worker.

I have these messages in my logs:

Oct 20 18:45:05 localhost app: [uwsgi-daemons] throttling "/usr/sbin/dropbear -b /etc/dropbear/banner -w -s -j -k -p 5054 -E -F -I 300" for 441 seconds
Oct 20 18:52:26 localhost app: [uwsgi-daemons] respawning "/usr/sbin/dropbear -b /etc/dropbear/banner -w -s -j -k -p 5054 -E -F -I 300"
Oct 20 18:52:26 localhost app: [18370] Oct 20 18:52:26 Failed listening on '5054': Error listening: Address already in use
Oct 20 18:52:26 localhost app: [18370] Oct 20 18:52:26 premature exit: No listening ports available.
Oct 20 18:52:27 localhost app: [uwsgi-daemons] respawning "/usr/sbin/dropbear -b /etc/dropbear/banner -w -s -j -k -p 5054 -E -F -I 300"
Oct 20 18:52:27 localhost app: [18372] Oct 20 18:52:27 Failed listening on '5054': Error listening: Address already in use
Oct 20 18:52:27 localhost app: [18372] Oct 20 18:52:27 premature exit: No listening ports available.
Oct 20 18:52:28 localhost app: [uwsgi-daemons] throttling "/usr/sbin/dropbear -b /etc/dropbear/banner -w -s -j -k -p 5054 -E -F -I 300" for 443 seconds
Oct 20 18:54:28 localhost app: DAMN ! worker 2 (pid: 16100) died, killed by signal 9 :( trying respawn ...
Oct 20 18:54:28 localhost app: Respawned uWSGI worker 2 (new pid: 18535)
Oct 20 18:54:28 localhost app: [uwsgi-daemons] respawning "/usr/sbin/dropbear -b /etc/dropbear/banner -w -s -j -k -p 5054 -E -F -I 300"
Oct 20 18:54:28 localhost app: [18537] Oct 20 18:54:28 Not backgrounding
Oct 20 18:56:52 localhost app: announcing my loyalty to the Emperor...
Oct 20 19:17:47 localhost app: DAMN ! worker 2 (pid: 18535) died, killed by signal 9 :( trying respawn ...
Oct 20 19:17:47 localhost app: Respawned uWSGI worker 2 (new pid: 19852)
Oct 20 19:17:48 localhost app: [uwsgi-daemons] respawning "/usr/sbin/dropbear -b /etc/dropbear/banner -w -s -j -k -p 5054 -E -F -I 300"
Oct 20 19:17:48 localhost app: [19854] Oct 20 19:17:48 Failed listening on '5054': Error listening: Address already in use
Oct 20 19:17:48 localhost app: [19854] Oct 20 19:17:48 premature exit: No listening ports available.

But dmesg tells me that my workers are being killed by the OOM killer, so maybe uWSGI fails to respawn them due to low memory in the vassal's cgroup? It looks that way, since most of the OOM killer messages come in pairs, so the newly spawned worker is probably being killed right after it starts (there are a few ruby cron tasks attached and they probably eat all the memory).

Oct 20 17:16:51 a178 kernel: [4442785.953510] Killed process 12610 (uwsgi) total-vm:371768kB, anon-rss:219004kB, file-rss:2060kB
Oct 20 17:16:51 a178 kernel: [4442786.042248] Killed process 12613 (uwsgi) total-vm:371768kB, anon-rss:219000kB, file-rss:232kB
Oct 20 17:37:06 a178 kernel: [4443998.129133] Killed process 12640 (uwsgi) total-vm:371768kB, anon-rss:219004kB, file-rss:2064kB
Oct 20 17:37:06 a178 kernel: [4443998.188021] Killed process 28489 (uwsgi) total-vm:371768kB, anon-rss:219000kB, file-rss:232kB
Oct 20 17:37:09 a178 kernel: [4444000.896209] Killed process 2002 (uwsgi) total-vm:371768kB, anon-rss:219000kB, file-rss:1864kB
Oct 20 18:08:41 a178 kernel: [4445888.114885] Killed process 2011 (uwsgi) total-vm:371768kB, anon-rss:219004kB, file-rss:2064kB
Oct 20 18:08:41 a178 kernel: [4445888.635278] Killed process 5014 (uwsgi) total-vm:371768kB, anon-rss:219000kB, file-rss:448kB
Oct 20 18:54:28 a178 kernel: [4448628.028641] Killed process 5017 (uwsgi) total-vm:371768kB, anon-rss:219004kB, file-rss:2064kB
Oct 20 18:54:28 a178 kernel: [4448628.145161] Killed process 21276 (uwsgi) total-vm:371768kB, anon-rss:219000kB, file-rss:232kB
Oct 20 19:17:47 a178 kernel: [4450024.092154] Killed process 22722 (uwsgi) total-vm:371768kB, anon-rss:219004kB, file-rss:2064kB
Oct 20 19:17:48 a178 kernel: [4450024.283838] Killed process 16056 (uwsgi) total-vm:371768kB, anon-rss:219000kB, file-rss:24kB

So the issue seems to be that uWSGI doesn't detect the condition where there is no memory available to respawn a worker that was killed by the OOM killer, and the new worker gets killed right away (?).

The dropbear issues are probably a separate thing; maybe uWSGI is trying to restart it when it shouldn't, since the old process is still running fine.

prymitive commented 11 years ago

From a monitoring POV it would be great if uWSGI could know that a worker was killed by the OOM killer (if there is a way other than parsing dmesg/logs) and track the number of such events in the stats socket/carbon.

Another idea is to add stats for cgroup memory usage to the carbon graphs generated by uWSGI; I'll try to make a patch for that.
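
(For context, the data such a patch would export is just a couple of counters exposed by the memory cgroup. A minimal standalone sketch of reading them, assuming a cgroup v1 layout and a made-up vassal cgroup path, could look like this:)

#include <stdio.h>

/* Read a single numeric value from a cgroup file (cgroup v1 layout assumed). */
static unsigned long long read_cgroup_value(const char *path) {
    unsigned long long value = 0;
    FILE *f = fopen(path, "r");
    if (!f) return 0;
    if (fscanf(f, "%llu", &value) != 1) value = 0;
    fclose(f);
    return value;
}

int main(void) {
    /* usage_in_bytes = current memory usage of the cgroup,
       failcnt = how many times the cgroup hit its memory limit
       (a rough proxy for OOM pressure inside the vassal). */
    printf("usage=%llu failcnt=%llu\n",
           read_cgroup_value("/cgroup/app/memory.usage_in_bytes"),
           read_cgroup_value("/cgroup/app/memory.failcnt"));
    return 0;
}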

unbit commented 11 years ago

This should not happen, as the master receives a notification of each death (in fact at 19:17:47, pid 18535 dies). Are you sure the PID namespace is not fooling you? (The pids reported inside a namespace are different from the ones in the real system.)

Regarding metrics read from cgroup files: it is a great idea, but please wait for the first commit of the custom metrics subsystem, as I would like to have a configurable system like:

file-metric = memerrors /cgroup/foobar/memory.failcnt

prymitive commented 11 years ago

You are right that I checked the pid in the wrong system: uWSGI reports the pid from the app's namespace and I was checking in the host system's pid namespace. But I still had no running worker for this app, so I'm sure that uWSGI thought it had one worker running while there was none. I've increased the memory limit for this app, but it should be easy to reproduce: I'll just connect via dropbear and run something that allocates all available memory (like a RAM memory tester). Anything in particular I should look for?
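
(For reference, "something that allocates all available memory" can be as small as the following sketch, which keeps allocating and touching 10 MB chunks until the cgroup limit is hit and the OOM killer steps in:)

#include <stdlib.h>
#include <string.h>

int main(void) {
    const size_t chunk = 10 * 1024 * 1024;
    for (;;) {
        char *p = malloc(chunk);
        if (!p) break;            /* allocation may also simply fail */
        memset(p, 1, chunk);      /* touch the pages so they are really charged */
    }
    return 0;
}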

unbit commented 11 years ago

Finding the mapping between the host system's pids and the guest's pids should clarify the situation. If you manage to reproduce the error, report ps aux for both systems and the JSON stats.

Regarding dropbear, wouldn't it be better to attach it using the smart subsystem (setting the pidfile)?
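
(For reference, a smart-attach-daemon line for this setup might look like the one below; the pidfile path is made up and dropbear's -P pidfile flag is assumed to be available in this build, so adjust to taste:)

smart-attach-daemon = /run/dropbear-5054.pid /usr/sbin/dropbear -b /etc/dropbear/banner -w -s -j -k -p 5054 -I 300 -P /run/dropbear-5054.pid

(Unlike plain attach-daemon, the smart variant expects the daemon to fork into the background and write its pid to the given file, which uWSGI then watches; that is why the foreground -F and stderr-logging -E flags are dropped in this example.)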

prymitive commented 11 years ago

It's hard to reproduce this using a different app, but during debugging I've noticed something weird:

Oct 20 21:30:31 localhost app: DAMN ! worker 1 (pid: 1053) died, killed by signal 9 :( trying respawn ...
Oct 20 21:30:31 localhost app: Respawned uWSGI worker 1 (new pid: 1252)
Oct 20 21:30:31 localhost app: subprocess 1252 exited by signal 9

After that I don't have any workers running anymore, and the stats socket tells me that I have an idle worker with pid 1053.

prymitive commented 11 years ago

Using smart-attach-daemon fixed the dropbear issue. I had used the dumb spawner because I didn't want dropbear to keep running after my vassal is stopped, but if needed I can handle that in my management scripts.

prymitive commented 11 years ago

Looks like a race condition between the worker death checks, the daemon death checks and the worker pid value updates(?). Maybe the worker's pid is not updated right away after a respawn, so if the worker dies very quickly, before uWSGI reads the new value, the worker death checks don't catch it (so no respawn) and the daemon death checks identify it as some extra subprocess getting killed. Maybe it would be enough if the worker pid was updated directly in uwsgi_respawn_worker, just after forking the new worker process? I can't find where the new pid is written to uwsgi.workers[n].pid.
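
(To illustrate the suggested ordering, here is a standalone sketch, not the actual uWSGI source; the worker table is reduced to a tiny stand-in for uwsgi.workers[n]. The point is that the master records the new pid immediately after fork(), so even a worker killed an instant later is still matched to its slot by the death check:)

#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

struct worker { pid_t pid; };          /* stand-in for uwsgi.workers[n] */
static struct worker workers[2];

static pid_t respawn_worker(int n) {
    pid_t pid = fork();
    if (pid == 0) {
        /* child: pretend to be a worker that dies immediately (as if OOM-killed) */
        _exit(1);
    }
    /* master: publish the new pid right away, before doing anything else */
    workers[n].pid = pid;
    return pid;
}

int main(void) {
    int status;
    respawn_worker(1);
    pid_t dead = waitpid(-1, &status, 0);
    if (dead == workers[1].pid) {
        /* because the pid was updated immediately, the death is attributed to
           worker 1 and a respawn is triggered, instead of the dead pid being
           mistaken for an unrelated subprocess */
        printf("worker 1 (pid %d) died, respawning\n", (int)dead);
        respawn_worker(1);
        waitpid(-1, &status, 0);
    }
    return 0;
}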

unbit commented 11 years ago

Makes sense, I will check it; reproducing it should be pretty easy.

unbit commented 11 years ago

OK, I was able to reproduce it; I will try a fix.

unbit commented 11 years ago

Can you try with the latest commit?

prymitive commented 11 years ago

I'm testing it with the app I spotted the issue with. I also reverted to the dumb attach-daemon, since this issue was much easier to spot when dropbear was constantly trying to respawn. It didn't happen very often, so it will take some time before I can confirm whether it's fixed; I'll report back later.

prymitive commented 11 years ago

I'm seeing quick kills just after a respawn, but so far the workers are being respawned again properly every time:

Oct 21 11:43:30 localhost app: DAMN ! worker 1 (pid: 1801) died, killed by signal 9 :( trying respawn ...
Oct 21 11:43:30 localhost app: Respawned uWSGI worker 1 (new pid: 2046)
Oct 21 11:43:30 localhost app: DAMN ! worker 1 (pid: 2046) died, killed by signal 9 :( trying respawn ...
Oct 21 11:43:30 localhost app: Respawned uWSGI worker 1 (new pid: 2048)

If I can't hit this issue in the next hour, then we can safely assume it's fixed.

prymitive commented 11 years ago

I think we can assume this is fixed; my crons, workers and daemons were killed > 200 times and everything works fine. Thanks

gvenka008c commented 8 years ago

@prymitive I am seeing the errors below repeatedly. Any thoughts on how to fix them?

respawned uWSGI http 1 (pid: 4361)
respawned uWSGI http 1 (pid: 4362)