[Closed] prymitive closed this issue 11 years ago
From a monitoring POV it would be great if uWSGI could know that a worker was killed by OOM (if there is a way other than parsing dmesg/logs) and track the number of such events in the stats socket/carbon.
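In the meantime, the stats server already exposes the master's view of the workers as JSON (enabled with `--stats`). A minimal sketch of parsing it, assuming the usual layout with a top-level `workers` array; the field subset shown in `SAMPLE` is abridged and only illustrative:

```python
import json
import socket

# Abridged sample of the stats JSON shape; real dumps carry many more fields.
SAMPLE = '{"workers": [{"id": 1, "pid": 1053, "status": "idle"}]}'

def worker_pids(stats_json):
    """Return the pid of each worker from a uWSGI stats JSON document."""
    stats = json.loads(stats_json)
    return [w["pid"] for w in stats.get("workers", [])]

def poll_stats(host="127.0.0.1", port=9191):
    """Fetch one stats document from a TCP stats socket (uwsgi --stats host:port)."""
    with socket.create_connection((host, port)) as s:
        buf = b""
        while chunk := s.recv(4096):
            buf += chunk
    return buf.decode()
```

Comparing the pids reported here against the processes actually alive on the box is exactly what surfaces the phantom-worker situation described later in this thread.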
Another idea is to add stats for cgroup memory usage to the carbon graphs generated by uWSGI, I'll try to make a patch for that.
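Until such a patch lands, a sidecar can read the cgroup files directly. A minimal sketch, assuming cgroup v1 memory controller files (e.g. `memory.failcnt`, `memory.usage_in_bytes`) that each hold a single integer; the cgroup path is whatever your vassal's cgroup is mounted at:

```python
from pathlib import Path

def read_cgroup_metric(cgroup_dir, name):
    """Read a single numeric value from a cgroup control file.

    cgroup_dir is the vassal's memory cgroup directory; cgroup v1
    files such as memory.failcnt and memory.usage_in_bytes contain
    one integer each.
    """
    return int(Path(cgroup_dir, name).read_text().strip())
```

A rising `memory.failcnt` is a cheap proxy for "this cgroup is hitting its limit", which correlates with the OOM kills discussed above.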
This should not happen, as the master receives a notification of each death (in fact, at 19:17:47 pid 18535 dies). Are you sure the pid namespace is not tricking you? (The pids reported inside a namespace are different from the ones on the real system.)
Exposing metrics from cgroup files is a great idea, but please wait for the first commit of the custom metric subsystem, as I would like to have a configurable system like:
file-metric = memerrors /cgroup/foobar/memory.failcnt
You are right, I checked the pid in the wrong system: uWSGI reports the pid from the app's namespace and I was checking in the host system's pid namespace. But I still had no running worker for this app, so I'm sure that uWSGI thought it had one worker running while there was none. I've increased the memory limit for this app, but it should be easy to reproduce: I'll just connect to dropbear and run something that allocates all available memory (like a RAM memory tester). Anything particular I should look for?
Finding the mapping between the host system's pids and the guest's pids should clarify the situation. If you manage to reproduce the error, report `ps aux` from both systems and the JSON stats.
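One way to get that mapping without guessing: on Linux 4.1+ the `NSpid:` line of `/proc/<pid>/status` lists a process's pid in every nested pid namespace. A small sketch, meant to be run on the host side:

```python
def ns_pids(pid):
    """Return the pids of a process in each nested pid namespace,
    outermost-visible first, read from /proc/<pid>/status.

    The NSpid: field needs Linux 4.1+; on older kernels we fall
    back to the plain Pid: line (a single-element list).
    """
    pids = None
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("NSpid:"):
                return [int(p) for p in line.split()[1:]]
            if line.startswith("Pid:"):
                pids = [int(line.split()[1])]
    return pids
```

Running this against the master's children on the host shows immediately which namespace-local pid (like 18535 above) maps to which host pid.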
Regarding dropbear, would it not be better to attach it using the smart subsystem (setting the pidfile)?
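For reference, the smart variant takes a pidfile before the command, so the master only respawns the daemon when the process behind the pidfile is gone (paths below are illustrative):

```ini
[uwsgi]
; respawn dropbear only when the pidfile's process has died
smart-attach-daemon = /var/run/dropbear.pid /usr/sbin/dropbear -F -P /var/run/dropbear.pid
```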
It's hard to reproduce that using a different app, but during debugging I've noticed something weird:
Oct 20 21:30:31 localhost app: DAMN ! worker 1 (pid: 1053) died, killed by signal 9 :( trying respawn ...
Oct 20 21:30:31 localhost app: Respawned uWSGI worker 1 (new pid: 1252)
Oct 20 21:30:31 localhost app: subprocess 1252 exited by signal 9
After that I don't have any more workers running, and the stats socket tells me that I have an idle worker with pid 1053.
Using smart-attach-daemon fixed the dropbear issue. I used the dumb spawner since I didn't want dropbear to keep running after my vassal is stopped, but if needed I can handle that in my management scripts.
Looks like a race condition between the worker death checks, the daemon death checks, and the worker pid value updates(?). Maybe the worker's pid is not updated right away after a respawn, so if the new worker dies very quickly, before uWSGI reads the new value, the worker death check doesn't catch it (so no respawn) and the daemon death check identifies it as some extra subprocess getting killed. Maybe it would be enough if the worker pid was updated directly in uwsgi_respawn_worker, just after forking the new worker process? I can't find where the new pid is written to uwsgi.workers[n].pid.
Makes sense, I will check it; reproducing it should be pretty easy.
OK, I was able to reproduce it, I will try a fix.
Can you try with the latest commit ?
I'm testing it with the app I spotted the issue with. I also reverted to dumb attach-daemon, since it looks like this issue was much easier to spot when dropbear was constantly trying to respawn. It didn't happen very often, so it will take some time before I can confirm whether it's fixed; I'll report back later.
I'm seeing quick kills just after respawn, but so far workers are being respawned again properly every time:
Oct 21 11:43:30 localhost app: DAMN ! worker 1 (pid: 1801) died, killed by signal 9 :( trying respawn ...
Oct 21 11:43:30 localhost app: Respawned uWSGI worker 1 (new pid: 2046)
Oct 21 11:43:30 localhost app: DAMN ! worker 1 (pid: 2046) died, killed by signal 9 :( trying respawn ...
Oct 21 11:43:30 localhost app: Respawned uWSGI worker 1 (new pid: 2048)
If I can't hit this issue in the next hour, then we can safely assume it's fixed.
I think we can assume this is fixed; my crons, workers and daemons were killed > 200 times and everything works fine. Thanks
@prymitive I am seeing the below errors repeatedly. Any thoughts on how to fix it?
respawned uWSGI http 1 (pid: 4361) respawned uWSGI http 1 (pid: 4362)
So I should have a worker process with pid 18535, but there is no such process. My requests are queuing in the backlog, and cheaper busyness will eventually spawn a new worker.
I have these messages in my logs:
But dmesg tells me that my workers are being killed by the OOM killer, so maybe uWSGI fails to respawn them due to low memory in the vassal's cgroup? It looks that way, since most OOM killer messages are doubled: the newly spawned worker is probably being killed just after starting (there are a few ruby cron tasks attached and they probably eat all the memory).
So the issue seems to be that uWSGI doesn't detect the condition where there is no memory available to respawn a worker that was killed by OOM, and the new worker gets killed right away (?).
The dropbear issues are probably a separate thing; maybe uWSGI is trying to restart it while it shouldn't, since the old process still runs fine.