cillianderoiste opened this issue 1 year ago
@cillianderoiste Try using py-spy's dump command to find out what the instances are doing when they are stuck.
Thanks, here's what I see:
Thread 1539468 (idle): "MainThread"
handle_write_event (waitress/wasyncore.py:514)
write (waitress/wasyncore.py:113)
poll (waitress/wasyncore.py:189)
loop (waitress/wasyncore.py:245)
run (waitress/server.py:322)
serve (waitress/__init__.py:19)
serve_paste (plone/recipe/zope2instance/ctl.py:917)
serve (plone/recipe/zope2instance/ctl.py:942)
serve (Zope2/Startup/serve.py:203)
run (Zope2/Startup/serve.py:217)
main (Zope2/Startup/serve.py:251)
<module> (Zope2/Startup/serve.py:255)
<module> (interpreter:342)
Thread 1539618 (idle): "zeostorage zeo client networking thread"
select (selectors.py:468)
_run_once (asyncio/base_events.py:1823)
run_forever (asyncio/base_events.py:570)
run (ZEO/asyncio/client.py:892)
run (threading.py:870)
_bootstrap_inner (threading.py:932)
_bootstrap (threading.py:890)
Thread 1539787 (idle): "waitress-0"
wait (threading.py:302)
handler_thread (waitress/task.py:72)
run (threading.py:870)
run (sentry_sdk/integrations/threading.py:67)
_bootstrap_inner (threading.py:932)
_bootstrap (threading.py:890)
Thread 1539788 (idle): "waitress-1"
wait (threading.py:302)
handler_thread (waitress/task.py:72)
run (threading.py:870)
run (sentry_sdk/integrations/threading.py:67)
_bootstrap_inner (threading.py:932)
_bootstrap (threading.py:890)
Thread 1539789 (idle): "waitress-2"
wait (threading.py:302)
handler_thread (waitress/task.py:72)
run (threading.py:870)
run (sentry_sdk/integrations/threading.py:67)
_bootstrap_inner (threading.py:932)
_bootstrap (threading.py:890)
Thread 1539791 (idle): "waitress-3"
wait (threading.py:302)
handler_thread (waitress/task.py:72)
run (threading.py:870)
run (sentry_sdk/integrations/threading.py:67)
_bootstrap_inner (threading.py:932)
_bootstrap (threading.py:890)
Thread 1539806 (idle): "slowlog_monitor"
wait (threading.py:302)
get (queue.py:170)
run (slowlog/monitor.py:85)
run (sentry_sdk/integrations/threading.py:67)
_bootstrap_inner (threading.py:932)
_bootstrap (threading.py:890)
We are using the latest version of waitress 2.1.2, this is where it seems to be stuck: https://github.com/Pylons/waitress/blob/v2.1.2/src/waitress/wasyncore.py#L514
I don't know how to debug this further, but I did notice that if I pin waitress to 2.0.0 I can still reproduce the issue, but with 1.4.4 I cannot.
waitress.wasyncore is not public API.
I haven't seen this issue at all with a few Plone 6.0.0.2 setups that are actively in use, but those setups are smaller and use only one zeoclient / zeoserver. One of those sites runs on a server with 3 other active Plone sites, with default recipe settings for threads and http-fast-listen. Could this hint at a race condition with multiple zeoclients?
@tschorr
I haven't used waitress for a while now. Maybe in addition to the py-spy dump you could try strace -p <pid> on one of the hanging processes to get more information on what's going on. And by the way, what is the output of ulimit -a?
@tschorr
I haven't used waitress for a while now. Maybe in addition to the py-spy dump you could try strace -p <pid> on one of the hanging processes to get more information on what's going on.
https://github.com/Pylons/waitress/issues/396#issuecomment-1386163765
Here is an analysis of the code.
@yurj I have read it :-)
Coming from here, I'm still fighting with this even when http-fast-listen is turned off. My strace output looks like this:
select(15, [12 13], [14], [12 13 14], {tv_sec=1, tv_usec=0}) = 1 (out [14], left {tv_sec=0, tv_usec=999999})
select(15, [12 13], [14], [12 13 14], {tv_sec=1, tv_usec=0}) = 1 (out [14], left {tv_sec=0, tv_usec=999999})
... (the same line repeats)
and runs forever.
ulimit output:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 256616
max locked memory (kbytes, -l) 65536
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 256616
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
I would be glad for hints.
@petschki before anything else did you ever try increasing the allowed number of open files? Sockets are files and 1024 is not really a lot.
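For reference, a quick way to check and raise the soft limit for the current shell session (a sketch; making the change permanent usually happens in /etc/security/limits.conf or the service's systemd unit instead):

```shell
# Show the current soft limit for open file descriptors
ulimit -n

# Raise it for this shell session only, e.g. to 4096
# (this can fail if it exceeds the hard limit, shown by: ulimit -Hn)
ulimit -n 4096
```

The daemon process inherits the limit of the shell or service manager that starts it, so the raise has to happen before the zeoclient is launched.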
The strace dump seems to confirm it is stuck in waitress, because asyncio (ZEO) is supposed to use poll. You could try toggling waitress's asyncore_use_poll parameter and see if the strace dump switches to poll.
For buildout-less Plone 6, I factored out (some time ago) this code snippet to here: https://github.com/plone/waitress_fastlisten/blob/main/waitress_fastlisten.py
Probably the same problem will pop up there.
@tschorr thanks for the pointers. Switching asyncore_use_poll = True indeed changes the output to:
poll([{fd=12, events=POLLIN|POLLPRI}, {fd=13, events=POLLIN|POLLPRI}], 2, 1000) = 0 (Timeout)
poll([{fd=12, events=POLLIN|POLLPRI}, {fd=13, events=POLLIN|POLLPRI}], 2, 1000) = 0 (Timeout)
... (the same line repeats)
The site is running well, but I will increase the number of open files and switch back to the default settings for now. Can this flag be set as an environment variable or config option somehow?
@petschki this time it seems to be polling only for readable sockets.
Can this flag be set as environment or config option somehow?
Isn't it possible to set this in the paste config (zope.ini / wsgi.ini)? But I haven't tried it myself.
select(15, [12 13], [14], [12 13 14], {tv_sec=1, tv_usec=0}) = 1 (out [14], left {tv_sec=0, tv_usec=999999})
@petschki out of curiosity, if you see this again, could you inspect the one file descriptor that needs to become writable (14 in this case) with lsof -d 14? If there's a lot of output, maybe the lines starting with python will do.
@tschorr this is the lsof -d output of the current process which got stuck in select():
root@fischereiverband:~# lsof -d 14 | grep python
python3.11 113676 fischereiverband 14u IPv4 3906056 0t0 TCP localhost:8090->localhost:54788 (ESTABLISHED)
python3.11 144977 fischereiverband 14u sock 0,9 0t0 4367294 protocol: TCP
process:
root@fischereiverband:~# ps uax | grep 144977
fischer+ 144977 12.1 0.3 595628 229708 ? Rl 01:11 126:03 /home/fischereiverband/.pyenv/versions/3.11.1/envs/python-3.11.1/bin/python3.11 /home/fischereiverband/plone6/parts/zeoclient4/bin/interpreter /home/fischereiverband/plone6/eggs/Zope-5.7.3-py3.11.egg/Zope2/Startup/serve.py /home/fischereiverband/plone6/parts/zeoclient4/etc/wsgi.ini
strace:
select(15, [12 13], [14], [12 13 14], {tv_sec=1, tv_usec=0}) = 1 (out [14], left {tv_sec=0, tv_usec=999999})
select(15, [12 13], [14], [12 13 14], {tv_sec=1, tv_usec=0}) = 1 (out [14], left {tv_sec=0, tv_usec=999999})
... (the same line repeats)
Isn't it possible to set this in the paste config (zope.ini / wsgi.ini)? But I haven't tried it myself.
Yes, you can set asyncore_use_poll = true in the [server:main] section of wsgi.ini.
poll([{fd=12, events=POLLIN|POLLPRI}, {fd=13, events=POLLIN|POLLPRI}], 2, 1000) = 0 (Timeout)
The site is running well
So assuming that changing the allowed number of open files didn't fix the issue (@petschki can you confirm?), but switching to asyncore_use_poll = true does, one possibility to fix this would be to add this setting in p.r.zope2instance, e.g. here. To me this looks like a reasonable default, but I might be missing something.
@tschorr this is the lsof -d of the current process which got stuck in select()
python3.11 113676 fischereiverband 14u IPv4 3906056 0t0 TCP localhost:8090->localhost:54788 (ESTABLISHED) python3.11 144977 fischereiverband 14u sock 0,9 0t0 4367294 protocol: TCP
It's waiting for a client connection to become writable, and select is indeed reporting a writable connection. So why doesn't waitress finish writing the response and drop the connection? Maybe to further investigate this you could start adding log statements in waitress, e.g. here and/or here.
I've added log statements at the recommended lines in waitress/wasyncore.py and will keep you informed.
python3.11 144977 fischereiverband 14u sock 0,9 0t0 4367294 protocol: TCP
Since your lsof output showed two python processes each with a file descriptor 14... but you ran ps on 144977, I wanted to point out that this shows a socket that is not connected and hasn't had listen(), bind() or connect() called on it.
Add debug statements that show what sockets are being created in the fast-listen loop, then turned into a string, and then turned back into a list of sockets that are passed to waitress, so you know which sockets waitress is supposed to listen on.
Then I would recommend adding a debug statement to print the sockets that waitress has accepted here:
https://github.com/Pylons/waitress/blob/v2.1.2/src/waitress/server.py#L304
or here:
https://github.com/Pylons/waitress/blob/v2.1.2/src/waitress/channel.py#L56
So you know which sockets waitress has accepted.
Next up, you might want to log what type self is in handle_write_event, since that will tell you whether it's a waitress channel or server.
Let me simplify the code down a bit. What fast-listen is doing here looks like this:
import socket


def prebind_socket():
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # We can ignore calling bind() because we just care that this socket is in listening mode
    print(f"pre_bound socket is: {s}")
    s.listen(5)
    print("Called listen on pre bound socket")
    fileno = s.fileno()
    return str(fileno)


def app_startup():
    # We are now going to do app startup, which includes creating a new socket and
    # connecting to a remote host (we simulate that by calling socket + close) to
    # fetch some data
    fetch_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # fetch_socket.connect()
    print(f"Our fetch socket is: {fetch_socket}")
    fetch_socket.close()


prebound_socket_fileno = prebind_socket()
print(prebound_socket_fileno)

# Simulate app startup doing a whole bunch of socket goodness
app_startup()

# Take our pre-bound socket created as dangling and turn it back into a socket
prebound_socket = socket.fromfd(
    int(prebound_socket_fileno), socket.AF_INET, socket.SOCK_STREAM
)
print(f"Our socket we are listening on is: {prebound_socket}")

# This should block waiting for a connection to arrive
print("calling accept() on socket")
prebound_socket.accept()
We create the socket with dispatcher(), which is in scope only for a little while, turn it into a file descriptor number, and after that turn that into a long string of numbers. That is prebind_socket above.
Then the next step is app startup. The application startup does a bunch of socket work itself; it may make connections to the outside world, open files, all that fun stuff. During that time the garbage collector may or may not clean up the dispatcher() that was temporarily created, and with it any related garbage (which includes the socket we created, since we don't store the actual socket anywhere else, just a stringified file descriptor number).
So in app_startup() I did a mock call to socket to create a new socket that would connect to a remote service to fetch some information. Then after app_startup() is complete, I use the previously saved prebound_socket_fileno and turn that back into a socket, then call accept() on the socket, since it was already set up for that because we called .listen().
This is the output:
% python3.11 race-socket-cleanup.py
pre_bound socket is: <socket.socket fd=3, family=2, type=1, proto=0, laddr=('0.0.0.0', 0)>
Called listen on pre bound socket
3
Our fetch socket is: <socket.socket fd=3, family=2, type=1, proto=0, laddr=('0.0.0.0', 0)>
Traceback (most recent call last):
File "/Users/xistence/Projects/temp/race-socket-cleanup.py", line 36, in <module>
prebound_socket = socket.fromfd(
^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/python@3.11/3.11.1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/socket.py", line 546, in fromfd
nfd = dup(fd)
^^^^^^^
OSError: [Errno 9] Bad file descriptor
That's because, as you can see from the output, the prebound socket is on file descriptor 3, but since it goes out of scope and we return just the stringified number, Python garbage collection closes the socket. Then our fetch socket (in app_startup()) creates a new socket, which gets the exact same file descriptor. App startup completes and we attempt to take our prebound socket and re-use it, but we can't, since that file descriptor is no longer valid.
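The same mechanism can be shown in isolation (assuming CPython on a POSIX system: refcounting closes the socket as soon as the last reference goes away, and the kernel hands out the lowest free descriptor number):

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
saved_fd = s.fileno()
del s  # last reference gone: CPython closes the socket, freeing the descriptor

s2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# The kernel reuses the lowest available descriptor, so the number we "saved"
# now belongs to a completely different socket:
assert s2.fileno() == saved_fd
s2.close()
```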
However what happens if we happen to create a whole bunch of sockets/files and thus they are valid?
prebound_socket_fileno = prebind_socket()
print(prebound_socket_fileno)
# Simulate app startup doing a whole bunch of socket goodness
app_startup()
other = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print(f"Test socket: {other}")
# Take our pre-bound socket created as dangling and turn it back into a socket
prebound_socket = socket.fromfd(
    int(prebound_socket_fileno), socket.AF_INET, socket.SOCK_STREAM
)
print(f"Our socket we are listening on is: {prebound_socket}")
# This should block waiting for a connection to arrive
print("calling accept() on socket")
prebound_socket.accept()
Notice other, where we create a new socket that stays alive since it is in scope. Now this is the output from the script run:
% python3.11 race-socket-cleanup.py
pre_bound socket is: <socket.socket fd=3, family=2, type=1, proto=0, laddr=('0.0.0.0', 0)>
Called listen on pre bound socket
3
Our fetch socket is: <socket.socket fd=3, family=2, type=1, proto=0, laddr=('0.0.0.0', 0)>
Test socket: <socket.socket fd=3, family=2, type=1, proto=0, laddr=('0.0.0.0', 0)>
Our socket we are listening on is: <socket.socket fd=4, family=2, type=1, proto=0, laddr=('0.0.0.0', 0)>
calling accept() on socket
Traceback (most recent call last):
File "/Users/xistence/Projects/temp/race-socket-cleanup.py", line 47, in <module>
prebound_socket.accept()
File "/opt/homebrew/Cellar/python@3.11/3.11.1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/socket.py", line 294, in accept
fd, addr = self._accept()
^^^^^^^^^^^^^^
OSError: [Errno 22] Invalid argument
Calling .accept() failed since the socket is not the original listen socket, but it has the same file descriptor, so socket.fromfd() works because it can dup() the file descriptor.
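That dup() behavior is easy to verify with the standard library alone: socket.fromfd duplicates the descriptor rather than adopting it, so the copy always lands on a new descriptor number.

```python
import socket

orig = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
copy = socket.fromfd(orig.fileno(), socket.AF_INET, socket.SOCK_STREAM)

# fromfd() dup()s the descriptor, so both sockets refer to the same kernel
# object but through different descriptor numbers:
assert copy.fileno() != orig.fileno()

copy.close()
orig.close()
```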
Anyway, this should hopefully showcase why the current code is bad. You are basically racing the Python garbage collection system to try and have
socket.fromfd(
    int(prebound_socket_fileno), socket.AF_INET, socket.SOCK_STREAM
)
turn the file descriptor/socket created in dispatcher() into a valid socket that you can pass to waitress. And if you are late, or the file descriptor has been re-used in the meantime for another socket, it'll fail in fun and interesting ways, or may not even be a valid socket.
The surprise to me is that apparently you are winning the race often enough for this to not have been an issue yet, but the code as it is written right now is wrong and should be fixed. Create the sockets, and hold on to them as a list of sockets, and don't do the socket -> fd -> stringify -> fd -> socket dance, it is unnecessary.
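A minimal sketch of the fix described above: create the listening sockets once, keep the socket objects themselves alive in a list, and hand those objects to waitress (which accepts pre-bound sockets via its sockets argument), instead of round-tripping through stringified file descriptors. The prebind_sockets helper name is hypothetical, not part of any of the projects discussed here.

```python
import socket


def prebind_sockets(addresses):
    """Bind and listen, and return the live socket objects so they stay referenced."""
    socks = []
    for host, port in addresses:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind((host, port))
        s.listen(5)
        socks.append(s)
    return socks


# Keep the list in scope through app startup; the sockets survive garbage
# collection because we hold real references, not just descriptor numbers.
socks = prebind_sockets([("127.0.0.1", 0)])  # port 0: any free port, for the demo

# Later, pass the socket objects straight to waitress, e.g.:
#   waitress.serve(app, sockets=socks)
assert all(s.fileno() >= 0 for s in socks)
for s in socks:
    s.close()
```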
I still have no idea why or how .connected would be False, and why it would be stuck in a hard loop there, since it should call handle_write, but I have no idea what the type of self is there.
@bertjwregeer thanks for your hints! I've put some logging statements at your suggested places and will try to help out here. But the original authors of the zeoclient control script might have more ideas about what's going wrong here (@ale-rt @tschorr ?)
So I have this zeoclient which says this after startup (lines 304 and 306 are in waitress/server.py, line 57 is in waitress/channel.py):
2023-01-31 07:49:05,281 INFO [waitress:304][MainThread] self = <waitress.server.TcpWSGIServer listening 0.0.0.0:8082 at 0x7fdba9ccf210>
2023-01-31 07:49:05,281 INFO [waitress:306][MainThread] self.accept() = (<socket.socket fd=15, family=2, type=1, proto=0, laddr=('127.0.0.1', 8082), raddr=('127.0.0.1', 56994)>, ('127.0.0.1', 56994))
2023-01-31 07:49:05,281 INFO [waitress:57][MainThread] server=<waitress.server.TcpWSGIServer listening 0.0.0.0:8082 at 0x7fdba9ccf210>, sock=<socket.socket fd=15, family=2, type=1, proto=0, laddr=('127.0.0.1', 8082), raddr=('127.0.0.1', 56994)>, addr=('127.0.0.1', 56994), adj=<waitress.adjustments.Adjustments object at 0x7fdba1ca5bd0>, map={12: <waitress.trigger.trigger connected at 0x7fdba9ccf150>, 13: <waitress.server.TcpWSGIServer listening 0.0.0.0:8082 at 0x7fdba9ccf210>, 14: <waitress.channel.HTTPChannel 127.0.0.1:36024 at 0x7fdba9ccec10>}
With lsof -d 15 | grep python:
python3.1 37728 fischereiverband 15u IPv4 824934 0t0 TCP localhost:8090->localhost:36164 (ESTABLISHED)
python3.1 37729 fischereiverband 15u sock 0,9 0t0 822209 protocol: TCP
Now I clicked through some folders, tried to delete one with 10k items, and made a second request on the same zeoclient; the log says:
2023-01-31 08:11:16,110 INFO [waitress:304][MainThread] self = <waitress.server.TcpWSGIServer listening 0.0.0.0:8082 at 0x7fdba9ccf210>
2023-01-31 08:11:16,110 INFO [waitress:306][MainThread] self.accept() = (<socket.socket fd=16, family=2, type=1, proto=0, laddr=('127.0.0.1', 8082), raddr=('127.0.0.1', 49162)>, ('127.0.0.1', 49162))
2023-01-31 08:11:16,110 INFO [waitress:57][MainThread] server=<waitress.server.TcpWSGIServer listening 0.0.0.0:8082 at 0x7fdba9ccf210>, sock=<socket.socket fd=16, family=2, type=1, proto=0, laddr=('127.0.0.1', 8082), raddr=('127.0.0.1', 49162)>, addr=('127.0.0.1', 49162), adj=<waitress.adjustments.Adjustments object at 0x7fdba1ca5bd0>, map={12: <waitress.trigger.trigger connected at 0x7fdba9ccf150>, 13: <waitress.server.TcpWSGIServer listening 0.0.0.0:8082 at 0x7fdba9ccf210>, 14: <waitress.channel.HTTPChannel 127.0.0.1:36024 at 0x7fdba9ccec10>, 15: <waitress.channel.HTTPChannel connected 127.0.0.1:56994 at 0x7fdba82de650>}
but the lsof -d 16 (<socket.socket fd=16 ...>?) has no python process:
superviso 2387 fischereiverband 16r FIFO 0,13 0t0 820045 pipe
varnishd 37733 fischereiverband 16r DIR 9,4 4096 38821481 /home/fischereiverband/plone6/parts/varnish-build/var/varnish/fischereiverband/_.vsm_child
cache-mai 37746 fischereiverband 16r DIR 9,4 4096 38821481 /home/fischereiverband/plone6/parts/varnish-build/var/varnish/fischereiverband/_.vsm_child
Maybe also interesting for you, the strace -p 37729 output:
select(15, [12 13], [14], [12 13 14], {tv_sec=1, tv_usec=0}) = 1 (out [14], left {tv_sec=0, tv_usec=999998})
futex(0x7fdbb0b36310, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=81276, tv_nsec=2419173}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x7fdbb0b36318, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fdbb0b36310, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7fdbb0b36318, FUTEX_WAKE_PRIVATE, 1) = 1
select(15, [12 13], [14], [12 13 14], {tv_sec=1, tv_usec=0}) = 1 (out [14], left {tv_sec=0, tv_usec=999998})
futex(0x7fdbb0b36314, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=81276, tv_nsec=2566080}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x7fdbb0b362fc, FUTEX_WAKE_PRIVATE, 2147483647) = 1
futex(0x7fdbb0b36318, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fdbb0b36314, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=81276, tv_nsec=2638657}, FUTEX_BITSET_MATCH_ANY) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x7fdbb0b36318, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fdbb0b36318, FUTEX_WAKE_PRIVATE, 1) = 0
select(15, [12 13], [14], [12 13 14], {tv_sec=1, tv_usec=0}) = 1 (out [14], left {tv_sec=0, tv_usec=999998})
futex(0x7fdbb0b36314, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=81276, tv_nsec=2750479}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x7fdbb0b36318, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fdbb0b36314, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7fdbb0b36318, FUTEX_WAKE_PRIVATE, 1) = 1
select(15, [12 13], [14], [12 13 14], {tv_sec=1, tv_usec=0}) = 1 (out [14], left {tv_sec=0, tv_usec=999998})
futex(0x7fdbb0b36310, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=81276, tv_nsec=2904310}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x7fdbb0b362f8, FUTEX_WAKE_PRIVATE, 2147483647) = 1
futex(0x7fdbb0b36318, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fdbb0b36310, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=81276, tv_nsec=3053702}, FUTEX_BITSET_MATCH_ANY) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x7fdbb0b36318, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fdbb0b36314, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7fdbb0b36318, FUTEX_WAKE_PRIVATE, 1) = 1
select(15, [12 13], [14], [12 13 14], {tv_sec=1, tv_usec=0}) = 1 (out [14], left {tv_sec=0, tv_usec=999998})
futex(0x7fdbb0b36310, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=81276, tv_nsec=3187325}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x7fdbb0b36318, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fdbb0b36310, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7fdbb0b36318, FUTEX_WAKE_PRIVATE, 1) = 1
select(15, [12 13], [14], [12 13 14], {tv_sec=1, tv_usec=0}) = 1 (out [14], left {tv_sec=0, tv_usec=999998})
futex(0x7fdbb0b36314, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=81276, tv_nsec=3334703}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x7fdbb0b362fc, FUTEX_WAKE_PRIVATE, 2147483647) = 1
futex(0x7fdbb0b36318, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fdbb0b36314, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=81276, tv_nsec=3407330}, FUTEX_BITSET_MATCH_ANY) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x7fdbb0b36318, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fdbb0b36318, FUTEX_WAKE_PRIVATE, 1) = 0
select(15, [12 13], [14], [12 13 14], {tv_sec=1, tv_usec=0}) = 1 (out [14], left {tv_sec=0, tv_usec=999998})
and lsof -d 14 | grep python, which is obviously the zeoserver:
python3.1 37728 fischereiverband 14u IPv4 824932 0t0 TCP localhost:8090->localhost:36154 (ESTABLISHED)
python3.1 37729 fischereiverband 14u sock 0,9 0t0 816020 protocol: TCP
python3.1 37731 fischereiverband 14u sock 0,9 0t0 826721 protocol: TCP
But the original authors of the zeoclient control script might have more ideas whats going wrong here (@ale-rt @tschorr ?)
;-) nice try, but I really don't count myself as one of the original authors of the zeoclient control script. GitHub shows 20+ contributors, and that is only since 2007, when it supposedly moved there from svn.
but the lsof -d 16 (<socket.socket fd=16 ...>?) has no python process:
This is the socket used to accept new connections. It is not yet connected, and lsof may not consider it an open file. But the map:
2023-01-31 08:11:16,110 INFO [waitress:57][MainThread] server=<waitress.server.TcpWSGIServer listening 0.0.0.0:8082 at 0x7fdba9ccf210>, sock=<socket.socket fd=16, family=2, type=1, proto=0, laddr=('127.0.0.1', 8082), raddr=('127.0.0.1', 49162)>, addr=('127.0.0.1', 49162), adj=<waitress.adjustments.Adjustments object at 0x7fdba1ca5bd0>, map={12: <waitress.trigger.trigger connected at 0x7fdba9ccf150>, 13: <waitress.server.TcpWSGIServer listening 0.0.0.0:8082 at 0x7fdba9ccf210>, 14: <waitress.channel.HTTPChannel 127.0.0.1:36024 at 0x7fdba9ccec10>, 15: <waitress.channel.HTTPChannel connected 127.0.0.1:56994 at 0x7fdba82de650>}
shows file descriptors 12, 13, 14, 15 and you see those in the select
line in the strace -p
output. You should be able to find python processes for them.
and lsof -d 14 | grep python which is obviously the zeoserver:
python3.1 37728 fischereiverband 14u IPv4 824932 0t0 TCP localhost:8090->localhost:36154 (ESTABLISHED) python3.1 37729 fischereiverband 14u sock 0,9 0t0 816020 protocol: TCP python3.1 37731 fischereiverband 14u sock 0,9 0t0 826721 protocol: TCP
Yes, and the socket monitored by waitress is obviously one of the unconnected sockets as @bertjwregeer has pointed out.
haha ... sry @tschorr didn't want to blame someone here. Here are the file descriptors:
root@fischereiverband:~# lsof -d 12,13,14,15 | grep python
python3.1 37728 fischereiverband 12r REG 9,4 220346931 38801191 /home/fischereiverband/plone6/var/filestorage/Data.fs
python3.1 37728 fischereiverband 13u IPv4 854933 0t0 TCP localhost:8090->localhost:49694 (ESTABLISHED)
python3.1 37728 fischereiverband 14u IPv4 859839 0t0 TCP localhost:8090->localhost:49700 (ESTABLISHED)
python3.1 37728 fischereiverband 15u IPv4 859843 0t0 TCP localhost:8090->localhost:49704 (ESTABLISHED)
python3.1 40446 fischereiverband 12r FIFO 0,13 0t0 854935 pipe
python3.1 40446 fischereiverband 13u IPv4 854936 0t0 TCP *:8082 (LISTEN)
python3.1 40447 fischereiverband 12r FIFO 0,13 0t0 833188 pipe
python3.1 40447 fischereiverband 13u IPv4 833189 0t0 TCP *:8083 (LISTEN)
python3.1 40448 fischereiverband 12r FIFO 0,13 0t0 842172 pipe
python3.1 40448 fischereiverband 13u IPv4 842173 0t0 TCP *:8084 (LISTEN)
python3.1 40450 fischereiverband 12r FIFO 0,13 0t0 851867 pipe
python3.1 40450 fischereiverband 13u IPv4 851868 0t0 TCP *:8085 (LISTEN)
haha ... sry @tschorr didn't want to blame someone here.
No worries :-D and did I mention I don't use waitress?
@petschki:
Coming from here I'm still fighting with this even when http-fast-listen is turned off.
@bertjwregeer:
Let me simplify the code down a bit, and what fast-listen is doing here
But the data provided was gathered with fast-listen turned off and the problem persists?
and did I mention I don't use waitress?
no you didn't ... what do you use instead? bjoern is mentioned here https://zope.readthedocs.io/en/latest/operation.html#waitress-the-default-and-recommended-choice for example ...
no you didn't ... what do you use instead?
I think I did. I'm using pyruvate (I'm the author). I'm not saying waitress is a bad choice though.
@tschorr if all the data is collected without fast-listen enabled then you've gotta show more information on how you are starting waitress, what parameters are provided, more information on what the code is doing.
Also your lsof commands are just looking for a fixed file descriptor, but could you please dump by PID instead? That's going to give you far better information. Your latest lsof command shows 5 PIDs. If all of those are Python processes, are you fork()ing somewhere in the code? Can you show your ps output? Can you show a pstree output as well, just in case there is a parent/child relationship? Each thread technically gets a PID in waitress, but they are all owned by the main thread and shared with the others.
If you aren't using fast-listen, drop using the recipe's server_factory entirely since it's not netting you anything, and instead just launch waitress directly.
The fast-listen code is still broken though, that doesn't change.
@tschorr if all the data is collected without fast-listen enabled then you've gotta show more information
@bertjwregeer you are asking the wrong person - I'm not using waitress and I cannot collect any data on the issue. It's @petschki you need to ask.
Okay, but you claimed that fast-listen is turned off for this whole debug session, which would invalidate the title of this ticket and make the debugging session useless.
Still I urge you to fix the fast-listen code.
I am unsubscribing from this ticket now. Please don't use wasyncore from waitress; it is not and never will be public API, and I will be moving it to _wasyncore next release to make that even more clear.
Okay, but you claimed that fast-listen is turned off for this whole debug session
@bertjwregeer no, I didn't claim this. Note the question mark in my post. Shooting the messenger will not help to resolve this. Reading other people's posts maybe would.
I fully agree that the title isn't suitable for my debugging session at all ... sorry for that - my mistake. My setup has fast-listen turned off the whole time. I also do not have enough knowledge of the zeoclient control script or the underlying waitress technology to help debugging further here.
Reading the last comment of @bertjwregeer, it seems that the zope/plone community should think about refactoring the control scripts using waitress though.
that the zope/plone community should think about refactoring the control scripts using waitress though.
The fast-listen part that is using wasyncore.dispatcher was added when the asyncore module was still part of the python stdlib (so it was definitely 'public' then). I think I remember it was originally meant to be used for development. The import statement switched to using wasyncore after asyncore got removed from the stdlib and waitress 'vendored' it as wasyncore. Obviously this wasn't considered an issue since fast-listen only ever worked with waitress anyway. Eventually it became the default setting (why?) and eventually people wanted more than one prebound socket (not sure about the order of events but it doesn't matter).
If now waitress chooses to declare a deprecated stdlib module their private API (module comments in version 2.1.2 don't give any hint of that), a std socket could replace wasyncore.dispatcher, and of course the list of strings could become a list of sockets. Other refactoring options include changing the asyncore import statement again to stick with the dispatcher, turning fast-listen off by default, or removing the option entirely.
All this will most likely not help to explain why this happens with fast-listen turned off, why it doesn't happen with waitress 1.4.x, or why setting asyncore_use_poll = true also fixes the problem. But if nobody is interested in clarifying this (I'm not) then this is definitely an option.
Maybe it would be interesting to bypass zdaemon to see if this changes anything. But you could use the runwsgi script provided with Zope to try this (or maybe use pserve) before starting to refactor this recipe.
@tschorr Does it work with pure WSGI, w/o buildout and w/o this recipe involved? I use https://pypi.org/project/waitress-fastlisten/ which is based on the code from here and did not run into any problems.
# This file was generated by "cookiecutter-zope-instance"
[app:zope]
use = egg:Zope#main
zope_conf = %(here)s/zope.conf
[server:main]
paste.server_factory = waitress_fastlisten:main
use = egg:waitress_fastlisten#main
fast-listen = 0.0.0.0:8080
threads = 4
clear_untrusted_proxy_headers = false
max_request_body_size = 1073741824
[filter:translogger]
use = egg:Paste#translogger
setup_console_handler = False
[pipeline:main]
pipeline =
translogger
egg:Zope#httpexceptions
zope
# ...
@jensens you need to ask @petschki .
I'll try it out next week in Innsbruck.
We've been troubled by an issue for the past few months where all instances on our machines get stuck at 100% CPU usage. A typical machine has 5 instances and zeo, 2 CPUs (2.4GHz) and 8G RAM. We could avoid the issue by stopping all instances and then starting each one slowly, waiting for the CPU load to go down before starting the next one. Restarting all instances via supervisor reliably reproduced the issue. As soon as one instance got into this state, starting or restarting additional instances would cause them to also use all available CPU. Today we tried adding
http-fast-listen = off
to the buildout configuration for the instances and now we can restart all instances without any problem.

While investigating the issue we noticed that having a high CPU load when starting the instances made matters worse. For example, by running
dd if=/dev/zero of=/dev/null
three times in parallel, we could trigger the issue by starting 3 instances at the same time, rather than 5. We have also had this issue on a machine which runs 5 instances, each in a separate docker container.

I attempted to use the plone docker image to make a reproducible test case, by restricting the available CPU for the container and then creating a high load (with dd, as above) before starting the instance, but it was more difficult than I had hoped.
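For reference, the workaround described above might look like this in a buildout instance part (a sketch; the exact part name and surrounding options depend on the plone.recipe.zope2instance setup in use):

```
[instance1]
recipe = plone.recipe.zope2instance
http-address = 8081
# Disable the pre-bound fast-listen socket; waitress then only starts
# listening once the instance is actually ready to serve requests.
http-fast-listen = off
```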