SylphiaWindy opened this issue 6 years ago
epoll_wait(3, [{EPOLLOUT, {u32=170, u64=8589934762}}, {EPOLLOUT, {u32=370, u64=8589934962}}, {EPOLLOUT, {u32=491, u64=8589935083}}, {EPOLLOUT, {u32=361, u64=8589934953}}], 1023, 11) = 4
read(361, 0x2dc4213, 5) = -1 EAGAIN (Resource temporarily unavailable)
ioctl(361, FIONBIO, [1]) = 0
read(491, 0x5dee050, 11) = -1 EAGAIN (Resource temporarily unavailable)
ioctl(491, FIONBIO, [1]) = 0
read(370, 0x2e47d80, 11) = -1 EAGAIN (Resource temporarily unavailable)
ioctl(370, FIONBIO, [1]) = 0
read(170, 0x6230b30, 11) = -1 EAGAIN (Resource temporarily unavailable)
ioctl(170, FIONBIO, [1]) = 0
epoll_wait(3, [{EPOLLOUT, {u32=170, u64=8589934762}}, {EPOLLOUT, {u32=370, u64=8589934962}}, {EPOLLOUT, {u32=491, u64=8589935083}}, {EPOLLOUT, {u32=361, u64=8589934953}}], 1023, 10) = 4
read(361, 0x2dc4213, 5) = -1 EAGAIN (Resource temporarily unavailable)
ioctl(361, FIONBIO, [1]) = 0
read(491, 0x5dee050, 11) = -1 EAGAIN (Resource temporarily unavailable)
ioctl(491, FIONBIO, [1]) = 0
read(370, 0x2e47d80, 11) = -1 EAGAIN (Resource temporarily unavailable)
That's interesting. It's telling epoll that it wants to write (EPOLLOUT) to these file descriptors, but then when epoll says they're writable all it does is read. It also looks like it's trying to read 5 or 11 bytes at a time, unlike the 4096 in your code snippet. Is there something else going on here (maybe in pynsq)?
Are you using SSL? This kind of confusion about whether a stream is reading or writing is somewhat more common in SSL code although I haven't seen a bug like that in Tornado in a very long time.
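The busy loop described above can be reproduced in miniature with Python's selectors module (a standalone sketch, not pynsq or Tornado code): registering an idle socket for write readiness makes the poll return immediately on every iteration, and a read attempt then fails with EAGAIN, exactly as in the trace.

```python
import selectors
import socket

# A connected, non-blocking socket pair stands in for a client connection.
a, b = socket.socketpair()
a.setblocking(False)

sel = selectors.DefaultSelector()
# Registering for EVENT_WRITE on an idle socket: it is almost always
# writable, so select() returns immediately every time.
sel.register(a, selectors.EVENT_WRITE)

spins = 0
for _ in range(3):
    events = sel.select(timeout=0.01)
    for key, mask in events:
        # The stream then tries to *read*, which fails with EAGAIN
        # (BlockingIOError) because no data has arrived -- this is the
        # busy loop visible in the strace output.
        try:
            key.fileobj.recv(11)
        except BlockingIOError:
            spins += 1

print(spins)  # fired on every iteration without making progress
sel.close()
a.close()
b.close()
```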
Yes, we were using SSL, but even after we disabled SSL, the issue still occurred. Socket 370 was a normal app client connection, not from pynsq.
Is it possible that my code caused this issue?
It's always possible, especially since I haven't heard of anyone else having this problem. But if your code looks like what you've pasted above, it looks fine to me and this bug seems more likely to be somewhere inside the IOStream.
How could code cause this problem, i.e. requesting to write but then doing a read instead?
What I've done in my code was:
There were no 5- or 11-byte reads in my code.
Just tested: started the app server and made only one client connection.
epoll_wait(3, [{EPOLLOUT, {u32=17, u64=8589934609}}], 1023, 92) = 1
read(17, 0x3486240, 11) = -1 EAGAIN (Resource temporarily unavailable)
ioctl(17, FIONBIO, [1]) = 0
epoll_wait(3, [{EPOLLOUT, {u32=17, u64=8589934609}}], 1023, 92) = 1
read(17, 0x3486240, 11) = -1 EAGAIN (Resource temporarily unavailable)
ioctl(17, FIONBIO, [1]) = 0
How could code cause this problem, i.e. requesting to write but then doing a read instead?
I don't know. I think you'll just need to keep hacking on your application, trying to reduce the problem to the minimum reproducible code. You might try other tornado versions or other operating system versions to see if that makes a difference.
Ok, I'll report later.
Correction: with SSL disabled, a single client connection didn't reproduce this problem; it only occurred when SSL was enabled.
I found the cause of the problem.
I hadn't read the Tornado test code before, and it turns out that handle_stream is called just after the connection is made, before the SSL handshake. In my code, I sent a welcome message to the client in handle_stream, which caused the problem.
In the test code there is a test_wait_for_handshake_callback which seems to meet my needs.
But there's a question: in my design, the code in handle_stream doesn't care whether it is an SSL connection, and wait_for_handshake is an SSLIOStream method, not an override of an IOStream method. How about waiting for the SSL handshake internally, before handle_stream is called?
Ah, that explains it. In protocols like HTTP where the client sends the first message, you wouldn't see this, but it's possible in protocols that work the other way around. Nice work tracking this down.
Is there any reason why TCPServer shouldn't call wait_for_handshake for you before calling handle_stream? I guess one reason is that if you want to block an abusive IP address you'd want to have the ability to do that before going through the relatively expensive handshake process.
Instead of calling wait_for_handshake automatically, we could change the behavior of IOStream.write() when the handshake is not yet complete. The data passed should be buffered, and we shouldn't tell the event loop we want to write until the handshake is done. I actually thought we already did that for client-side connections.
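The buffering idea can be sketched as follows (a hypothetical stand-in class, not Tornado's actual IOStream): writes before the handshake are queued rather than registering EPOLLOUT interest, and are flushed once the handshake completes.

```python
class BufferedStream:
    """Sketch: buffer writes until the handshake completes, then flush."""

    def __init__(self):
        self.handshake_done = False
        self._pending = []
        self.sent = []  # stands in for data actually handed to the socket

    def write(self, data):
        if not self.handshake_done:
            # Don't tell the event loop we want EPOLLOUT yet;
            # just queue the data until the handshake finishes.
            self._pending.append(data)
        else:
            self.sent.append(data)

    def on_handshake_complete(self):
        self.handshake_done = True
        for data in self._pending:
            self.sent.append(data)
        self._pending.clear()


s = BufferedStream()
s.write(b"welcome")           # queued, not sent
print(s.sent)                 # []
s.on_handshake_complete()
print(s.sent)                 # [b'welcome']
```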
My mistake; as you said, I meant that TCPServer should wait for the SSL handshake to complete before calling handle_stream, as asyncio seems to do.
Yesterday, after I found the cause, the first version of my fix was:
async def handle_stream(self, io_stream, addr):
    async def handle_new_connection(stream, address):
        pass  # original handle_stream code goes here
    if isinstance(io_stream, SSLIOStream):
        await io_stream.wait_for_handshake()
    await handle_new_connection(io_stream, addr)
CPU usage went down, but it led to another big problem: too many open files.
I saw over 60k fds under /proc/*/fd, most of them socket fds, but netstat only showed ~100 established connections along with ~10 half-open connections. I didn't dig deeper into how this happened; I just made a second version of the fix:
async def handle_stream(self, io_stream, addr):
    DEFAULT_SSL_HANDSHAKE_TIMEOUT = 10

    async def handle_new_connection(stream, address):
        pass  # original handle_stream code goes here

    if isinstance(io_stream, SSLIOStream):
        try:
            co = io_stream.wait_for_handshake()
            await with_timeout(IOLoop.current().time() + DEFAULT_SSL_HANDSHAKE_TIMEOUT,
                               co, quiet_exceptions=(IOError, StreamClosedError))
        except TimeoutError:
            if not io_stream.closed():
                io_stream.close()
            return
    await handle_new_connection(io_stream, addr)
Problem solved.
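For reference, the same timeout-and-close pattern can be sketched with plain asyncio (hypothetical helper names, not Tornado's with_timeout API): if the handshake doesn't finish within the deadline, close the connection so its fd isn't leaked.

```python
import asyncio

async def guarded_handshake(handshake_coro, close, timeout):
    """Await a handshake, closing the connection on timeout (sketch only)."""
    try:
        await asyncio.wait_for(handshake_coro, timeout)
        return True
    except asyncio.TimeoutError:
        close()
        return False

closed = []
# A handshake that never completes in time stands in for a stalled client;
# the timeout is tiny here just to keep the demo fast.
ok = asyncio.run(guarded_handshake(asyncio.sleep(60),
                                   lambda: closed.append(True),
                                   timeout=0.05))
print(ok, closed)  # False [True]
```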
We have built an app with Tornado 4.5.3, along with pynsq 0.8.2 (which was not compatible with Tornado 5), running on CentOS 7.4 with the default kernel 3.10.0-693.21.1.el7.x86_64. CPU usage rose to 100% soon after the app started, with ~100 connections. We are using Tornado's TCPServer class and IOStream.read_bytes(4096, partial=True) to implement our async TCP server, something like:
There are many lines like
read(370, 0x2e47d80, 11) = -1 EAGAIN (Resource temporarily unavailable)
in the strace result, and the strace -c result was:
A Tornado issue?
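The server shape described here (an async TCP server reading up to 4096 bytes per call) can be sketched with plain asyncio as a stand-in; handle_stream and the echo logic below are illustrative, not the actual application code.

```python
import asyncio

async def handle_stream(reader, writer):
    # Read up to 4096 bytes per call, analogous to the
    # IOStream.read_bytes(4096, partial=True) loop described above.
    while True:
        data = await reader.read(4096)
        if not data:
            break
        writer.write(data)  # echo, standing in for the app's protocol handling
        await writer.drain()
    writer.close()

async def main():
    server = await asyncio.start_server(handle_stream, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    writer.write(b"hello")
    await writer.drain()
    echoed = await reader.read(4096)
    writer.close()
    server.close()
    await server.wait_closed()
    return echoed

result = asyncio.run(main())
print(result)  # b'hello'
```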