smol-rs / async-io

Async I/O and timers
Apache License 2.0
449 stars 67 forks source link

async-io reactor seems to cause I/O to hung with TLS #32

Closed sehz closed 4 years ago

sehz commented 4 years ago

I have a simple TLS proxy which is using futures_lite::io::copy to copy source to dest and dest to source. The test does 10 iteration of sending message to between client and server. However, only first iteration works, during second iteration, client send seems to be block

Oct 01 04:48:53.968 DEBUG proxy: client: sleep to give server chance to come up    
Oct 01 04:48:53.968 DEBUG proxy: server: binding    
Oct 01 04:48:53.968 DEBUG proxy: server: successfully binding. waiting for incoming    
Oct 01 04:48:54.168 DEBUG proxy: client: trying to connect    
Oct 01 04:48:54.177 DEBUG proxy: client: got connection. waiting    
Oct 01 04:48:54.177 DEBUG proxy: client: loop 0 sending test message    
Oct 01 04:48:54.178 DEBUG proxy: server: new incoming connection    
Oct 01 04:48:54.178 DEBUG proxy: server: loop 0, received from client: 8 bytes    
Oct 01 04:48:54.178 DEBUG proxy: sever: send back reply: message0reply    
Oct 01 04:48:54.178 DEBUG proxy: client: loop 0, received reply back    
Oct 01 04:48:54.178 DEBUG proxy: client: loop 1 sending test message  

To reproduce this:

git clone https://github.com/infinyon/flv-tls-proxy
cd flv-tls-proxy
RUST_LOG=proxy=debug  cargo test test_proxy

In trace mode, log shows following:

TRACE tokio_util::codec::framed_read: frame decoded from buffer    
Oct 01 04:52:52.420 DEBUG proxy: client: loop 0, received reply back    
Oct 01 04:52:52.420 DEBUG proxy: client: loop 1 sending test message    
Oct 01 04:52:52.420 TRACE tokio_util::codec::framed_write: flushing framed transport    
Oct 01 04:52:52.420 TRACE tokio_util::codec::framed_write: writing; remaining=8    
Oct 01 04:52:52.420 TRACE async_io::driver: main_loop: sleeping for 1000 us    
Oct 01 04:52:52.420 TRACE polling::epoll: interest: epoll_fd=3, fd=8, ev=Event { key: 2, readable: true, writable: false }    
Oct 01 04:52:52.420 TRACE polling::epoll: new events: epoll_fd=3, res=1    
Oct 01 04:52:52.420 TRACE tokio_util::codec::framed_write: framed transport flushed    
Oct 01 04:52:52.420 TRACE polling::epoll: interest: epoll_fd=3, fd=4, ev=Event { key: 18446744073709551615, readable: true, writable: false }    
Oct 01 04:52:52.420 TRACE tokio_util::codec::framed_read: attempting to decode a frame    
Oct 01 04:52:52.420 TRACE async_io::reactor: react: 1 ready wakers    
Oct 01 04:52:52.420 TRACE async_io::driver: block_on: stops hogging the reactor    
Oct 01 04:52:52.420 TRACE polling: Poller::notify()    
Oct 01 04:52:52.420 TRACE polling::epoll: notify: epoll_fd=3, event_fd=4    
Oct 01 04:52:52.420 TRACE async_io::driver: main_loop: notified    
Oct 01 04:52:52.420 TRACE async_io::driver: block_on: waiting on I/O    
Oct 01 04:52:52.420 TRACE async_io::driver: main_loop: sleeping for 50 us    
Oct 01 04:52:52.420 TRACE async_io::reactor: process_timers: 0 ready wakers    
Oct 01 04:52:52.420 TRACE polling::epoll: interest: epoll_fd=3, fd=7, ev=Event { key: 1, readable: true, writable: false }    
Oct 01 04:52:52.420 TRACE polling: Poller::wait(_, None)    
Oct 01 04:52:52.420 TRACE async_io::driver: block_on: sleep until notification    
Oct 01 04:52:52.420 TRACE polling::epoll: interest: epoll_fd=3, fd=9, ev=Event { key: 3, readable: true, writable: false }    
Oct 01 04:52:52.420 TRACE polling::epoll: wait: epoll_fd=3, timeout=None    
Oct 01 04:52:52.421 TRACE async_io::driver: main_loop: sleeping for 75 us    
Oct 01 04:52:52.421 TRACE polling::epoll: interest: epoll_fd=3, fd=5, ev=Event { key: 18446744073709551615, readable: true, writable: false }    
Oct 01 04:52:52.421 TRACE async_io::driver: block_on: sleep until notification    
Oct 01 04:52:52.421 TRACE polling::epoll: new events: epoll_fd=3, res=1    
Oct 01 04:52:52.421 TRACE polling::epoll: interest: epoll_fd=3, fd=4, ev=Event { key: 18446744073709551615, readable: true, writable: false }    
Oct 01 04:52:52.421 TRACE async_io::reactor: process_timers: 0 ready wakers    
Oct 01 04:52:52.421 TRACE async_io::driver: main_loop: sleeping for 100 us    
Oct 01 04:52:52.421 TRACE async_io::reactor: react: 0 ready wakers    
Oct 01 04:52:52.421 TRACE async_io::driver: block_on: stops hogging the reactor    
Oct 01 04:52:52.421 TRACE async_io::driver: main_loop: notified    
Oct 01 04:52:52.421 TRACE async_io::driver: main_loop: waiting on I/O    
Oct 01 04:52:52.421 TRACE async_io::reactor: process_timers: 0 ready wakers    
Oct 01 04:52:52.421 TRACE polling: Poller::wait(_, None)    
Oct 01 04:52:52.421 TRACE polling::epoll: wait: epoll_fd=3, timeout=None    
Oct 01 04:52:52.421 TRACE polling::epoll: interest: epoll_fd=3, fd=5, ev=Event { key: 18446744073709551615, readable: true, writable: false }  
ghost commented 4 years ago

The test client first does TCP connect, then TLS connect, and then wraps the stream in Framed. That seems correct to me.

But the test server first does TCP accept, then wraps the stream in Framed, and finally does TLS accept. But I think it should first do TLS accept and then wrap it in Framed. Is that the problem?

sehz commented 4 years ago

I removed framed from test code to reduce confusion. Note that server doesn't do any TLS, it's proxy that perform TLS termination and forward non TLS packets to server and reverse send back to client

ghost commented 4 years ago

Thank you, that helps a bit. Still, I feel lost in the code - there's a lot going on and I find it hard to identify where the problematic part actually is. Do you think we could shrink the code further, hopefully so that it fits into a single file? Also, can we replace fluvio-future with async-net?

ghost commented 4 years ago

Also, I've been running into some issues here: https://github.com/nats-io/nats.rs/pull/70#issuecomment-702246138

I wonder if this is a problem with rustls. Maybe it'd be worth trying to replace async-tls with async-native-tls.

sehz commented 4 years ago

I doubt this is issue with rustls. Similar test works fine without proxy (between TLS server and client).

I have update the test code to use async-net and copy the code from library to make easier to understand

ghost commented 4 years ago

I replaced async-net with tokio and am getting the exact same behavior, which makes me believe it is indeed an issue with rustls, or maybe even async-tls. :)

https://github.com/stjepang/flv-tls-proxy/tree/tokio

sehz commented 4 years ago

That's interesting. Or maybe it's issue with copy method?

sehz commented 4 years ago

Here is code for TlsStream: https://github.com/async-rs/async-tls/blob/master/src/server.rs. Anything obvious? I can open issue on that repo

ghost commented 4 years ago

It's not the copy method. I just switched to tokio::io::copy() and nothing changed.

ghost commented 4 years ago

By the way, what OS are you using? I'm on macOS.

sehz commented 4 years ago

This was tested on linux ubuntu but I also tested on Mac as well

ghost commented 4 years ago

I'm going to close this since the issue seems to be somewhere else, but keep me posted if you find something new!

sehz commented 4 years ago

After trying various combinations, found out that rustls or async-tls freezes when spawn is used. It works when zip is used. Unfortunately, other parts of code needs to use spawn.

Fortunately, async-native-tls works.

Thanks for troubleshooting this issue.