rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org

Using ToSocketAddrs seems to remember EMFILE on the same thread #47955

Open seanmonstar opened 6 years ago

seanmonstar commented 6 years ago

This was noticed in https://github.com/hyperium/hyper/issues/1422, where a user tried to open more connections than their maximum number of allowed file descriptors and saw the EMFILE error. From then on, every call to to_socket_addrs that required a DNS lookup failed on that thread. However, trying the same DNS lookup on a new thread worked fine.

I was able to reproduce this using just the standard library here:

use std::net::TcpStream;

fn main() {
    let cnt = 30_000; // adjust for your system
    let host = "localhost:3000"; // using "127.0.0.1:3000" doesn't have the same problem

    let mut sockets = Vec::with_capacity(cnt);
    for i in 0..cnt {
        match TcpStream::connect(host) {
            Ok(tcp) => sockets.push(tcp),
            Err(e) => {
                println!("error {} after {} connects", e, i);
                break;
            }
        }
    }

    drop(sockets);
    println!("closing all sockets");

    // sleep because why not
    ::std::thread::sleep(::std::time::Duration::from_secs(5));

    TcpStream::connect(host).unwrap();

    println!("end");
}

Just start up a local server, and try to run this program against it. Also, notice that if you change from "localhost" to "127.0.0.1", the issue doesn't show up.
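The observation that the same lookup succeeds on a new thread can be sketched as follows. This is only an illustration of the behavior described in this issue, not a recommended fix; `resolve_on_new_thread` is a hypothetical helper name, and spawning a thread per lookup is an assumption made purely for demonstration.

```rust
use std::io;
use std::net::{SocketAddr, ToSocketAddrs};
use std::thread;

/// Hypothetical helper: run the lookup on a freshly spawned thread so that
/// any per-thread resolver state left on the calling thread is not involved.
fn resolve_on_new_thread(host: &str) -> io::Result<Vec<SocketAddr>> {
    // The spawned closure must own its data, so copy the host string.
    let host = host.to_owned();
    thread::spawn(move || {
        host.to_socket_addrs()
            .map(|addrs| addrs.collect::<Vec<_>>())
    })
    .join()
    // A panic in the resolver thread is reported as an ordinary I/O error.
    .unwrap_or_else(|_| {
        Err(io::Error::new(
            io::ErrorKind::Other,
            "resolver thread panicked",
        ))
    })
}
```

In the reproducer, calling `resolve_on_new_thread(host)` after the EMFILE failure would be one way to check whether the failure is tied to the original thread.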

sfackler commented 6 years ago

Using 127.0.0.1 doesn't touch DNS at all, so it makes sense that you wouldn't see the issue with it.
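A minimal sketch of that difference, using only the standard library: a string that already parses as a socket address is returned directly, while a hostname such as "localhost:3000" is handed to the system resolver. The helper name `resolve` is hypothetical.

```rust
use std::io;
use std::net::{SocketAddr, ToSocketAddrs};

/// Hypothetical helper: collect every address that `host` resolves to.
fn resolve(host: &str) -> io::Result<Vec<SocketAddr>> {
    // "127.0.0.1:3000" parses as a SocketAddr, so no lookup is performed.
    // "localhost:3000" does not parse, so the system resolver is consulted.
    Ok(host.to_socket_addrs()?.collect())
}
```

With this helper, `resolve("127.0.0.1:3000")` yields exactly one address and never reaches the resolver, which is why the literal-IP variant of the reproducer keeps working.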

seanmonstar commented 6 years ago

Yeah, I was providing some instructions on how we isolated it to likely being related to DNS.

cuviper commented 6 years ago

FWIW, it appears to work fine for me on Fedora 27 (glibc 2.26). I started the hyper hello example under ulimit -n 4096, then ran the reproducer under the default limit of 1024:

error Device or resource busy (os error 16) after 1021 connects
closing all sockets
end

That's 16 == EBUSY, but strace shows me that the resolver indeed gets EMFILE trying to open /etc/hosts and the like. Still, I may not be reproducing the same problem or error recovery.

Also note that on_resolver_failure() has special behavior for glibc < 2.26. That behavior, or related bugs lurking in glibc, may be what hurts this case.

miquels commented 5 years ago

I have hit the same issue, and to test it I put together a small stand-alone reproducer: https://gist.github.com/miquels/c47316f7b19a0af3d9927bafef94de35

If I build and run this on Debian 9/stretch, it shows the buggy behaviour. If I then run the same binary on Debian 10/buster (aka "testing", not released yet) it works as expected.

Debian 9/stretch glibc version: 2.24
Debian 10/buster glibc version: 2.28

I ported the same reproducer to C, and sure enough, it shows the same behaviour. This is a bug in glibc that was fixed between 2.25 and 2.28.

If I set h_errno = 0 after I get an error, the problem goes away. As expected, it is some global state that is not reset.

I can fix this in the rust reproducer as well, by adding:

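    // glibc-specific: returns a pointer to the calling thread's h_errno.
    extern "C" { fn __h_errno_location() -> *mut i32; }
    // Reset h_errno after the failed lookup.
    unsafe { *__h_errno_location() = 0 }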

So if on_resolver_failure() added this workaround, it would probably solve this issue.
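As a rough sketch of how that workaround could be wrapped for use outside the standard library: the helper names `clear_h_errno` and `lookup_with_reset` are hypothetical, the `cfg` gating on Linux with glibc is an assumption, and this is not the actual implementation of on_resolver_failure().

```rust
use std::io;
use std::net::{SocketAddr, ToSocketAddrs};

#[cfg(all(target_os = "linux", target_env = "gnu"))]
extern "C" {
    // glibc-specific symbol: pointer to the calling thread's h_errno.
    fn __h_errno_location() -> *mut std::os::raw::c_int;
}

/// Hypothetical helper: reset h_errno on the current thread (glibc only).
#[cfg(all(target_os = "linux", target_env = "gnu"))]
fn clear_h_errno() {
    unsafe { *__h_errno_location() = 0 }
}

/// On other targets this sketch does nothing.
#[cfg(not(all(target_os = "linux", target_env = "gnu")))]
fn clear_h_errno() {}

/// Hypothetical wrapper: perform a lookup and, if it fails, clear h_errno
/// before returning the error so a later lookup on this thread starts clean.
fn lookup_with_reset(host: &str) -> io::Result<Vec<SocketAddr>> {
    match host.to_socket_addrs() {
        Ok(addrs) => Ok(addrs.collect()),
        Err(e) => {
            clear_h_errno();
            Err(e)
        }
    }
}
```

The reset happens only on the failure path, mirroring where on_resolver_failure() would run; whether that alone is sufficient on every affected glibc version is not established in this thread.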