Open seanmonstar opened 6 years ago
Using 127.0.0.1
doesn't touch DNS at all, so it makes sense that you wouldn't see the issue with it.
Yea, I was providing some instructions on how we isolated it to likely being related to DNS.
FWIW, it appears to work fine for me on Fedora 27 (glibc 2.26). I started the hyper hello
example under ulimit -n 4096
, then ran the reproducer under the default limit 1024:
error Device or resource busy (os error 16) after 1021 connects
closing all sockets
end
That's 16 == EBUSY
, but strace shows me that the resolver indeed gets EMFILE
trying to open /etc/hosts
and the like. Still, I may not be reproducing the same problem or error recovery.
Also note that on_resolver_failure()
has special behavior for glibc < 2.26. It may be that or related bugs lurking in glibc which hurt this case.
I have hit the same issue, and to test it I put together a small stand-alone reproducer: https://gist.github.com/miquels/c47316f7b19a0af3d9927bafef94de35
If I build and run this on debian 9/stretch, it shows the buggy behaviour. If I then run the same binary on debian 10/buster (aka "testing", not released yet) it works as expected.
debian 9/stretch glibc version: 2.24 debian 10/buster glibc version: 2.28
I ported the same reproducer to C, and sure enough, it shows the same behaviour. This is a bug in glibc that was fixed between 2.25 and 2.28.
If I set h_errno = 0
after I get an error, the problem goes away. As expected, it is some global state that is not reset.
I can fix this in the rust reproducer as well, by adding:
extern { fn __h_errno_location() -> *mut i32; }
unsafe { *__h_errno_location() = 0 }
So if on_resolver_failure() added this workaround, it would probably solve this issue.
This was noticed in https://github.com/hyperium/hyper/issues/1422, where a user tried to trigger more connections than their allowed max file descriptors, and saw the EMFILE error. It was then noticed that afterwards, every call to
to_socket_addrs
that requires a DNS lookup would fail from then on. However, trying the same DNS lookup on a new thread would work fine.I was able to reproduce this using just the standard library here:
Just start up a local server, and try to run this program against it. Also, notice that if you change from
"localhost"
to"127.0.0.1"
, the issue doesn't show up.