Closed travisbell closed 5 months ago
This is an interesting error. Async provides asynchronous address resolution, but this shouldn't be used as we are just resolving the IP address of the inbound connection, an operation that shouldn't involve any DNS lookup at all.
In the first instance, we could consider rescuing any lookup errors and just bypassing the socket IP address. However, this isn't strictly a good idea.
Perhaps a better option would be to try and collect/log the circumstances which cause this error to occur so we can investigate further.
Let me see if I can reproduce the issue.
I have not tried to reproduce this yet, but I've checked the implementation.
This is the code invoked by Addrinfo#ip_address
:
/*
* call-seq:
* addrinfo.ip_address => string
*
* Returns the IP address as a string.
*
* Addrinfo.tcp("127.0.0.1", 80).ip_address #=> "127.0.0.1"
* Addrinfo.tcp("::1", 80).ip_address #=> "::1"
*/
static VALUE
addrinfo_ip_address(VALUE self)
{
rb_addrinfo_t *rai = get_addrinfo(self);
int family = ai_get_afamily(rai);
VALUE vflags;
VALUE ret;
if (!IS_IP_FAMILY(family))
rb_raise(rb_eSocket, "need IPv4 or IPv6 address");
vflags = INT2NUM(NI_NUMERICHOST|NI_NUMERICSERV);
ret = addrinfo_getnameinfo(1, &vflags, self);
return rb_ary_entry(ret, 0);
}
It calls addrinfo_getnameinfo
with NI_NUMERICHOST|NI_NUMERICSERV
flags which means it should not perform any actual DNS lookup.
/*
* call-seq:
* addrinfo.getnameinfo => [nodename, service]
* addrinfo.getnameinfo(flags) => [nodename, service]
*
* returns nodename and service as a pair of strings.
* This converts struct sockaddr in addrinfo to textual representation.
*
* flags should be bitwise OR of Socket::NI_??? constants.
*
* Addrinfo.tcp("127.0.0.1", 80).getnameinfo #=> ["localhost", "www"]
*
* Addrinfo.tcp("127.0.0.1", 80).getnameinfo(Socket::NI_NUMERICSERV)
* #=> ["localhost", "80"]
*/
static VALUE
addrinfo_getnameinfo(int argc, VALUE *argv, VALUE self)
{
rb_addrinfo_t *rai = get_addrinfo(self);
VALUE vflags;
char hbuf[1024], pbuf[1024];
int flags, error;
rb_scan_args(argc, argv, "01", &vflags);
flags = NIL_P(vflags) ? 0 : NUM2INT(vflags);
if (rai->socktype == SOCK_DGRAM)
flags |= NI_DGRAM;
error = rb_getnameinfo(&rai->addr.addr, rai->sockaddr_len,
hbuf, (socklen_t)sizeof(hbuf), pbuf, (socklen_t)sizeof(pbuf),
flags);
if (error) {
rsock_raise_resolution_error("getnameinfo", error);
}
return rb_assoc_new(rb_str_new2(hbuf), rb_str_new2(pbuf));
}
This code invokes rb_getnameinfo
:
static void *
nogvl_getnameinfo(void *arg)
{
struct getnameinfo_arg *ptr = arg;
return (void *)(VALUE)getnameinfo(ptr->sa, ptr->salen,
ptr->host, (socklen_t)ptr->hostlen,
ptr->serv, (socklen_t)ptr->servlen,
ptr->flags);
}
int
rb_getnameinfo(const struct sockaddr *sa, socklen_t salen,
char *host, size_t hostlen,
char *serv, size_t servlen, int flags)
{
struct getnameinfo_arg arg;
int ret;
arg.sa = sa;
arg.salen = salen;
arg.host = host;
arg.hostlen = hostlen;
arg.serv = serv;
arg.servlen = servlen;
arg.flags = flags;
ret = (int)(VALUE)rb_thread_call_without_gvl(nogvl_getnameinfo, &arg, RUBY_UBF_IO, 0);
return ret;
}
At first glance, this code looks a bit inefficient to me - numeric host/service should not incur any blocking overhead, but we still call it while releasing the GVL which doesn't make sense to me.
In addition, this is a simple operation which should directly read from the struct sockaddr
data, so EAI_AGAIN
(temporary failure in name resolution) is very odd. I need to find out under what circumstances that's occurring.
@travisbell are you able to show me the contents of /etc/nsswitch.conf
?
There is some useful discussion here: https://stackoverflow.com/questions/42041606/under-what-circumstances-can-getnameinfo-return-eai-again
@travisbell are you able to show me the contents of
/etc/nsswitch.conf
?
cat /etc/nsswitch.conf
# /etc/nsswitch.conf
#
# Example configuration of GNU Name Service Switch functionality.
# If you have the `glibc-doc-reference' and `info' packages installed, try:
# `info libc "Name Service Switch"' for information about this file.
passwd: files
group: files
shadow: files
gshadow: files
hosts: files dns
networks: files
protocols: db files
services: db files
ethers: db files
rpc: db files
netgroup: nis
Seems fairly stock, I think?
Thanks for the incredible debugging! FYI--I had to roll back Ruby 3.3, which means I also rolled back Falcon on this service for now (it's just how our builds are setup). Just an FYI, I didn't roll it back because of this issue, I was hit by this memory leak so I'm waiting for the back port and 3.3.1 release to give this all another go.
Hmm, it looks really normal. I wonder if the memory leak could potentially cause the name resolution to fail.
Thanks for trying it all out and giving feedback, that can be a lot of work, so it's really appreciated.
I'll investigate more on my end.
Do you mind giving some details about the host environment? Are you running on some kind of instance? Can you give details of the hardware?
I believe this is a bug in Ruby v3.3.0: https://bugs.ruby-lang.org/issues/20172
As such, there is nothing we can do but wait for Ruby v3.3.1 to be released with the fix.
No kidding, another 3.3.0 bug? Ok, well that makes it easy then!
Thanks again @ioquatix for all the debugging you did to figure this one out. 🙌🏼
Maybe you can report back if you try 3.3.1 so we can confirm it's sorted, but I guess in this case no news is good news :)
Hey Samuel,
I migrated one of our production apps to Falcon and started seeing this error pop up every few seconds (it's under moderate load):
This is Ruby 3.3 in Ubuntu 20.04.
It doesn't always throw it, it's like 1/100 requests or something. Any idea why the rack adapter would be throwing this from time to time? The DNS resolver we use in prod is an internal DNS address, local to our network within AWS.
Never seen this before until switching to Falcon.