Socket::ResolutionError: getnameinfo: Temporary failure in name resolution

travisbell commented 5 months ago

Hey Samuel,

I migrated one of our production apps to Falcon and started seeing this error pop up every few seconds (it's under moderate load):

Socket::ResolutionError: getnameinfo: Temporary failure in name resolution
    /usr/local/bundle/gems/protocol-rack-0.4.1/lib/protocol/rack/adapter/generic.rb:86:in `ip_address'
    /usr/local/bundle/gems/protocol-rack-0.4.1/lib/protocol/rack/adapter/generic.rb:86:in `unwrap_request'
    /usr/local/bundle/gems/protocol-rack-0.4.1/lib/protocol/rack/adapter/rack2.rb:68:in `make_environment'
    /usr/local/bundle/gems/protocol-rack-0.4.1/lib/protocol/rack/adapter/rack2.rb:77:in `call'
    /usr/local/bundle/gems/protocol-http-0.25.0/lib/protocol/http/middleware.rb:43:in `call'
    /usr/local/bundle/gems/protocol-rack-0.4.1/lib/protocol/rack/rewindable.rb:58:in `call'
    /usr/local/bundle/gems/protocol-http-0.25.0/lib/protocol/http/middleware.rb:43:in `call'
    /usr/local/bundle/gems/protocol-http-0.25.0/lib/protocol/http/content_encoding.rb:29:in `call'
    /usr/local/bundle/gems/protocol-http-0.25.0/lib/protocol/http/middleware.rb:43:in `call'
    /usr/local/bundle/gems/async-http-0.61.0/lib/async/http/server.rb:51:in `block in accept'
    /usr/local/bundle/gems/async-http-0.61.0/lib/async/http/protocol/http1/server.rb:50:in `each'
    /usr/local/bundle/gems/async-http-0.61.0/lib/async/http/server.rb:40:in `accept'
    /usr/local/bundle/gems/async-io-1.38.1/lib/async/io/server.rb:15:in `block in accept_each'
    /usr/local/bundle/gems/async-io-1.38.1/lib/async/io/socket.rb:58:in `block in accept'
    /usr/local/bundle/gems/async-2.8.0/lib/async/task.rb:161:in `block in run'
    /usr/local/bundle/gems/async-2.8.0/lib/async/task.rb:331:in `block in schedule'

This is Ruby 3.3 in Ubuntu 20.04.

It doesn't always throw it, it's like 1/100 requests or something. Any idea why the rack adapter would be throwing this from time to time? The DNS resolver we use in prod is an internal DNS address, local to our network within AWS.

Never seen this before until switching to Falcon.

ioquatix commented 5 months ago

This is an interesting error. Async provides asynchronous address resolution, but this shouldn't be used as we are just resolving the IP address of the inbound connection, an operation that shouldn't involve any DNS lookup at all.

In the first instance, we could consider rescuing any lookup errors and just bypassing the socket IP address. However, this isn't strictly a good idea.

Perhaps a better option would be to try and collect/log the circumstances which cause this error to occur so we can investigate further.

Let me see if I can reproduce the issue.

ioquatix commented 5 months ago

I have not tried to reproduce this yet, but I've checked the implementation.

This is the code invoked by Addrinfo#ip_address:

/*
 * call-seq:
 *   addrinfo.ip_address => string
 *
 * Returns the IP address as a string.
 *
 *   Addrinfo.tcp("127.0.0.1", 80).ip_address    #=> "127.0.0.1"
 *   Addrinfo.tcp("::1", 80).ip_address          #=> "::1"
 */
static VALUE
addrinfo_ip_address(VALUE self)
{
    rb_addrinfo_t *rai = get_addrinfo(self);
    int family = ai_get_afamily(rai);
    VALUE vflags;
    VALUE ret;

    if (!IS_IP_FAMILY(family))
        rb_raise(rb_eSocket, "need IPv4 or IPv6 address");

    vflags = INT2NUM(NI_NUMERICHOST|NI_NUMERICSERV);
    ret = addrinfo_getnameinfo(1, &vflags, self);
    return rb_ary_entry(ret, 0);
}

It calls addrinfo_getnameinfo with NI_NUMERICHOST|NI_NUMERICSERV flags which means it should not perform any actual DNS lookup.

/*
 * call-seq:
 *   addrinfo.getnameinfo        => [nodename, service]
 *   addrinfo.getnameinfo(flags) => [nodename, service]
 *
 * returns nodename and service as a pair of strings.
 * This converts struct sockaddr in addrinfo to textual representation.
 *
 * flags should be bitwise OR of Socket::NI_??? constants.
 *
 *   Addrinfo.tcp("127.0.0.1", 80).getnameinfo #=> ["localhost", "www"]
 *
 *   Addrinfo.tcp("127.0.0.1", 80).getnameinfo(Socket::NI_NUMERICSERV)
 *   #=> ["localhost", "80"]
 */
static VALUE
addrinfo_getnameinfo(int argc, VALUE *argv, VALUE self)
{
    rb_addrinfo_t *rai = get_addrinfo(self);
    VALUE vflags;
    char hbuf[1024], pbuf[1024];
    int flags, error;

    rb_scan_args(argc, argv, "01", &vflags);

    flags = NIL_P(vflags) ? 0 : NUM2INT(vflags);

    if (rai->socktype == SOCK_DGRAM)
        flags |= NI_DGRAM;

    error = rb_getnameinfo(&rai->addr.addr, rai->sockaddr_len,
                           hbuf, (socklen_t)sizeof(hbuf), pbuf, (socklen_t)sizeof(pbuf),
                           flags);
    if (error) {
        rsock_raise_resolution_error("getnameinfo", error);
    }

    return rb_assoc_new(rb_str_new2(hbuf), rb_str_new2(pbuf));
}

This code invokes rb_getnameinfo:

static void *
nogvl_getnameinfo(void *arg)
{
    struct getnameinfo_arg *ptr = arg;
    return (void *)(VALUE)getnameinfo(ptr->sa, ptr->salen,
                                      ptr->host, (socklen_t)ptr->hostlen,
                                      ptr->serv, (socklen_t)ptr->servlen,
                                      ptr->flags);
}
int
rb_getnameinfo(const struct sockaddr *sa, socklen_t salen,
               char *host, size_t hostlen,
               char *serv, size_t servlen, int flags)
{
    struct getnameinfo_arg arg;
    int ret;
    arg.sa = sa;
    arg.salen = salen;
    arg.host = host;
    arg.hostlen = hostlen;
    arg.serv = serv;
    arg.servlen = servlen;
    arg.flags = flags;
    ret = (int)(VALUE)rb_thread_call_without_gvl(nogvl_getnameinfo, &arg, RUBY_UBF_IO, 0);
    return ret;
}

At first glance, this code looks a bit inefficient to me - numeric host/service should not incur any blocking overhead, but we still call it while releasing the GVL which doesn't make sense to me.

In addition, this is a simple operation which should directly read from the struct sockaddr data, so EAI_AGAIN (temporary failure in name resolution) is very odd. I need to find out under what circumstances that's occurring.

ioquatix commented 5 months ago

@travisbell are you able to show me the contents of /etc/nsswitch.conf?

ioquatix commented 5 months ago

There is some useful discussion here: https://stackoverflow.com/questions/42041606/under-what-circumstances-can-getnameinfo-return-eai-again

travisbell commented 5 months ago

@travisbell are you able to show me the contents of /etc/nsswitch.conf?

cat /etc/nsswitch.conf

# /etc/nsswitch.conf
#
# Example configuration of GNU Name Service Switch functionality.
# If you have the `glibc-doc-reference' and `info' packages installed, try:
# `info libc "Name Service Switch"' for information about this file.

passwd:         files
group:          files
shadow:         files
gshadow:        files

hosts:          files dns
networks:       files

protocols:      db files
services:       db files
ethers:         db files
rpc:            db files

netgroup:       nis

Seems fairly stock, I think?

Thanks for the incredible debugging! FYI--I had to roll back Ruby 3.3, which means I also rolled back Falcon on this service for now (it's just how our builds are setup). Just an FYI, I didn't roll it back because of this issue, I was hit by this memory leak so I'm waiting for the back port and 3.3.1 release to give this all another go.

ioquatix commented 5 months ago

Hmm, it looks really normal. I wonder if the memory leak could potentially cause the name resolution to fail.

Thanks for trying it all out and giving feedback, that can be a lot of work, so it's really appreciated.

I'll investigate more on my end.

ioquatix commented 5 months ago

Do you mind giving some details about the host environment? Are you running on some kind of instance? Can you give details of the hardware?

ioquatix commented 5 months ago

I believe this is a bug in Ruby v3.3.0: https://bugs.ruby-lang.org/issues/20172

As such, there is nothing we can do but wait for Ruby v3.3.1 to be released with the fix.

travisbell commented 5 months ago

No kidding, another 3.3.0 bug? Ok, well that makes it easy then!

Thanks again @ioquatix for all the debugging you did to figure this one out. 🙌🏼

ioquatix commented 5 months ago

Maybe you can report back if you try 3.3.1 so we can confirm it's sorted, but I guess in this case no news is good news :)

socketry / falcon

Socket::ResolutionError: getnameinfo: Temporary failure in name resolution #217