jsierles opened this issue 4 years ago
This appears to be happening because the worker connection times out, so I'll look into passing a connection timeout there.
This turned out not to be the case: setting a high timeout for both connections does not help. So I ended up proxying Redis through an OpenResty stream block locally. That shouldn't be necessary, though. How can I debug this issue further?
Can you post a minimal yet complete configuration that replicates the issue? Obviously, if something this fundamental weren't working we'd know about it, so it's most likely a configuration detail.
Specifically you're getting "connection refused" (not timed out), so I'd be looking into why OpenResty can't see your redis host. Are you using hostnames or literal IPs?
We've tried with direct IPs and with hostnames. It's definitely not a hostname issue, since we use these same variables elsewhere in the config with resolver local=on. If it weren't able to resolve, we'd see a different error (and we have tried invalid hostnames to test that).
Once we switch to the localhost proxy, it works fine. This is inside a Kubernetes cluster inside AWS EKS, if that matters.
user www-data;

# Automatically scale processes based on detected CPU count
worker_processes auto;

# Redis for ledge cache storage
env REDIS_HOST;
env RACK;

error_log stderr debug;
pid /var/run/nginx.pid;

events {
    worker_connections 1024;
}

http {
    include mime.types;
    default_type application/octet-stream;

    access_log /dev/stdout;

    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 65;

    gzip on;
    gzip_http_version 1.0;
    gzip_comp_level 2;
    gzip_proxied any;
    gzip_vary off;
    gzip_types text/plain text/css application/x-javascript text/xml application/xml application/rss+xml application/atom+xml text/javascript application/javascript application/json text/mathml;
    gzip_min_length 1000;
    gzip_disable "MSIE [1-6]\.";

    server_names_hash_bucket_size 64;
    types_hash_max_size 2048;
    types_hash_bucket_size 64;
    client_max_body_size 250m;

    lua_shared_dict my_locks 100k;
    lua_package_path "/etc/nginx/conf.d/?.lua;./lua/?.lua;$prefix/conf/?.lua;$prefix/conf.d/?.lua;/usr/local/lib/lua/ledge/?.lua;/usr/local/openresty/site/lualib/?.lua;/usr/local/openresty/site/lualib/resty/?.lua;/usr/local/openresty/site/lualib/resty/qless/?.lua;;";

    resolver local=on ipv6=off;
    resolver_timeout 5s;

    if_modified_since Off;
    lua_check_client_abort On;

    init_by_lua_block {
        local ledge = require "ledge"

        local upstream_host = "web.rails." .. os.getenv("RACK") .. ".local"
        local redis_host = os.getenv("REDIS_HOST")

        ledge.configure({
            redis_connector_params = {
                url = "redis://" .. redis_host .. ":6379",
                connect_timeout = 1000
            }
        })

        ledge.set_handler_defaults({
            upstream_host = upstream_host,
            upstream_port = 80
        })
    }

    init_worker_by_lua_block {
        require("ledge").create_worker():run()
    }

    server {
        listen 80;

        location / {
            content_by_lua_block {
                local handler = require("ledge").create_handler()
                handler:run()
            }
        }
    }
}
And you're getting "connection refused" only on the background worker connections, not the in-flight ones?
On both connections. You can't really tell from the logs, but even removing the background worker leads to this error.
Here's the config we use for proxying to Redis. This works, but of course is an extra step we'd love to avoid.
stream {
    resolver local=on ipv6=off;
    resolver_timeout 5s;

    lua_add_variable $redis_host;
    preread_by_lua_block { ngx.var.redis_host = os.getenv("REDIS_HOST") }

    server {
        listen 6379;
        proxy_pass $redis_host:6379;
    }
}
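For completeness, routing ledge through that proxy only requires changing the connector URL to the loopback address. A minimal sketch, assuming the stream block above is listening on 6379 on the same box:

```nginx
init_by_lua_block {
    -- Point ledge at the local stream proxy instead of the remote host;
    -- the stream block above forwards 127.0.0.1:6379 on to $REDIS_HOST:6379.
    require("ledge").configure({
        redis_connector_params = {
            url = "redis://127.0.0.1:6379",
            connect_timeout = 1000
        }
    })
}
```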
> Here's the config we use for proxying to Redis. This works, but of course is an extra step we'd love to avoid.
Yeah, it really shouldn't be necessary.
Are all connections failing, or is it in any way intermittent over time?
Nothing is jumping out at me from your config. But remember, there's no magic here: in the end, whatever you specify for host and port ends up in tcpsock:connect.
Can you try a super minimal content_by_lua_block that connects to your host and port manually?
Then the next layer up is lua-resty-redis-connector; again, a quick manual experiment with your config should show where it's failing.
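A quick way to test that layer in isolation, sketched here against the documented lua-resty-redis-connector API (the /redis-test location name is illustrative; REDIS_HOST is already exported via env in the config above):

```nginx
location /redis-test {
    content_by_lua_block {
        -- Hand the same params straight to lua-resty-redis-connector
        -- and report exactly what comes back.
        local rc = require("resty.redis.connector").new({
            url = "redis://" .. os.getenv("REDIS_HOST") .. ":6379",
            connect_timeout = 1000
        })
        local redis, err = rc:connect()
        if not redis then
            ngx.say("connector failed: ", err)
            return
        end
        ngx.say("connector OK: ", redis:ping())
        rc:set_keepalive(redis)
    }
}
```

If this fails with the same "connection refused" while the bare cosocket test succeeds, the problem sits in the connector configuration rather than the network.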
Bottom line: if your connection is being refused, it's because tcpsock:connect(host, port) is returning nil and an error string.
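A minimal sketch of that experiment, assuming it's dropped into the existing server block (the /socket-test location name is illustrative):

```nginx
location /socket-test {
    content_by_lua_block {
        -- Bare cosocket connect: if this fails, the problem sits below
        -- ledge and lua-resty-redis-connector entirely.
        local sock = ngx.socket.tcp()
        sock:settimeout(1000)  -- ms
        local ok, err = sock:connect(os.getenv("REDIS_HOST"), 6379)
        if not ok then
            ngx.say("connect failed: ", err)
            return
        end
        ngx.say("connect OK")
        sock:close()
    }
}
```

Note that hostname resolution here goes through the resolver directive, so this also exercises the DNS path the connector uses.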
Fair enough, we will test that. Meanwhile, when connecting through the proxy: I believe keepalive is not supported by default, so we may be creating more connections than we should, and we're seeing Lua Redis connect timeouts there as well. Any tips on improving that situation?
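For what it's worth, lua-resty-redis-connector does accept connection-pooling options in its params table. A hedged sketch (parameter names per its README; the values are illustrative and should be tuned to your traffic):

```nginx
init_by_lua_block {
    require("ledge").configure({
        redis_connector_params = {
            url = "redis://127.0.0.1:6379",
            connect_timeout = 1000,
            -- Reuse connections instead of opening a new one per request.
            keepalive_timeout = 60000,  -- ms to hold an idle connection
            keepalive_poolsize = 30     -- idle connections kept per worker
        }
    })
}
```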
I can reproduce this as well, either with the inline stream method or with a local haproxy proxy. That's the only way I can get it working on a Docker Compose network or in Kubernetes. In Kubernetes even the upstream hosts have to be FQDNs; Docker Compose lets me get away with short host names. There are definitely DNS issues, and there seems to be a difference in name resolution between the settings used for the upstream hosts and those used for the Redis host.
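One thing that may be worth trying on Kubernetes: cosocket hostname lookups go through nginx's resolver directive, so pointing it explicitly at the cluster DNS service and forcing periodic re-resolution can behave differently from local=on. A hypothetical sketch; 10.96.0.10 is only an assumed kube-dns ClusterIP, so verify it with kubectl get svc -n kube-system kube-dns first:

```nginx
# Hypothetical: replace "resolver local=on" with the cluster DNS service IP.
# 10.96.0.10 is an assumption; substitute your cluster's kube-dns ClusterIP.
resolver 10.96.0.10 valid=30s ipv6=off;
resolver_timeout 5s;
```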
With the following config, I consistently see a "Connection refused" error. If I point CACHE_URL at a localhost Redis, it works. I've tried higher timeout settings, and I'm able to connect to Redis instantly from the command line with redis-cli.