Upstream timeout leads to mixed up responses

meilke commented 3 years ago

I noticed that from time to time I get wrong responses to my HTTP requests. In one of our UI's some users all of a sudden would get HTML responses that originated from other users and that were not looking like the data they expected, obviously.

After looking into the openresty logs I would see 524 responses near the actual problematic requests (that were receiving the wrong response data). The 524 did not happen in the connection establishing but in the actual request being done by the HTTP client. The wrong response data came from these timed out requests. So it seems like the timed out request is still going on and when data arrives it might get "re-used" elsewhere.

A similar problem was reported in lua-resty-http: https://github.com/ledgetech/lua-resty-http/issues/138

I found that I can solve the issue by calling client:close() after generating the 524 but that might not be enough. I could follow up with a PR but maybe there is a different solution out there.

My setup:

v2.3.0 (installed via this page's releases download)
openresty v1.17.8.2 (but it also happens with the current one)
lua-ffi-zlib v0.5-0
lua-resty-cookie v0.1.0-1
lua-resty-http v0.16.1-0
lua-resty-qless v0.11-0
lua-resty-redis-connector v0.10-0
I don't use any caching, I am only in it for the ESI (aka upstream has no cache tags)

My config:

daemon off;

pid /var/run/nginx.pid;

error_log /dev/stderr;

events {
    worker_connections 1024;
}
# from https://github.com/ledgetech/ledge/issues/194
stream {
    resolver local=on ipv6=off;
    resolver_timeout 5s;

    lua_add_variable $redis_host;
    preread_by_lua_block {
        ngx.var.redis_host = "my.redis.host"
    }

    server {
        listen 6379;
        proxy_pass $redis_host:6379;
    }
}

http {
    lua_package_path "/opt/lua/?.lua;;";
    if_modified_since off;
    lua_check_client_abort on;

    init_by_lua_block {
        function set_surrogate_response_header(res)
            local status = res.status
            if not res.header["Surrogate-Control"] and not (status > 300 and status < 303) then
                res.header["Surrogate-Control"] = 'content="ESI/1.0"'
            end
        end
        require("ledge").configure({
            redis_connector_params = {
                host = "127.0.0.1",
                port = 6379,
                db = 5,
            },
            qless_db = 4,
        })
        require("ledge").set_handler_defaults({
            upstream_port = 80,
        })
    }

    init_worker_by_lua_block {
        require("ledge").create_worker():run()
    }

    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for" '
                      '$request_time';

    access_log /dev/stdout main;

    resolver 127.0.0.11 valid=30s;
    auth_basic "restricted";
    auth_basic_user_file /path/to/file

    server {
        listen 80 default_server;
        server_name _;

        client_max_body_size   0;

        proxy_connect_timeout  600;
        proxy_send_timeout     1800;
        proxy_read_timeout     1800;
        send_timeout           1800;
        location / {
            content_by_lua_block {
                local handler = require("ledge").create_handler({
                    upstream_host = "main-upstream.host",
                    esi_enabled = true,
                })
                handler:bind("after_upstream_request", set_surrogate_response_header)
                handler:run()
            }
        }
        location /other {
            content_by_lua_block {
                local handler = require("ledge").create_handler({
                    upstream_host = "other.upstream.host",
                    upstream_port = 80,
                    esi_enabled = true,
                })
                handler:bind("after_upstream_request", set_surrogate_response_header)
                handler:run()
            }
        }
        location /non-esi/ {
            proxy_pass http://non-esi.host/;

            proxy_set_header    Host               $host;
            proxy_set_header    X-Real-IP          $remote_addr;
            proxy_set_header    X-Forwarded-For    $proxy_add_x_forwarded_for;
            proxy_set_header    X-Forwarded-Host   $http_host/non-esi;
            proxy_set_header    X-Forwarded-Server $host;
            proxy_set_header    X-Forwarded-Port   $server_port;
            proxy_set_header    X-Forwarded-Proto  $scheme;
            proxy_set_header    Remote-User        $remote_user;
        }

        location /health {
            return 200;
        }
    }
}

pintsized commented 3 years ago

Hey, thanks for the detailed report. I think the solution would be to ensure the state machine goes through aborting like it would when the client connection aborts. But I'm a bit unsure because I thought that calling setkeepalive on a connection where data remains to be read on the wire will be safely denied and closed automatically - that is, it won't be reused. Whereas your symptoms suggest otherwise.

I'll try to write some tests that provoke the situation, hopefully soon. Sorry, I realise this is actually a pretty serious issue. If you're able to get a test case to demonstrate the issue that would be great.

meilke commented 3 years ago

I never tried to get the test cases involved, I only ever used the library in an actual openresty installation. I can reproduce it fairly easy by setting a ridiculously low value to upstream_read_timeout and in parallel or short after doing a lot of other requests. Our UI makes this easy as there is a frameset involved that refreshes a few frames from HTTP requests (that then show the wrong part in a wrong frame).

meilke commented 3 years ago

I could reproduce the issue at #200 in a test case (without my - probably - incorrect fix, of course). I have one timed out request and three other requests where the timed out request response shows up later for a different request.

oligond commented 3 years ago

Hi @pintsized, as my colleague has moved on in the meantime, I'm trying to take over here. Did you find @meilke's suggested test case helpful so far? Do you see anything else we could contribute atm for advancing this issue? Thx!

pintsized commented 3 years ago

Hey, at this stage no, but thanks. The PR is super useful, I just need a little time to run it myself and figure out what the best place for a fix is (and to understand what detail I'm missing that led to this issue). Your best short term workaround is a longer timeout, because I'm pretty sure this can only happen when our socket timeout is exhausted (i.e. it can't happen if the downstream client disconnects or if the origin gives up before our timeout).

I'll try to get to it asap.

ledgetech / ledge

Upstream timeout leads to mixed up responses #199