lpereira / lwan

Experimental, scalable, high performance HTTP server
https://lwan.ws
GNU General Public License v2.0

Possible memory leak using lwan as a library (Ubuntu EC2 instance) #206

Closed juntalis closed 7 years ago

juntalis commented 7 years ago

I say possible, as I'm not quite familiar enough with lwan's internals yet to know whether it's a mistake I'm making or a part of the design that I'm not aware of.

I started encountering OOM crashes after increasing the duration of my load tests, so after failing to find any cause in my code, I threw together the following sample application, compiled with the same Release flags as lwan:

#include <stdlib.h>
#include <string.h>
#include <time.h>

#include "lwan.h"
#include "lwan-config.h"

static const char hello_world[] = "Hello, World!";

static enum lwan_http_status
plaintext(struct lwan_request *request __attribute__((unused)),
          struct lwan_response *response,
          void *data __attribute__((unused)))
{
    strbuf_set_static(response->buffer, hello_world, sizeof(hello_world) - 1);
    response->mime_type = "text/plain";
    return HTTP_OK;
}

int main(void)
{
    static const struct lwan_url_map url_map[] = {
        { .prefix = "/plaintext", .handler = plaintext },
        { .prefix = NULL }
    };
    struct lwan l;
    lwan_init(&l);
    srand((unsigned int)time(NULL));
    lwan_set_url_map(&l, url_map);
    lwan_main_loop(&l);
    lwan_shutdown(&l);
    return 0;
}

And the corresponding .conf file:

threads = 16
quiet = true

listener *:8080 {
}

To replicate the issue, I ran a 5 minute load test with wrk from another EC2 instance:

./wrk -t14 -c1000 -d5m http://10.0.10.211:8080/plaintext --latency --timeout 10s
Running 5m test @ http://10.0.10.211:8080/plaintext
  14 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    13.76ms    5.36ms 646.03ms   97.55%
    Req/Sec     5.19k   516.29    28.45k    84.24%
  Latency Distribution
     50%   13.56ms
     75%   14.51ms
     90%   15.24ms
     99%   21.76ms
  21703979 requests in 5.00m, 3.97GB read
  Socket errors: connect 0, read 3799, write 0, timeout 0
Requests/sec:  72324.16
Transfer/sec:     13.53MB

By the end of the 5 minutes, the lwan-sample process's memory usage was up to 5.6GB. I waited for 5 minutes just in case there was a deferred cleanup cycle I was unaware of, then hit the server with another 5 minute load test in order to grab the syslog of the crash when it exceeded 8GB: syslog

I spent some time today trying to track the issue down, but didn't get much further. The only scenario I found that didn't result in memory consumption issues was running the same load test against the lwan executable's serve_files handler. In that case, memory usage capped out at around 2.5GB~3GB, and dropped back to its starting point when the load test completed.

lpereira commented 7 years ago

Thanks for the bug report! I can't reproduce it on my machine. I don't have an Amazon EC2 instance, but I can install an Ubuntu VM to try to reproduce this; what's the distribution version?

juntalis commented 7 years ago

Thanks for the quick response. Distro details are:

ubuntu@benchmarker:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 16.04.3 LTS
Release:        16.04
Codename:       xenial

Was also able to replicate it on a local Linux Mint VM:

charles@mint-vm ~ $ lsb_release -a
No LSB modules are available.
Distributor ID: LinuxMint
Description:    Linux Mint 18.2 Sonya
Release:    18.2
Codename:   sonya

and an Amazon Linux instance.

juntalis commented 7 years ago

Shit - just realized I missed an important detail: I had been running the same command from history and didn't notice, but I've actually got a --script lwan-test.lua at the end of my wrk command. (see below for lwan-test.lua)

-- Thread tracking
local counter = 1
local threads = {}

-- Set common headers
wrk.headers['Connection'] = 'keep-alive'
wrk.headers['Pragma'] = 'no-cache'
wrk.headers['Cache-Control'] = 'no-cache'
wrk.headers['Origin'] = 'http://google.com'
wrk.headers['Referer'] = 'http://google.com/s?q=foobar'
wrk.headers['Accept-Encoding'] = 'gzip, deflate, sdch, br'

-- User-Agent pool (not actually a class - just namespaced)
local user_agents = {
    [[MT6572/V1 Linux/3.4.5 Android/4.2.2 Release/08.07.2013 Browser/AppleWebKit534.30 Profile/MIDP-2.0 Configuration/CLDC-1.1 Mobile Safari/534.30 Android 4.2.2;]],
    [[Mozilla/4.0 (compatible; MSIE 10.0; Windows NT 6.1)]],
    [[Mozilla/4.0 (compatible; MSIE 10.0; Windows NT 6.1; 360SE)]],
    [[Mozilla/4.0 (compatible; MSIE 11.0; Windows NT 6.2; 360SE)]],
    [[Mozilla/4.0 (compatible; MSIE 4.01; Windows NT)]],
    [[Mozilla/4.0 (compatible; MSIE 4.01; Windows NT; MS Search 6.0 Robot)]],
    [[Mozilla/4.0 (compatible; MSIE 5.00; Windows 98)]],
    [[Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; Tablet PC 2.0; .NET4.0E; .NET CLR 1.1.4322)]],
    [[Mozilla/5.0 (BB10; Kbd) AppleWebKit/537.10+ (KHTML, like Gecko) Version/10.1.0.1720 Mobile Safari/537.10+]],
    [[Mozilla/5.0 (BB10; Touch) AppleWebKit/537.35+ (KHTML, like Gecko) Version/10.2.1.3247 Mobile Safari/537.35+]],
    [[Mozilla/5.0 (BlackBerry; U; BlackBerry 9220; es) AppleWebKit/534.11+ (KHTML, like Gecko) Version/7.1.0.398 Mobile Safari/534.11+]],
    [[Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 5 Build/JOP40D) AppleWebKit/535.19 (KHTML, like Gecko; googleweblight) Chrome/38.0.1025.166 Mobile Safari/535.19]],
    [[Mozilla/5.0 (Linux; Android 4.2.2; AFTB Build/JDQ39) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.173 Mobile Safari/537.22]],
    [[Mozilla/5.0 (Linux; Android 4.2.2; AT7-A Build/JDQ39) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.98 Safari/537.36]],
    [[Mozilla/5.0 (Linux; Android 4.2.2; C2105 Build/15.3.A.1.17) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.95 Mobile Safari/537.36]],
    [[Mozilla/5.0 (Linux; Android 4.2.2; Celkon A107+ Build/JDQ39) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.59 Mobile Safari/537.36]],
    [[Mozilla/5.0 (Linux; Android 4.2.2; GT-I9105P Build/JDQ39) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.133 Mobile Safari/537.36]],
    [[Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36]]
}

local function get_user_agent()
    local result, index
    while result == nil do
        local ua_count = #user_agents
        if ua_count == 0 then
            index = 1
            result = [[Mozilla/4.0 (compatible; MSIE 10.0; Windows NT 6.1)]]
        else
            index = math.random(1, ua_count)
            result = user_agents[index]
        end
    end
    user_agents[index] = nil
    return result
end

function setup(thread)
    local url = '/plaintext?tid=' .. counter .. '&param1=a&param2=b&param3=c&param4=d'
    thread:set('path', url)
    thread:set('id', counter)
    thread:set('user_agent', get_user_agent())
    table.insert(threads, thread)
    counter = counter + 1
end

function request()
    return wrk.format('GET', path, { ['User-Agent']=user_agent })
end

The script sets a few handled and unhandled HTTP headers and adds query string parameters to the requests. I've just confirmed: the memory issue does not occur if those headers and query string parameters are not added.

Edit: Looks to be the addition of the query string parameters that's causing it.
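
For reference, a handler variant that actually reads one of those query parameters might look like the sketch below (assuming lwan_request_get_query_param() from lwan.h); note that the growth shows up even when the handler never touches the parsed parameters, as the sample application above never reads them.

static enum lwan_http_status
plaintext_with_params(struct lwan_request *request,
                      struct lwan_response *response,
                      void *data __attribute__((unused)))
{
    /* e.g. GET /plaintext?tid=1&param1=a -> tid == "1" */
    const char *tid = lwan_request_get_query_param(request, "tid");

    if (!tid)
        return HTTP_BAD_REQUEST;

    strbuf_set_static(response->buffer, hello_world, sizeof(hello_world) - 1);
    response->mime_type = "text/plain";
    return HTTP_OK;
}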

lpereira commented 7 years ago

Ah, it now makes some sense. I have a hunch of what it could be. If you set the 'Connection' header to 'close', will the memory usage skyrocket as well?

juntalis commented 7 years ago

Doesn't look like wrk supports controlling connection persistence. Using Lua to set the header causes it to execute a single set of requests, then idle forever. I'll give it a shot in weighttp and report the results in about an hour or so when I get back in front of my PC.

juntalis commented 7 years ago

About an hour or fifteen - somewhere in that range.

Setting Connection: close doesn't appear to have any impact on memory usage, but I was able to replicate the issue by adding a few Cookies to the requests. I'd assume POST data is also affected, but I can verify if you'd like.

Edit: Also, I'm able to resolve the issue by manually cleaning up the two arrays at the end of my request handler:

        lwan_key_value_array_reset(&request->query_params);
        lwan_key_value_array_reset(&request->cookies);

Edit2: Managed to confirm: form-based POST data also shows the behavior.
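
For context, a sketch of the workaround folded into the sample handler from the top of this issue (same assumptions as the original sample; the lwan_key_value_array_reset() calls are the ones quoted above):

static enum lwan_http_status
plaintext(struct lwan_request *request,
          struct lwan_response *response,
          void *data __attribute__((unused)))
{
    strbuf_set_static(response->buffer, hello_world, sizeof(hello_world) - 1);
    response->mime_type = "text/plain";

    /* Workaround: release the per-request key/value allocations here,
     * rather than waiting for the connection to close. */
    lwan_key_value_array_reset(&request->query_params);
    lwan_key_value_array_reset(&request->cookies);

    return HTTP_OK;
}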

lpereira commented 7 years ago

This behavior is caused by the coroutine only cleaning up the memory allocations when a connection is closed, so your workaround is sound. I'll prepare a patch today to fix this.
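
To make the lifetime mismatch concrete, here is a rough, self-contained illustration (not lwan's actual internals; all names in it are made up): per-request parsing keeps appending to a growable array, but the array is only released when the keep-alive connection finally closes.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct kv_array {
    size_t count, capacity;
    char **items;
};

static void kv_append(struct kv_array *a, char *item)
{
    if (a->count == a->capacity) {
        a->capacity = a->capacity ? a->capacity * 2 : 8;
        a->items = realloc(a->items, a->capacity * sizeof(*a->items));
    }
    a->items[a->count++] = item;
}

static void kv_reset(struct kv_array *a)
{
    for (size_t i = 0; i < a->count; i++)
        free(a->items[i]);
    free(a->items);
    *a = (struct kv_array){0};
}

int main(void)
{
    struct kv_array query_params = {0};

    /* One long-lived keep-alive connection serving many requests: */
    for (int request = 0; request < 1000000; request++) {
        kv_append(&query_params, strdup("tid=1"));  /* "parsing" a query param */
        /* ...handler runs here... */
        /* The fix/workaround amounts to resetting per request instead:
         *     kv_reset(&query_params);
         */
    }

    printf("entries still held at connection close: %zu\n", query_params.count);
    kv_reset(&query_params);  /* without a per-request reset, cleanup only
                                 happens here, when the connection closes */
    return 0;
}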

lpereira commented 7 years ago

@juntalis Could you please try again? Commit 43f0536 might have fixed the issue.

juntalis commented 7 years ago

@lpereira Nice - that did it. Thanks for looking into this and fixing it so quickly. Will go ahead and close this issue.