Thanks for the bug report! I can't reproduce it on my machine. I don't have any Amazon EC2 instance, but I can install an Ubuntu VM to try to reproduce this; what's the distribution version?
Thanks for the quick response. Distro details are:
ubuntu@benchmarker:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.3 LTS
Release: 16.04
Codename: xenial
Was also able to replicate it on a local Linux Mint VM:
charles@mint-vm ~ $ lsb_release -a
No LSB modules are available.
Distributor ID: LinuxMint
Description: Linux Mint 18.2 Sonya
Release: 18.2
Codename: sonya
and an Amazon Linux instance.
Shit - just realized I missed an important detail: I had been running the same command from history and didn't notice, but I've actually got a --script lwan-test.lua at the end of my wrk command. (See below for lwan-test.lua.)
-- Thread tracking
local counter = 1
local threads = {}
-- Set common headers
wrk.headers['Connection'] = 'keep-alive'
wrk.headers['Pragma'] = 'no-cache'
wrk.headers['Cache-Control'] = 'no-cache'
wrk.headers['Origin'] = 'http://google.com'
wrk.headers['Referer'] = 'http://google.com/s?q=foobar'
wrk.headers['Accept-Encoding'] = 'gzip, deflate, sdch, br'
-- User-Agent pool (not actually a class - just namespaced)
local user_agents = {
[[MT6572/V1 Linux/3.4.5 Android/4.2.2 Release/08.07.2013 Browser/AppleWebKit534.30 Profile/MIDP-2.0 Configuration/CLDC-1.1 Mobile Safari/534.30 Android 4.2.2;]],
[[Mozilla/4.0 (compatible; MSIE 10.0; Windows NT 6.1)]],
[[Mozilla/4.0 (compatible; MSIE 10.0; Windows NT 6.1; 360SE)]],
[[Mozilla/4.0 (compatible; MSIE 11.0; Windows NT 6.2; 360SE)]],
[[Mozilla/4.0 (compatible; MSIE 4.01; Windows NT)]],
[[Mozilla/4.0 (compatible; MSIE 4.01; Windows NT; MS Search 6.0 Robot)]],
[[Mozilla/4.0 (compatible; MSIE 5.00; Windows 98)]],
[[Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; Tablet PC 2.0; .NET4.0E; .NET CLR 1.1.4322)]],
[[Mozilla/5.0 (BB10; Kbd) AppleWebKit/537.10+ (KHTML, like Gecko) Version/10.1.0.1720 Mobile Safari/537.10+]],
[[Mozilla/5.0 (BB10; Touch) AppleWebKit/537.35+ (KHTML, like Gecko) Version/10.2.1.3247 Mobile Safari/537.35+]],
[[Mozilla/5.0 (BlackBerry; U; BlackBerry 9220; es) AppleWebKit/534.11+ (KHTML, like Gecko) Version/7.1.0.398 Mobile Safari/534.11+]],
[[Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 5 Build/JOP40D) AppleWebKit/535.19 (KHTML, like Gecko; googleweblight) Chrome/38.0.1025.166 Mobile Safari/535.19]],
[[Mozilla/5.0 (Linux; Android 4.2.2; AFTB Build/JDQ39) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.173 Mobile Safari/537.22]],
[[Mozilla/5.0 (Linux; Android 4.2.2; AT7-A Build/JDQ39) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.98 Safari/537.36]],
[[Mozilla/5.0 (Linux; Android 4.2.2; C2105 Build/15.3.A.1.17) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.95 Mobile Safari/537.36]],
[[Mozilla/5.0 (Linux; Android 4.2.2; Celkon A107+ Build/JDQ39) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.59 Mobile Safari/537.36]],
[[Mozilla/5.0 (Linux; Android 4.2.2; GT-I9105P Build/JDQ39) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.133 Mobile Safari/537.36]],
[[Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36]]
}
local function get_user_agent()
    local result, index
    while result == nil do
        local ua_count = #user_agents
        if ua_count == 0 then
            index = 1
            result = [[Mozilla/4.0 (compatible; MSIE 10.0; Windows NT 6.1)]]
        else
            index = math.random(1, ua_count)
            result = user_agents[index]
        end
    end
    user_agents[index] = nil
    return result
end
function setup(thread)
    local url = '/plaintext?tid=' .. counter .. '&param1=a&param2=b&param3=c&param4=d'
    thread:set('path', url)
    thread:set('id', counter)
    thread:set('user_agent', get_user_agent())
    table.insert(threads, thread)
    counter = counter + 1
end
function request()
    return wrk.format('GET', path, { ['User-Agent'] = user_agent })
end
The script sets a few handled and unhandled HTTP headers and adds query string parameters to the requests. I've just confirmed: the memory issue does not occur if those headers and query string parameters are not added.
Edit: Looks to be the addition of the query string parameters that's causing it.
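For reference, the handler I'm hitting is essentially a hello-world that parses the query string. A minimal sketch of that shape (not my exact sample app; the handler name is made up, and the lwan calls are written from memory, so exact names/signatures may differ between versions):

#include "lwan.h"

/* Sketch only: hypothetical handler name; assumes LWAN_HANDLER,
 * lwan_request_get_query_param() and lwan_strbuf_printf() roughly as they
 * appear in lwan's headers, which may vary between versions. */
LWAN_HANDLER(plaintext)
{
    /* Reading any query parameter forces the query string to be parsed into
     * the request's key/value array -- the allocation that appears to grow. */
    const char *tid = lwan_request_get_query_param(request, "tid");

    response->mime_type = "text/plain";
    lwan_strbuf_printf(response->buffer, "Hello, thread %s!\n", tid ? tid : "?");

    return HTTP_OK;
}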
Ah, now it makes some sense. I have a hunch about what it could be. If you set the 'Connection' header to 'close', will the memory usage skyrocket as well?
Doesn't look like wrk supports controlling connection persistence. Using Lua to set the header causes it to execute a single set of requests, then idle forever. I'll give it a shot in weighttp and report the results in about an hour or so when I get back in front of my PC.
About an hour or an hour fifteen - somewhere in that range.
Setting Connection: close doesn't appear to have any impact on memory usage, but I was able to replicate the issue by adding a few Cookies to the requests. I'd assume POST data is also affected, but I can verify if you'd like.
Edit: Also, I'm able to resolve the issue by manually cleaning up the two arrays at the end of my request handler:
lwan_key_value_array_reset(&request->query_params);
lwan_key_value_array_reset(&request->cookies);
Edit 2: Managed to confirm that form-based POST data is also showing the behavior.
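For clarity, in context the workaround just means ending the handler with those two resets, roughly like this (same caveats as the sketch above; the only essential lines are the two lwan_key_value_array_reset() calls, everything else is illustrative):

#include "lwan.h"

/* Sketch only: same shape of handler as before, with the workaround applied. */
LWAN_HANDLER(plaintext_with_reset)
{
    const char *tid = lwan_request_get_query_param(request, "tid");

    response->mime_type = "text/plain";
    lwan_strbuf_printf(response->buffer, "Hello, thread %s!\n", tid ? tid : "?");

    /* Workaround: release the parsed query parameters and cookies now rather
     * than waiting for the connection's coroutine to be torn down. */
    lwan_key_value_array_reset(&request->query_params);
    lwan_key_value_array_reset(&request->cookies);

    return HTTP_OK;
}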
This behavior is caused by the coroutine only cleaning up the memory allocations when a connection is closed, so your workaround is sound. I'll prepare a patch today to fix this.
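Schematically, the lifetime issue looks like this (not lwan's actual code, just an illustration: per-request key/value entries are owned by the connection's coroutine, so if they are only released at connection teardown, a keep-alive connection grows by one set of entries per request it serves):

#include <stddef.h>
#include <stdio.h>

/* Illustration only -- not lwan's implementation. */
struct per_connection_state {
    size_t query_param_entries;   /* accumulates across requests on one connection */
};

static void handle_request(struct per_connection_state *conn)
{
    conn->query_param_entries += 4;   /* e.g. param1..param4 parsed per request */
}

static void per_request_reset(struct per_connection_state *conn)
{
    conn->query_param_entries = 0;    /* what the workaround/fix does after each request */
}

int main(void)
{
    struct per_connection_state conn = { 0 };

    /* 100k keep-alive requests served by a single connection. */
    for (int i = 0; i < 100000; i++) {
        handle_request(&conn);
        /* Without this per-request reset, entries pile up until the connection
         * closes -- the growth the report describes. With it, usage stays flat. */
        per_request_reset(&conn);
    }

    printf("entries still held at connection close: %zu\n", conn.query_param_entries);
    return 0;
}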
@juntalis Could you please try again? Commit 43f0536 might have fixed the issue.
@lpereira Nice - that did it. Thanks for looking into this and fixing it so quickly. Will go ahead and close this issue.
I say possible, as I'm not yet familiar enough with lwan's internals to know whether it's a mistake I'm making or a part of the design I'm not aware of.
I started encountering OOM crashes after increasing the duration of my load tests, so after failing to find any cause in my code, I threw together the following sample application, compiled with the same Release flags as lwan:
And the corresponding .conf file:
To replicate the issue, I ran a 5-minute load test with wrk from another EC2 instance:
By the end of the 5 minutes, the lwan-sample process's memory usage was up to 5.6GB. I waited for 5 minutes just in case there was a deferred cleanup cycle I was unaware of, then hit the server with another 5-minute load test in order to grab the syslog of the crash when it exceeded 8GB: syslog
I spent some time today trying to track the issue down, but didn't get much further. The only scenario I found that didn't result in memory consumption issues was running the same load test against the lwan executable's serve_files handler. In that case, memory usage capped out at around 2.5GB~3GB and dropped back to its starting point when the load test completed.