Are you using any of the caches of this module? I don't see them in this conf snippet, but maybe you didn't paste it. If so, then it makes sense that memory usage grows until the cache is populated.
Another option is that requests are not being closed - are you seeing an increasing number of nginx connections? (I'm usually looking at the number of handles in /proc/<nginx worker pid>/fd.)
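For example, a quick way to count those handles per worker (a rough sketch; the pgrep pattern is an assumption and may need adjusting for your setup):
# count open file descriptors per nginx worker process
for pid in $(pgrep -f 'nginx: worker'); do
    echo "$pid: $(ls /proc/$pid/fd | wc -l) fds"
done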
If that is the case, it may be because of aio requests that get stuck; I see it occasionally in our env.
If you reload nginx, do you get errors such as "open socket #1234 left in connection 123"?
Finally, are you using any other custom modules?
The above conf is all; the rest is some rewrite rules and auth stuff.
Output of top:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
25911 nginx 20 0 1008m 968m 1344 S 0.0 26.1 182:54.94 nginx: worker process
25910 nginx 20 0 939m 899m 1344 S 1.0 24.2 192:56.41 nginx: worker process
25909 nginx 20 0 772m 732m 1344 S 1.0 19.7 178:18.31 nginx: worker process
25912 nginx 20 0 584m 544m 1344 S 4.0 14.6 183:04.70 nginx: worker process
And the file descriptors:
root@server:~# ls -l /proc/25911/fd/ | grep socket
lrwx------ 1 nginx nginx 64 Sep 6 08:02 11 -> socket:[3829768]
lrwx------ 1 nginx nginx 64 Sep 7 02:31 15 -> socket:[17675135]
lrwx------ 1 nginx nginx 64 Sep 6 08:02 17 -> socket:[17674720]
lrwx------ 1 nginx nginx 64 Sep 6 08:08 18 -> socket:[17674723]
lrwx------ 1 nginx nginx 64 Sep 6 08:02 23 -> socket:[17662717]
lrwx------ 1 nginx nginx 64 Sep 6 08:02 25 -> socket:[17662724]
lrwx------ 1 nginx nginx 64 Sep 6 08:02 3 -> socket:[3829763]
lrwx------ 1 nginx nginx 64 Sep 7 02:31 34 -> socket:[17662719]
lrwx------ 1 nginx nginx 64 Sep 7 02:31 37 -> socket:[17662723]
lrwx------ 1 nginx nginx 64 Sep 6 08:08 5 -> socket:[17675132]
lrwx------ 1 nginx nginx 64 Sep 6 08:02 6 -> socket:[3829408]
lrwx------ 1 nginx nginx 64 Sep 6 08:02 7 -> socket:[3829769]
lrwx------ 1 nginx nginx 64 Sep 6 08:02 8 -> socket:[3829765]
And for the custom modules:
nginx-statsd
ngx_devel_kit
set-misc-nginx-module
nginx-vod-module
lua-nginx-module
And we do not see any errors when we reload nginx.
There's something I don't understand about your conf - you don't have vod_mode there, which means it's running in local mode, and yet you are setting vod_upstream_location. Is there something else missing?
@erankor https://gist.github.com/manu844/ea0cd1d6a4266c5a5aacb007459fa259 is a more detailed conf. Let me know if you need anything else.
I see this conf has the nginx-vod caches, but they are commented out - are you sure the running configuration indeed doesn't have them?
@erankor yes, we are sure they are commented out. At first, when we deployed nginx-vod, we had the caches enabled, but it started to consume memory, so we commented them out - and even after commenting them out we are still seeing the memory leak.
It's pretty hard to understand what happens here remotely. What I can suggest is that you set up some non-production server with your configuration and run valgrind to check if there are leaks. I've done it in the past with the configuration we use and found none (99% of the memory allocations are managed by nginx and get freed automatically when the request ends).
To do that, you will need to:
1. Add the following to nginx.conf: worker_processes 1; daemon off; master_process off;
2. Start nginx under valgrind: sudo /usr/local/bin/valgrind -v --tool=memcheck --leak-check=yes /usr/local/nginx/sbin/nginx
3. After sending some traffic through it, stop nginx with: /usr/local/nginx/sbin/nginx -s stop
@erankor Thanks, we'll check this. Will get back to you with details next week.
Just thought of something worth checking - the number of active connections reported by nginx_status. When we had problems with stuck aio requests, we saw a big difference between the number reported by nginx_status and the actual number of connected sockets seen with netstat.
@manu844, can this be closed?
@erankor really sorry about taking so much time. We didn't get a chance to do the testing using valgrind; we will be doing it this week. Can we please keep this open till then? :)
@erankor we did the leak test using valgrind. But we couldn't find anything. Here is the output.
root@server:~# cat /var/log/valgrind.output
==23857== Conditional jump or move depends on uninitialised value(s)
==23857== at 0x436C01: ngx_open_cached_file (ngx_open_file_cache.c:275)
==23857== by 0x46DD78: ngx_http_static_handler (ngx_http_static_module.c:102)
==23857== by 0x45192F: ngx_http_core_content_phase (ngx_http_core_module.c:1403)
==23857== by 0x44C104: ngx_http_core_run_phases (ngx_http_core_module.c:873)
==23857== by 0x457180: ngx_http_process_request (ngx_http_request.c:1910)
==23857== by 0x457EBF: ngx_http_process_request_line (ngx_http_request.c:1021)
==23857== by 0x440AAB: ngx_epoll_process_events (ngx_epoll_module.c:822)
==23857== by 0x4391B1: ngx_process_events_and_timers (ngx_event.c:248)
==23857== by 0x4403D7: ngx_single_process_cycle (ngx_process_cycle.c:308)
==23857== by 0x41FE65: main (nginx.c:416)
==23857==
--23857-- Discarding syms at 0x6d77260-0x6d7baf8 in /lib/x86_64-linux-gnu/libnss_compat-2.13.so due to munmap()
--23857-- Discarding syms at 0x7198070-0x719e208 in /lib/x86_64-linux-gnu/libnss_nis-2.13.so due to munmap()
--23857-- Discarding syms at 0x6f82020-0x6f8dc48 in /lib/x86_64-linux-gnu/libnsl-2.13.so due to munmap()
--23857-- Discarding syms at 0x73a3180-0x73aa148 in /lib/x86_64-linux-gnu/libnss_files-2.13.so due to munmap()
--23857-- Discarding syms at 0x75ae000-0x75b10b8 in /lib/x86_64-linux-gnu/libnss_dns-2.13.so due to munmap()
--23857-- Discarding syms at 0x77b68c0-0x77c2978 in /lib/x86_64-linux-gnu/libresolv-2.13.so due to munmap()
==23857==
==23857== HEAP SUMMARY:
==23857== in use at exit: 313,856 bytes in 2,763 blocks
==23857== total heap usage: 48,949,677 allocs, 48,946,914 frees, 317,099,575,313 bytes allocated
==23857==
==23857== Searching for pointers to 2,763 not-freed blocks
==23857== Checked 760,672 bytes
==23857==
==23857== LEAK SUMMARY:
==23857== definitely lost: 0 bytes in 0 blocks
==23857== indirectly lost: 0 bytes in 0 blocks
==23857== possibly lost: 0 bytes in 0 blocks
==23857== still reachable: 313,856 bytes in 2,763 blocks
==23857== suppressed: 0 bytes in 0 blocks
==23857== Reachable blocks (those to which a pointer was found) are not shown.
==23857== To see them, rerun with: --leak-check=full --show-reachable=yes
==23857==
==23857== Use --track-origins=yes to see where uninitialised values come from
==23857== ERROR SUMMARY: 495 errors from 1 contexts (suppressed: 4 from 4)
==23857==
==23857== 495 errors in context 1 of 1:
==23857== Conditional jump or move depends on uninitialised value(s)
==23857== at 0x436C01: ngx_open_cached_file (ngx_open_file_cache.c:275)
==23857== by 0x46DD78: ngx_http_static_handler (ngx_http_static_module.c:102)
==23857== by 0x45192F: ngx_http_core_content_phase (ngx_http_core_module.c:1403)
==23857== by 0x44C104: ngx_http_core_run_phases (ngx_http_core_module.c:873)
==23857== by 0x457180: ngx_http_process_request (ngx_http_request.c:1910)
==23857== by 0x457EBF: ngx_http_process_request_line (ngx_http_request.c:1021)
==23857== by 0x440AAB: ngx_epoll_process_events (ngx_epoll_module.c:822)
==23857== by 0x4391B1: ngx_process_events_and_timers (ngx_event.c:248)
==23857== by 0x4403D7: ngx_single_process_cycle (ngx_process_cycle.c:308)
==23857== by 0x41FE65: main (nginx.c:416)
==23857==
--23857--
--23857-- used_suppression: 4 dl-hack3-cond-1
==23857==
==23857== ERROR SUMMARY: 495 errors from 1 contexts (suppressed: 4 from 4)
Is there anything else we can test?
@MansoorMajeed, did you see Eran's comment here? https://github.com/kaltura/nginx-vod-module/issues/408#issuecomment-249250639
Can you go to http://your-nginx-host/nginx_status and compare the number of active connections with what netstat reports and see if there's a discrepancy there?
In your config, I see you're limiting access to that to 127.0.0.1 (https://gist.github.com/manu844/ea0cd1d6a4266c5a5aacb007459fa259#file-gistfile1-txt-L56), so you can use links or curl http://127.0.0.1/nginx_status from the Nginx machine's shell to get the info.
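For example, something along these lines to put the two numbers side by side (a sketch; the status URL is the one mentioned above, and port 80 is an assumption - use whatever port your server block listens on):
# connections as reported by the stub_status module
curl -s http://127.0.0.1/nginx_status | head -1
# sockets actually open on the serving port, for comparison (port 80 assumed)
netstat -atn | grep ':80 ' | wc -l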
Another question: did you compile Nginx yourself, or are you using the deb/RPM package? If you compiled it yourself, can you perhaps test with the precompiled packages and see if that makes a difference?
@erankor @jessp01 Sorry, I missed that one.
I just checked it and we do see a huge difference.
root@pdvod-1:~# curl http://127.0.0.1:8089/nginx_status
Active connections: 2744
server accepts handled requests
5018209 5018209 5381172
Reading: 0 Writing: 2103 Waiting: 641
root@pdvod-1:~# netstat -atn | grep :8089 | wc -l
79
And we do have aio enabled for this server block. Should I just disable aio, or how did you fix the issue with it?
Regarding your question about the nginx package: we compiled it ourselves and created a deb package. Since we need some other modules, I'm not sure if it's possible to use a precompiled one.
Hmm, I would be interested in the values you have for:
# sysctl net.ipv4.tcp_fin_timeout
# sysctl net.ipv4.tcp_keepalive_time
# sysctl net.ipv4.tcp_tw_recycle
# sysctl net.ipv4.tcp_tw_reuse
Also, what additional modules have you got compiled into Nginx? Perhaps they've got something to do with the issue as well, though I am a bit doubtful.
Here are the values:
root@pdvod-1:~# sysctl net.ipv4.tcp_fin_timeout
net.ipv4.tcp_fin_timeout = 60
root@pdvod-1:~# sysctl net.ipv4.tcp_keepalive_time
net.ipv4.tcp_keepalive_time = 7200
root@pdvod-1:~# sysctl net.ipv4.tcp_tw_recycle
net.ipv4.tcp_tw_recycle = 0
root@pdvod-1:~# sysctl net.ipv4.tcp_tw_reuse
net.ipv4.tcp_tw_reuse = 0
root@pdvod-1:~#
And as for the modules - these are the ones we use in addition to the core modules:
nginx-statsd
ngx_devel_kit
set-misc-nginx-module
nginx-vod-module
lua-nginx-module
As mentioned by @erankor, if aio is causing this issue, can we simply disable it?
I wouldn't recommend disabling aio just yet. On your test machine, can you set net.ipv4.tcp_tw_recycle = 1, net.ipv4.tcp_tw_reuse = 1 and net.ipv4.tcp_fin_timeout = 30, and perhaps try a lower number for net.ipv4.tcp_keepalive_time as well?
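A minimal sketch of applying those values at runtime (they won't survive a reboot unless also added to /etc/sysctl.conf; the keepalive value below is an arbitrary test value, not a recommendation):
sysctl -w net.ipv4.tcp_tw_recycle=1
sysctl -w net.ipv4.tcp_tw_reuse=1
sysctl -w net.ipv4.tcp_fin_timeout=30
# lower than the default 7200s shown above; value picked only for testing
sysctl -w net.ipv4.tcp_keepalive_time=600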
I'll let @erankor update on the specific issue he's seen since I was not involved in that one but I suspect tweaking the above settings may help.
There's something I'm missing here - you wrote you are not getting these errors on reload:
open socket #1234 left in connection 123
Are you sure about this? Note that it may take some time after the reload for this to appear (I think nginx waits until some timeout expires or until some connections are closed).
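For example, something like this against the error log should surface them after a reload (a sketch; the log path is an assumption based on a common default):
# wait a while after the reload, then look for leaked-connection alerts
grep 'left in connection' /var/log/nginx/error.log | tail -5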
In our environment, I'm occasionally getting these errors (it can run for weeks without this problem, but sometimes the number of connections just jumps). I followed the instructions provided by nginx to debug it (https://www.nginx.com/resources/wiki/start/topics/tutorials/debugging/#socket-leaks) and saw that all stuck requests were waiting on aio.
Unfortunately, at the moment there is no solution to this. I'm not even sure the problem is in this module, because in our env the vast majority of stuck requests were progressive download (even without clipping), and nginx-vod-module hardly does anything with these - it just tells nginx to return the file. To test this, I tried to patch nginx to call io_cancel right before printing this error message, and it didn't work (io_cancel returned EINVAL), so I think the request did complete, but for some reason the completion handler did not execute. Either way, I would not recommend disabling aio - you'd be better off reloading nginx every X days. If you disable aio, nginx-vod will handle a single request at a time per process, which is very inefficient.
Finally, a couple of additional questions -
I checked one more thing - in our environment, when it happens, the read buffer has the correct data, which confirms my assumption that the aio read operation indeed completed, but for some reason the handler didn't execute.
I think it's worth trying to increase worker_aio_requests. There is very little documentation about this directive, but maybe increasing its value (the default is 32) will help.
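For reference, a sketch of where this goes in nginx.conf (worker_aio_requests belongs in the events block; 1024 is the value erankor suggests further down in this thread):
events {
    # default is 32; raising it to see if stuck aio requests go away
    worker_aio_requests 1024;
}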
I have checked the error logs and I am not seeing any "open socket #1234 left in connection 123" errors. Also, when reloading nginx, I don't see any errors in stderr either.
Regarding your questions: we are not using vod_open_file_thread_pool at all. Also, regarding the worker_aio_requests directive, what value do you suggest?
As of now, we are reloading nginx on a regular basis to get around this issue.
We do not receive that many requests on this specific service - maybe 10-20 requests per second, and we have three machines to handle the traffic. In that case, can we simply disable aio?
Did you try tweaking the TCP kernel settings as I suggested?
@jessp01 Sorry, I forgot to mention that. I have applied the kernel parameters on all the machines and it still didn't make any difference. :(
I suggest raising worker_aio_requests to 1024 - it's a bit of a long shot, but maybe it will help. As I wrote, I don't think it's a good idea to disable aio; I expect this to have a detrimental effect on performance. But you can try it on one server and compare the request times. Btw, I don't think we asked this before, but what version of Linux are you using?
We are using Debian 7 64-bit. I'll start by increasing worker_aio_requests and will compare the memory usage.
An update from our side - we managed to prove the connection between stuck connections and the number of aio contexts:
Did this resolve the problem for you as well?
We were managing with periodic nginx reloads. I have just added the worker_aio_requests parameter on one of the servers; will report back in 2 days. Thanks for this :)
@erankor An update from our side: even with the directive, we are not seeing any positive result. However, one thing I just remembered - we are using nginx 1.8.1. Could you tell us which version of nginx you guys are using? Maybe we can recompile with a later version and it could help?
Up until very recently, we used Nginx 1.8.1. Currently we use the latest stable, which is 1.10.2. Can you perhaps try to use our deb package [see https://github.com/kaltura/nginx-vod-module#debianubuntu-deb-package] and check if that makes a difference? I understand you also compile with the set-misc-nginx-module and lua-nginx-module modules, which we do not include by default, but since 1.10.2 supports DSO you can load these additional modules without having to recompile the main Nginx binary.
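A sketch of what loading the extra modules dynamically could look like in the main context of nginx.conf, assuming they are built as shared objects (the .so names and paths depend on how you build them, so treat these as placeholders):
# ngx_devel_kit must be loaded before set-misc and lua
load_module modules/ndk_http_module.so;
load_module modules/ngx_http_set_misc_module.so;
load_module modules/ngx_http_lua_module.so;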
Yes, we will try that.
On production we are still on 1.8.0, though I really doubt the nginx version has anything to do with this. Based on the info provided so far, my guess is that it happens because of some other module, or maybe even something in the Lua code you are running (for example, if you have some global array that accumulates data or something similar). Since we increased worker_aio_requests, it has been running in our environment for weeks, pushing 1-2 Gbps without any increase in memory usage.
I concur with Eran's assumption that it is not about the version, especially since 1.8.n works fine both on our SaaS and many CE deployments.
My guess is that either one of the additional modules is causing an issue, or some third-party library your Nginx loads to support one of the modules - which is why I would like you to try our own deb for comparison. The version part is only relevant in the sense that 1.10.n supports DSO whereas 1.8.n does not, so with 1.10, if you see the deb works correctly, you can compile the additional modules as independent shared objects, load them one by one and see when things go wrong.
@erankor @jessp01 Hey guys, an update from our side. We have installed kaltura-nginx, which does the trimming and all. For the additional modules we need, we run another nginx proxy on each machine itself, which only does the SHA1 HMAC calculation and some other things needed to authenticate with Google buckets. Will update about the memory usage in a few days. Thank you 👍
Hi @MansoorMajeed,
Thanks for the update, let us know how it goes.
@jessp01 @erankor Hey guys, an update: we are using the latest version of the kaltura-nginx debian package along with another nginx proxy with the modules we require (as mentioned before), and we can confirm that the memory leak is not present now. The process releases memory properly and the issue seems to be fixed. Thank you so much for all the efforts. Should I close the issue now?
Yes, I'm fine with closing it since it is resolved. @erankor?
Thanks for the update
@erankor we have been using the module for a couple of weeks and it's been working great. But we have noticed that the nginx worker processes are getting heavier and heavier: right now, each nginx worker weighs ~900MB. The processes have been running since Aug 18.
The following is the nginx conf [non-important directives redacted]
Any idea why this might be happening? :)