Are you using any of the caches of this module? I don't see them in this conf snippet, but maybe you didn't paste it. If so, then it makes sense that memory usage grows until the cache is populated.
Another option is that requests are not being closed - are you seeing an increasing number of nginx connections? (I'm usually looking at the number of handles in /proc/<nginx worker pid>/fd.)
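For example, a quick way to count those handles per worker (a rough sketch; the pgrep pattern is an assumption and may need adjusting for your setup):
# count open file descriptors per nginx worker process
for pid in $(pgrep -f 'nginx: worker'); do
    echo "$pid: $(ls /proc/$pid/fd | wc -l) fds"
done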
If that is the case, it may be because of aio requests that get stuck; I see it occasionally in our env.
If you reload nginx, do you get errors such as "open socket #1234 left in connection 123"?
Finally, are you using any other custom modules?
The above conf is all; the rest is some rewrite rules and auth stuff.
Output of top:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
25911 nginx 20 0 1008m 968m 1344 S 0.0 26.1 182:54.94 nginx: worker process
25910 nginx 20 0 939m 899m 1344 S 1.0 24.2 192:56.41 nginx: worker process
25909 nginx 20 0 772m 732m 1344 S 1.0 19.7 178:18.31 nginx: worker process
25912 nginx 20 0 584m 544m 1344 S 4.0 14.6 183:04.70 nginx: worker process
And the file descriptors:
root@server:~# ls -l /proc/25911/fd/ | grep socket
lrwx------ 1 nginx nginx 64 Sep 6 08:02 11 -> socket:[3829768]
lrwx------ 1 nginx nginx 64 Sep 7 02:31 15 -> socket:[17675135]
lrwx------ 1 nginx nginx 64 Sep 6 08:02 17 -> socket:[17674720]
lrwx------ 1 nginx nginx 64 Sep 6 08:08 18 -> socket:[17674723]
lrwx------ 1 nginx nginx 64 Sep 6 08:02 23 -> socket:[17662717]
lrwx------ 1 nginx nginx 64 Sep 6 08:02 25 -> socket:[17662724]
lrwx------ 1 nginx nginx 64 Sep 6 08:02 3 -> socket:[3829763]
lrwx------ 1 nginx nginx 64 Sep 7 02:31 34 -> socket:[17662719]
lrwx------ 1 nginx nginx 64 Sep 7 02:31 37 -> socket:[17662723]
lrwx------ 1 nginx nginx 64 Sep 6 08:08 5 -> socket:[17675132]
lrwx------ 1 nginx nginx 64 Sep 6 08:02 6 -> socket:[3829408]
lrwx------ 1 nginx nginx 64 Sep 6 08:02 7 -> socket:[3829769]
lrwx------ 1 nginx nginx 64 Sep 6 08:02 8 -> socket:[3829765]
And for the custom modules:
nginx-statsd
ngx_devel_kit
set-misc-nginx-module
nginx-vod-module
lua-nginx-module
And we do not see any errors when we reload nginx.
There's something I don't understand about your conf - you don't have vod_mode there, which means it's running in local mode, and yet you are setting vod_upstream_location. Is there something else missing?
@erankor https://gist.github.com/manu844/ea0cd1d6a4266c5a5aacb007459fa259 is a more detailed conf. Let me know if you need anything else.
I see this conf has the nginx-vod caches, but they are commented out - are you sure the running configuration indeed doesn't have them?
@erankor yes, we are sure they are commented out. At first, when we deployed nginx-vod, we had the caches enabled, but it started to consume memory, so we commented them out - and even after commenting them out we are still seeing the memory leak.
It's pretty hard to understand what happens here remotely. What I can suggest is that you set up some non-production server with your configuration and run valgrind to check if there are leaks. I've done it in the past with the configuration we use and found none (99% of the memory allocations are managed by nginx and get freed automatically when the request ends).
To do that, you will need to:
1. Add the following to nginx.conf: worker_processes 1; daemon off; master_process off;
2. Start nginx under valgrind: sudo /usr/local/bin/valgrind -v --tool=memcheck --leak-check=yes /usr/local/nginx/sbin/nginx
3. After sending some traffic through it, stop nginx with: /usr/local/nginx/sbin/nginx -s stop
@erankor Thanks, we'll check this. Will get back to you with details next week.
Just thought of something worth checking - the number of active connections reported by nginx_status. When we had problems with stuck aio requests, we saw a big difference between the number reported by nginx_status and the actual number of connected sockets seen with netstat.
@manu844, can this be closed?
@erankor really sorry about taking so much time. We didn't get a chance to do the testing using valgrind; we will be doing it this week. Can we please keep this open till then? :)
@erankor we did the leak test using valgrind. But we couldn't find anything. Here is the output.
root@server:~# cat /var/log/valgrind.output
==23857== Conditional jump or move depends on uninitialised value(s)
==23857== at 0x436C01: ngx_open_cached_file (ngx_open_file_cache.c:275)
==23857== by 0x46DD78: ngx_http_static_handler (ngx_http_static_module.c:102)
==23857== by 0x45192F: ngx_http_core_content_phase (ngx_http_core_module.c:1403)
==23857== by 0x44C104: ngx_http_core_run_phases (ngx_http_core_module.c:873)
==23857== by 0x457180: ngx_http_process_request (ngx_http_request.c:1910)
==23857== by 0x457EBF: ngx_http_process_request_line (ngx_http_request.c:1021)
==23857== by 0x440AAB: ngx_epoll_process_events (ngx_epoll_module.c:822)
==23857== by 0x4391B1: ngx_process_events_and_timers (ngx_event.c:248)
==23857== by 0x4403D7: ngx_single_process_cycle (ngx_process_cycle.c:308)
==23857== by 0x41FE65: main (nginx.c:416)
==23857==
--23857-- Discarding syms at 0x6d77260-0x6d7baf8 in /lib/x86_64-linux-gnu/libnss_compat-2.13.so due to munmap()
--23857-- Discarding syms at 0x7198070-0x719e208 in /lib/x86_64-linux-gnu/libnss_nis-2.13.so due to munmap()
--23857-- Discarding syms at 0x6f82020-0x6f8dc48 in /lib/x86_64-linux-gnu/libnsl-2.13.so due to munmap()
--23857-- Discarding syms at 0x73a3180-0x73aa148 in /lib/x86_64-linux-gnu/libnss_files-2.13.so due to munmap()
--23857-- Discarding syms at 0x75ae000-0x75b10b8 in /lib/x86_64-linux-gnu/libnss_dns-2.13.so due to munmap()
--23857-- Discarding syms at 0x77b68c0-0x77c2978 in /lib/x86_64-linux-gnu/libresolv-2.13.so due to munmap()
==23857==
==23857== HEAP SUMMARY:
==23857== in use at exit: 313,856 bytes in 2,763 blocks
==23857== total heap usage: 48,949,677 allocs, 48,946,914 frees, 317,099,575,313 bytes allocated
==23857==
==23857== Searching for pointers to 2,763 not-freed blocks
==23857== Checked 760,672 bytes
==23857==
==23857== LEAK SUMMARY:
==23857== definitely lost: 0 bytes in 0 blocks
==23857== indirectly lost: 0 bytes in 0 blocks
==23857== possibly lost: 0 bytes in 0 blocks
==23857== still reachable: 313,856 bytes in 2,763 blocks
==23857== suppressed: 0 bytes in 0 blocks
==23857== Reachable blocks (those to which a pointer was found) are not shown.
==23857== To see them, rerun with: --leak-check=full --show-reachable=yes
==23857==
==23857== Use --track-origins=yes to see where uninitialised values come from
==23857== ERROR SUMMARY: 495 errors from 1 contexts (suppressed: 4 from 4)
==23857==
==23857== 495 errors in context 1 of 1:
==23857== Conditional jump or move depends on uninitialised value(s)
==23857== at 0x436C01: ngx_open_cached_file (ngx_open_file_cache.c:275)
==23857== by 0x46DD78: ngx_http_static_handler (ngx_http_static_module.c:102)
==23857== by 0x45192F: ngx_http_core_content_phase (ngx_http_core_module.c:1403)
==23857== by 0x44C104: ngx_http_core_run_phases (ngx_http_core_module.c:873)
==23857== by 0x457180: ngx_http_process_request (ngx_http_request.c:1910)
==23857== by 0x457EBF: ngx_http_process_request_line (ngx_http_request.c:1021)
==23857== by 0x440AAB: ngx_epoll_process_events (ngx_epoll_module.c:822)
==23857== by 0x4391B1: ngx_process_events_and_timers (ngx_event.c:248)
==23857== by 0x4403D7: ngx_single_process_cycle (ngx_process_cycle.c:308)
==23857== by 0x41FE65: main (nginx.c:416)
==23857==
--23857--
--23857-- used_suppression: 4 dl-hack3-cond-1
==23857==
==23857== ERROR SUMMARY: 495 errors from 1 contexts (suppressed: 4 from 4)
Is there anything else we can test?
@MansoorMajeed, did you see Eran's comment here? https://github.com/kaltura/nginx-vod-module/issues/408#issuecomment-249250639
Can you go to http://your-nginx-host/nginx_status and compare the number of active connections with what netstat reports and see if there's a discrepancy there?
In your config, I see you're limiting access to that to 127.0.0.1 (https://gist.github.com/manu844/ea0cd1d6a4266c5a5aacb007459fa259#file-gistfile1-txt-L56), so you can use links or curl http://127.0.0.1/nginx_status from the Nginx machine's shell to get the info.
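For example, something along these lines to put the two numbers side by side (a sketch; the status URL is the one mentioned above, and port 80 is an assumption - use whatever port your server block listens on):
# connections as reported by the stub_status module
curl -s http://127.0.0.1/nginx_status | head -1
# sockets actually open on the serving port, for comparison (port 80 assumed)
netstat -atn | grep ':80 ' | wc -l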
Another question: did you compile Nginx yourself, or are you using the deb/RPM package? If you compiled it yourself, can you perhaps test with the precompiled packages and see if that makes a difference?
@erankor @jessp01 Sorry, I missed that one.
I just checked it and we do see a huge difference.
root@pdvod-1:~# curl http://127.0.0.1:8089/nginx_status
Active connections: 2744
server accepts handled requests
5018209 5018209 5381172
Reading: 0 Writing: 2103 Waiting: 641
root@pdvod-1:~# netstat -atn | grep :8089 | wc -l
79
And we do have aio enabled for this server block. Should I just disable aio, or how did you fix the issue with it?
Regarding your question about the nginx package: we compiled it ourselves and created a deb package. Since we need some other modules, I'm not sure if it's possible to use a precompiled one.
Hmm, I would be interested in the values you have for:
# sysctl net.ipv4.tcp_fin_timeout
# sysctl net.ipv4.tcp_keepalive_time
# sysctl net.ipv4.tcp_tw_recycle
# sysctl net.ipv4.tcp_tw_reuse
Also, what additional modules have you got compiled into Nginx? Perhaps they've got something to do with the issue as well, though I am a bit doubtful.
Here are the values:
root@pdvod-1:~# sysctl net.ipv4.tcp_fin_timeout
net.ipv4.tcp_fin_timeout = 60
root@pdvod-1:~# sysctl net.ipv4.tcp_keepalive_time
net.ipv4.tcp_keepalive_time = 7200
root@pdvod-1:~# sysctl net.ipv4.tcp_tw_recycle
net.ipv4.tcp_tw_recycle = 0
root@pdvod-1:~# sysctl net.ipv4.tcp_tw_reuse
net.ipv4.tcp_tw_reuse = 0
root@pdvod-1:~#
And as for the modules - these are the ones we use in addition to the core modules:
nginx-statsd
ngx_devel_kit
set-misc-nginx-module
nginx-vod-module
lua-nginx-module
As mentioned by @erankor, if aio is causing this issue, can we simply disable it?
I wouldn't recommend disabling aio just yet. On your test machine, can you set net.ipv4.tcp_tw_recycle = 1, net.ipv4.tcp_tw_reuse = 1 and net.ipv4.tcp_fin_timeout = 30, and perhaps try a lower number for net.ipv4.tcp_keepalive_time as well?
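A minimal sketch of applying those values at runtime (they won't survive a reboot unless also added to /etc/sysctl.conf; the keepalive value below is an arbitrary test value, not a recommendation):
sysctl -w net.ipv4.tcp_tw_recycle=1
sysctl -w net.ipv4.tcp_tw_reuse=1
sysctl -w net.ipv4.tcp_fin_timeout=30
# lower than the default 7200s shown above; value picked only for testing
sysctl -w net.ipv4.tcp_keepalive_time=600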
I'll let @erankor update on the specific issue he's seen since I was not involved in that one but I suspect tweaking the above settings may help.
There's something I'm missing here - you wrote you are not getting these errors on reload:
open socket #1234 left in connection 123
Are you sure about this? Note that it may take some time after the reload for this to appear (I think nginx waits until some timeout expires or until some connections are closed).
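For example, something like this against the error log should surface them after a reload (a sketch; the log path is an assumption based on a common default):
# wait a while after the reload, then look for leaked-connection alerts
grep 'left in connection' /var/log/nginx/error.log | tail -5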
In our environment, I'm occasionally getting these errors (it can run for weeks without this problem, but sometimes the number of connections just jumps). I followed the instructions provided by nginx to debug it (https://www.nginx.com/resources/wiki/start/topics/tutorials/debugging/#socket-leaks) and saw that all stuck requests were waiting on aio.
Unfortunately, at the moment there is no solution to this. I'm not even sure the problem is in this module, because in our env the vast majority of stuck requests were progressive download (even without clipping), and nginx-vod-module hardly does anything with these - it just tells nginx to return the file. To test this, I tried to patch nginx to call io_cancel right before printing this error message, and it didn't work (io_cancel returned EINVAL), so I think the request did complete, but for some reason the completion handler did not execute. Either way, I would not recommend disabling aio - you'd be better off reloading nginx every X days. If you disable aio, nginx-vod will handle a single request at a time per process, which is very inefficient.
Finally, a couple of additional questions -
I checked one more thing - in our environment, when it happens, the read buffer has the correct data, which confirms my assumption that the aio read operation indeed completed, but for some reason the handler didn't execute.
I think it's worth trying to increase worker_aio_requests. There is very little documentation about this directive, but maybe increasing its value (the default is 32) will help.
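For reference, a sketch of where this goes in nginx.conf (worker_aio_requests belongs in the events block; 1024 is the value erankor suggests further down in this thread):
events {
    # default is 32; raising it to see if stuck aio requests go away
    worker_aio_requests 1024;
}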
I have checked the error logs and I am not seeing any "open socket #1234 left in connection 123" errors. Also, when reloading nginx, I don't see any errors in stderr either.
Regarding your questions: we are not using vod_open_file_thread_pool at all. Also, regarding the worker_aio_requests directive, what value do you suggest?
As of now, we are reloading nginx on a regular basis to get around this issue.
We do not receive that many requests on this specific service - maybe 10-20 requests per second, and we have three machines to handle the traffic. In that case, can we simply disable aio?
Did you try tweaking the TCP kernel settings as I suggested?
@jessp01 Sorry, I forgot to mention that. I have applied the kernel parameters on all the machines and it still didn't make any difference. :(
I suggest raising worker_aio_requests to 1024 - it's a bit of a long shot, but maybe it will help. As I wrote, I don't think it's a good idea to disable aio; I expect this to have a detrimental effect on performance. But you can try it on one server and compare the request times. Btw, I don't think we asked this before, but what version of Linux are you using?
We are using Debian 7 64-bit. I'll start by increasing worker_aio_requests and will compare the memory usage.
An update from our side - we managed to prove the connection between stuck connections and the number of aio contexts:
Did this resolve the problem for you as well?
We were managing with periodic nginx reloads. I have just added the worker_aio_requests parameter on one of the servers; will report back in 2 days. Thanks for this :)
@erankor An update from our side: even with the directive, we are not seeing any positive result. However, one thing I just remembered - we are using nginx 1.8.1. Could you tell us which version of nginx you guys are using? Maybe we can recompile with a later version and it could help?
Up until very recently, we used Nginx 1.8.1. Currently we use the latest stable, which is 1.10.2. Can you perhaps try to use our deb package [see https://github.com/kaltura/nginx-vod-module#debianubuntu-deb-package] and check if that makes a difference? I understand you also compile with the set-misc-nginx-module and lua-nginx-module modules, which we do not include by default, but since 1.10.2 supports DSO you can load these additional modules without having to recompile the main Nginx binary.
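A sketch of what loading the extra modules dynamically could look like in the main context of nginx.conf, assuming they are built as shared objects (the .so names and paths depend on how you build them, so treat these as placeholders):
# ngx_devel_kit must be loaded before set-misc and lua
load_module modules/ndk_http_module.so;
load_module modules/ngx_http_set_misc_module.so;
load_module modules/ngx_http_lua_module.so;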
Yes, we will try that.
On production we are still on 1.8.0, though I really doubt the nginx version has anything to do with this. Based on the info provided so far, my guess is that it happens because of some other module, or maybe even something in the Lua code you are running (for example, if you have some global array that accumulates data or something similar). Since we increased worker_aio_requests, it has been running in our environment for weeks, pushing 1-2 Gbps without any increase in memory usage.
I concur with Eran's assumption that it is not about the version, especially since 1.8.n works fine both on our SaaS and many CE deployments.
My guess is that either one of the additional modules is causing an issue, or some third-party library your Nginx loads to support one of the modules - which is why I would like you to try our own deb for comparison. The version part is only relevant in the sense that 1.10.n supports DSO whereas 1.8.n does not, so with 1.10, if you see the deb works correctly, you can compile the additional modules as independent shared objects, load them one by one and see when things go wrong.
@erankor @jessp01 Hey guys, an update from our side. We have installed kaltura-nginx, which does the trimming and all. For the additional modules we need, we run another nginx proxy on each machine itself, which only does the SHA1 HMAC calculation and some other things needed to authenticate with Google buckets. Will update about the memory usage in a few days. Thank you 👍
Hi @MansoorMajeed,
Thanks for the update, let us know how it goes.
@jessp01 @erankor Hey guys, an update: we are using the latest version of the kaltura-nginx debian package along with another nginx proxy with the modules we require (as mentioned before), and we can confirm that the memory leak is not present now. The process releases memory properly and the issue seems to be fixed. Thank you so much for all the efforts. Should I close the issue now?
Yes, I'm fine with closing it since it is resolved. @erankor?
Thanks for the update
@erankor we have been using the module for a couple of weeks and it's been working great. But we have noticed that the nginx worker processes are getting heavier and heavier: right now, each nginx worker weighs ~900MB. The processes have been running since Aug 18.
The following is the nginx conf [non-important directives redacted]
Any idea why this might be happening? :)