kaltura / nginx-vod-module

NGINX-based MP4 Repackager
GNU Affero General Public License v3.0

Nginx eats all memory and the server goes to SWAP #1241

Open Vladislavik opened 3 years ago

Vladislavik commented 3 years ago

Hi, when I use vod_metadata_cache metadata_cache 30000m; I see that every Nginx process shows about 30G of VIRT memory. I have 48 such processes and 256GB of total server memory, and each of these processes shows roughly the same 30G VIRT.

When I send too much traffic to this server, Nginx eats all the memory on the server and it goes to SWAP.

Questions:

  1. Is vod_metadata_cache shared by all processes or per process? If it is shared by all, why do I need to give it a zone name that is never referenced in any other part of the Nginx config (unlike limit_req_zone, which is declared at the server level and then applied in a location)?
  2. Why does top show 30G VIRT memory per Nginx process instead of 30Gb for all processes together?
  3. If I want all Nginx processes together to use 30Gb for vod_metadata_cache, should I set it to 30Gb divided by the number of Nginx processes?

erankor commented 3 years ago
  1. The memory is shared, if you check with ps you'll see it counted on each process, but if you run free -m for example, you will see it's shared.
  2. It makes sense because the metadata cache is mapped to all processes, and consumes this amount of virtual space. But, in terms of physical memory, they are all mapped to the same physical pages.
  3. No... but 30G is quite a lot, I don't think you need that much; we are using 4G and getting a cache hit ratio above 95%.
Vladislavik commented 3 years ago

OK, I will try 4Gb. Which parameter do you think I should check to find out what goes wrong with memory consumption when many different files are accessed? I have a server with 24 HDDs and a config like this:

worker_processes auto; # resolves to 48 workers on this machine
worker_cpu_affinity auto;
thread_pool default_pool threads=16;
events {
    worker_connections  4096;
    use epoll;
    accept_mutex off;
    multi_accept on;
    worker_aio_requests 2048;
}
http{
    tcp_nopush     on;
    tcp_nodelay    on;

    vod_mode local;
    vod_fallback_upstream_location /fallback;
    vod_last_modified 'Sun, 19 Nov 2000 08:52:00 GMT';
    vod_last_modified_types *;
    vod_segment_duration 20000;
    vod_hls_absolute_master_urls off;
    vod_hls_absolute_index_urls off;
    vod_hls_container_format mpegts;
    vod_hls_absolute_iframe_urls off;
    vod_force_playlist_type_vod on;
    vod_hls_segment_file_name_prefix Frag;
    vod_open_file_thread_pool default_pool;
    vod_metadata_cache metadata_cache 4098m; #was 30000m
    vod_response_cache response_cache 128m;
    vod_performance_counters perf_counters;
    vod_output_buffer_pool 64k 32;
    vod_hls_mpegts_align_frames on;
    vod_hls_mpegts_interleave_frames on;

     open_file_cache          max=10000 inactive=2m;
     open_file_cache_valid    3h;
     open_file_cache_min_uses 1;
     open_file_cache_errors   on;

     sendfile on;
     sendfile_max_chunk 512k;

     aio            threads=default_pool;
     aio_write      on;
     send_timeout 20s;
     reset_timedout_connection on;

        server {
             output_buffers   1 512k;
                location @m3u8 {
                        root /var/www/$path/;
                        vod hls;
                }
        }
}

When traffic reaches about 6Gbps, mostly to different files, with around 10k rather slow clients, Nginx grows from its regular memory footprint (about 30Gb) to 100% of memory (256Gb), and the server goes to SWAP and dies. Before Nginx gets into this state, the disks are about 70% busy.

erankor commented 3 years ago

Slow pulls from the module can indeed be a problem, since the module builds the entire request in memory, without waiting for it to be pulled. In general, the recommended approach for large scale deployments is to put a CDN/caching proxies in front of this module. This way the module is not expected to get slow pulls, and once a segment is pulled, it can be served to additional users from the CDN/proxy cache.
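
For reference, a minimal sketch of such a caching layer on a separate front-end (the cache path, zone name, sizes and backend address are illustrative assumptions, not values from this thread):

proxy_cache_path /var/cache/nginx/vod levels=1:2 keys_zone=vod_cache:100m
                 max_size=50g inactive=10m use_temp_path=off;

server {
    listen 80;

    location /hls/ {
        proxy_pass http://127.0.0.1:8081;             # backend running nginx-vod-module
        proxy_cache vod_cache;
        proxy_cache_valid 200 206 10m;
        proxy_cache_lock on;                          # collapse concurrent misses for the same segment
        proxy_cache_use_stale error timeout updating; # serve stale while refreshing
    }
}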

Vladislavik commented 3 years ago

Maybe there is a way to regulate chunk creation, like a read buffer size: while the buffer is full, don't build the next part of the chunk. We use a CDN for popular content; it is the unpopular content that causes these problems.

erankor commented 3 years ago

I don't think there's currently an elegant solution for this. You could maybe proxy these requests through another location and have nginx buffer them to disk, or you could proxy the storage device and use nginx's rate limiting there.
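
A rough sketch of the first option, assuming the public location proxies an internal listener that runs the vod module so that nginx, rather than the module, buffers the generated segment for slow clients (ports, paths and buffer sizes are placeholders):

server {
    listen 80;

    location /hls/ {
        proxy_pass http://127.0.0.1:8082;   # internal vod listener below
        proxy_buffering on;
        proxy_buffers 16 64k;
        proxy_max_temp_file_size 64m;       # spill the rest of the segment to a temp file on disk
    }
}

server {
    listen 127.0.0.1:8082;

    location /hls/ {
        root /var/www;
        vod hls;
    }
}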

Vladislavik commented 3 years ago

An update on this problem: when I have many slow requests from players, I see Nginx start eating RAM again, and it can consume all the memory on the server. When I stop the traffic to this server, Nginx does not free the memory; it keeps holding all of it. Why is the memory still full after the traffic stops, and why doesn't Nginx free it? Only restarting Nginx frees the memory.

erankor commented 3 years ago

It's probably because of the behavior of the heap; I've seen it on another project - even if all malloc'ed blocks are free'd, the process memory does not go back to what it was. On that other project this was problematic for me, so I made sure to allocate the memory in large chunks and used mmap/munmap instead of malloc/free; that solved it.

Vladislavik commented 3 years ago

How do I do this? I think I have found another workaround: when I disable keepalive from the balancer to kaltura, I don't see the memory growth anymore. Are the buffers cleaned up only when the connection is closed?

erankor commented 3 years ago
  1. It's not something you can configure... I changed the code on that other project to work this way
  2. The module allocates memory from the nginx request pool, and I'm quite sure it gets freed when the request is finalized (not only on connection close). You can try limiting the number of requests per connection or the time a connection can be reused, and see if it makes a difference (see the sketch below).
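A minimal sketch of those keepalive limits, assuming they go in the server block that faces the balancer (the values are only examples, and keepalive_time requires nginx 1.19.10 or newer):

keepalive_requests 100;   # recycle the connection from the balancer after 100 requests
keepalive_time     30m;   # hard cap on how long a single keepalive connection is reused
keepalive_timeout  15s;   # idle timeout
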
Vladislavik commented 2 years ago

Can you make a patch that uses mmap/munmap for the memory, please? The memory grows again and is never freed, even when I remove all traffic from the server and close all connections; I can only reclaim it by restarting nginx. Or maybe there is some bug in the module that sometimes eats memory when there are many concurrent connections to different files (we use a CDN).

erankor commented 2 years ago

Sorry, I have no plans to implement such a patch, it doesn't make sense here... Some things you can check -

  1. Enable nginx stub status (see the sketch after this list), and check whether the number of active requests constantly increases - if there are aio requests that never close, it makes sense that it leaks.
  2. Run the module with debug enabled - nginx will log all calls to malloc/free/memalign, and then the log could be analyzed to check if something is leaking.
  3. Another option is to use valgrind - that will probably have a more significant impact on performance, and would also require the 'no pool' patch in order to get something actionable, but it will make it clear whether something is leaking.
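For point 1, a minimal stub_status endpoint could look like this, assuming nginx was built with --with-http_stub_status_module (port and path are arbitrary):

server {
    listen 127.0.0.1:8080;

    location = /nginx_status {
        stub_status;        # exposes active connections, accepts/handled/requests, reading/writing/waiting
        allow 127.0.0.1;
        deny  all;
    }
}
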
Vladislavik commented 2 years ago

Why does nginx still hold the memory after I close all connections and traffic drops to 0 MB/s? I think there is a memory leak somewhere in the module.

erankor commented 2 years ago

I understand... what I wrote above still applies: start by checking the number of connections reported by nginx stub status - it can be >0 if requests are blocked on IO.

Vladislavik commented 2 years ago

When I block all traffic to the server and check the status page, I see this:

Active connections: 1 (the 1 is my own request to /nginx_status)
server accepts handled requests
587912664 587912664 967363901
Reading: 0 Writing: 1 Waiting: 0

Memory used by nginx is 70Gb and it does not go down. If I restart nginx, it uses only 5-7Gb until heavy traffic comes again.

Vladislavik commented 2 years ago

This is screenshot of memory: https://ibb.co/ThvH81Q

erankor commented 2 years ago

Ok, so it's not stuck requests, that's good... Looking again at your conf, I see you have aio threads=default_pool. I never tried this setup - maybe try aio on instead?

Vladislavik commented 2 years ago

OK, I will try it, but I don't know when I will be able to tell you the result - only if the heavy memory consumption happens again.

Vladislavik commented 2 years ago

No, "aio on" did not help; after 6k concurrent users watching different videos, memory goes up again.

Active connections: 7195
server accepts handled requests
13868982 13868982 24693531
Reading: 0 Writing: 3112 Waiting: 4067

erankor commented 2 years ago

But here you show 7k active connections - or did you mean that in this case too, after they dropped back to near zero, memory usage was still high?

Another thing you can try is to configure a server with a setup identical to the production server and test it with valgrind. You'll need to apply the no-pool patch (https://github.com/openresty/no-pool-nginx) and run nginx as a single process. You can pull a list of requests from your prod server and replay them on the test server (you can use this test script - https://github.com/kaltura/nginx-vod-module/blob/master/test/uri_compare.py). When you stop nginx in an orderly way (nginx -s stop), valgrind will report any leaks. I ran this test on my environment a long time ago and no leaks were found... but maybe in your case there is a leak due to a different conf / some external lib etc.
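
For the single-process valgrind run, the main-context overrides could look like this (a sketch; the rest of the conf would mirror production, and nginx would be launched in the foreground under something like valgrind --leak-check=full):

daemon           off;   # keep nginx in the foreground under valgrind
master_process   off;   # single process, so all allocations stay in one PID
worker_processes 1;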

Vladislavik commented 2 years ago

Yes, after those 7k connections, if I block all traffic again, the memory stays high; it does not go down.