kaltura / nginx-vod-module

NGINX-based MP4 Repackager
GNU Affero General Public License v3.0

Nginx eats all memory and the server goes to SWAP #1241

Open Vladislavik opened 3 years ago

Vladislavik commented 3 years ago

Hi, when I use vod_metadata_cache metadata_cache 30000m; I see that every Nginx process shows about 30G of VIRT memory. I have 48 such processes and 256GB of total server memory, and each of these processes shows roughly the same 30G VIRT.

When I send too much traffic to this server, Nginx eats all the memory on the server and it goes to SWAP.

Questions:

  1. Is vod_metadata_cache shared by all processes or per process? If it is shared by all, why do I need to give it a zone name that is never referenced in any other part of the Nginx config (unlike limit_req_zone, which is declared at the server level and then applied in a location)?
  2. Why does top show 30G VIRT memory per Nginx process instead of 30Gb for all processes together?
  3. If I want all Nginx processes together to use 30Gb for vod_metadata_cache, should I set it to 30Gb divided by the number of Nginx processes?

erankor commented 3 years ago
  1. The memory is shared, if you check with ps you'll see it counted on each process, but if you run free -m for example, you will see it's shared.
  2. It makes sense because the metadata cache is mapped to all processes, and consumes this amount of virtual space. But, in terms of physical memory, they are all mapped to the same physical pages.
  3. No... but 30G is quite a lot, I don't think you need that much; we are using 4G and getting a cache hit ratio above 95%.
Vladislavik commented 3 years ago

OK, I will try 4Gb. Which parameter do you think I should check to find out what goes wrong with memory consumption when many different files are accessed? I have a server with 24 HDDs and a config like this:

worker_processes auto; # resolves to 48 workers on this machine
worker_cpu_affinity auto;
thread_pool default_pool threads=16;
events {
    worker_connections  4096;
    use epoll;
    accept_mutex off;
    multi_accept on;
    worker_aio_requests 2048;
}
http{
    tcp_nopush     on;
    tcp_nodelay    on;

    vod_mode local;
    vod_fallback_upstream_location /fallback;
    vod_last_modified 'Sun, 19 Nov 2000 08:52:00 GMT';
    vod_last_modified_types *;
    vod_segment_duration 20000;
    vod_hls_absolute_master_urls off;
    vod_hls_absolute_index_urls off;
    vod_hls_container_format mpegts;
    vod_hls_absolute_iframe_urls off;
    vod_force_playlist_type_vod on;
    vod_hls_segment_file_name_prefix Frag;
    vod_open_file_thread_pool default_pool;
    vod_metadata_cache metadata_cache 4098m; #was 30000m
    vod_response_cache response_cache 128m;
    vod_performance_counters perf_counters;
    vod_output_buffer_pool 64k 32;
    vod_hls_mpegts_align_frames on;
    vod_hls_mpegts_interleave_frames on;

     open_file_cache          max=10000 inactive=2m;
     open_file_cache_valid    3h;
     open_file_cache_min_uses 1;
     open_file_cache_errors   on;

     sendfile on;
     sendfile_max_chunk 512k;

     aio            threads=default_pool;
     aio_write      on;
     send_timeout 20s;
     reset_timedout_connection on;

        server {
             output_buffers   1 512k;
                location @m3u8 {
                        root /var/www/$path/;
                        vod hls;
                }
        }
}

When traffic reaches about 6Gbps, mostly to different files, with around 10k rather slow clients, Nginx grows from its regular memory footprint (about 30Gb) to 100% of memory (256Gb), and the server goes to SWAP and dies. Before Nginx gets into this state, the disks are about 70% busy.

erankor commented 3 years ago

Slow pulls from the module can indeed be a problem, since the module builds the entire request in memory, without waiting for it to be pulled. In general, the recommended approach for large scale deployments is to put a CDN/caching proxies in front of this module. This way the module is not expected to get slow pulls, and once a segment is pulled, it can be served to additional users from the CDN/proxy cache.
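
For reference, a minimal sketch of such a caching layer on a separate front-end (the cache path, zone name, sizes and backend address are illustrative assumptions, not values from this thread):

proxy_cache_path /var/cache/nginx/vod levels=1:2 keys_zone=vod_cache:100m
                 max_size=50g inactive=10m use_temp_path=off;

server {
    listen 80;

    location /hls/ {
        proxy_pass http://127.0.0.1:8081;             # backend running nginx-vod-module
        proxy_cache vod_cache;
        proxy_cache_valid 200 206 10m;
        proxy_cache_lock on;                          # collapse concurrent misses for the same segment
        proxy_cache_use_stale error timeout updating; # serve stale while refreshing
    }
}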

Vladislavik commented 3 years ago

Maybe there is a way to regulate chunk creation, like a read buffer size: while the buffer is full, don't build the next part of the chunk. We use a CDN for popular content; it is the unpopular content that causes these problems.

erankor commented 3 years ago

I don't think there's currently an elegant solution for this. You could maybe proxy these requests through another location and have nginx buffer them to disk, or you could proxy the storage device and use nginx's rate limiting there.
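
A rough sketch of the first option, assuming the public location proxies an internal listener that runs the vod module so that nginx, rather than the module, buffers the generated segment for slow clients (ports, paths and buffer sizes are placeholders):

server {
    listen 80;

    location /hls/ {
        proxy_pass http://127.0.0.1:8082;   # internal vod listener below
        proxy_buffering on;
        proxy_buffers 16 64k;
        proxy_max_temp_file_size 64m;       # spill the rest of the segment to a temp file on disk
    }
}

server {
    listen 127.0.0.1:8082;

    location /hls/ {
        root /var/www;
        vod hls;
    }
}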

Vladislavik commented 3 years ago

An update on this problem: when I have many slow requests from players, I see Nginx start eating RAM again, and it can consume all the memory on the server. When I stop the traffic to this server, Nginx does not free the memory; it keeps holding all of it. Why is the memory still full after the traffic stops, and why doesn't Nginx free it? Only restarting Nginx frees the memory.

erankor commented 3 years ago

It's probably because of the behavior of the heap; I've seen it on another project - even if all malloc'ed blocks are free'd, the process memory does not go back to what it was. On that other project this was problematic for me, so I made sure to allocate the memory in large chunks and used mmap/munmap instead of malloc/free; that solved it.

Vladislavik commented 3 years ago

How do I do this? I think I have found another workaround: when I disable keepalive from the balancer to kaltura, I don't see the memory growth anymore. Are the buffers cleaned up only when the connection is closed?

erankor commented 3 years ago
  1. It's not something you can configure... I changed the code on that other project to work this way
  2. The module allocates memory from the nginx request pool, and I'm quite sure it gets freed when the request is finalized (not only on connection close). You can try limiting the number of requests per connection or the time a connection can be reused, and see if it makes a difference (see the sketch below).
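A minimal sketch of those keepalive limits, assuming they go in the server block that faces the balancer (the values are only examples, and keepalive_time requires nginx 1.19.10 or newer):

keepalive_requests 100;   # recycle the connection from the balancer after 100 requests
keepalive_time     30m;   # hard cap on how long a single keepalive connection is reused
keepalive_timeout  15s;   # idle timeout
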
Vladislavik commented 2 years ago

Can you make a patch that uses mmap/munmap for the memory, please? The memory grows again and is never freed, even when I remove all traffic from the server and close all connections; I can only reclaim it by restarting nginx. Or maybe there is some bug in the module that sometimes eats memory when there are many concurrent connections to different files (we use a CDN).

erankor commented 2 years ago

Sorry, I have no plans to implement such a patch, it doesn't make sense here... Some things you can check -

  1. Enable nginx stub status (see the sketch after this list), and check whether the number of active requests constantly increases - if there are aio requests that never close, it makes sense that it leaks.
  2. Run the module with debug enabled - nginx will log all calls to malloc/free/memalign, and then the log could be analyzed to check if something is leaking.
  3. Another option is to use valgrind - that will probably have a more significant impact on performance, and would also require the 'no pool' patch in order to get something actionable, but it will make it clear whether something is leaking.
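For point 1, a minimal stub_status endpoint could look like this, assuming nginx was built with --with-http_stub_status_module (port and path are arbitrary):

server {
    listen 127.0.0.1:8080;

    location = /nginx_status {
        stub_status;        # exposes active connections, accepts/handled/requests, reading/writing/waiting
        allow 127.0.0.1;
        deny  all;
    }
}
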
Vladislavik commented 2 years ago

Why does nginx still hold the memory after I close all connections and traffic drops to 0 MB/s? I think there is a memory leak somewhere in the module.

erankor commented 2 years ago

I understand... what I wrote above still applies: start by checking the number of connections reported by nginx stub status - it can be >0 if requests are blocked on IO.

Vladislavik commented 2 years ago

When I block all traffic to the server and check the status page, I see this:

Active connections: 1 (the 1 is my own request to /nginx_status)
server accepts handled requests
587912664 587912664 967363901
Reading: 0 Writing: 1 Waiting: 0

Memory used by nginx is 70Gb and it does not go down. If I restart nginx, it uses only 5-7Gb until heavy traffic comes again.

Vladislavik commented 2 years ago

This is screenshot of memory: https://ibb.co/ThvH81Q

erankor commented 2 years ago

Ok, so it's not stuck requests, that's good... Looking again at your conf, I see you have aio threads=default_pool. I never tried this setup - maybe try aio on instead?

Vladislavik commented 2 years ago

OK, I will try it, but I don't know when I will be able to tell you the result - only if the heavy memory consumption happens again.

Vladislavik commented 2 years ago

No, "aio on" did not help; after 6k concurrent users watching different videos, memory goes up again.

Active connections: 7195
server accepts handled requests
13868982 13868982 24693531
Reading: 0 Writing: 3112 Waiting: 4067

erankor commented 2 years ago

But here you show 7k active connections - or did you mean that in this case too, after they dropped back to near zero, memory usage was still high?

Another thing you can try is to configure a server with a setup identical to the production server and test it with valgrind. You'll need to apply the no-pool patch (https://github.com/openresty/no-pool-nginx) and run nginx as a single process. You can pull a list of requests from your prod server and replay them on the test server (you can use this test script - https://github.com/kaltura/nginx-vod-module/blob/master/test/uri_compare.py). When you stop nginx in an orderly way (nginx -s stop), valgrind will report any leaks. I ran this test on my environment a long time ago and no leaks were found... but maybe in your case there is a leak due to a different conf / some external lib etc.
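
For the single-process valgrind run, the main-context overrides could look like this (a sketch; the rest of the conf would mirror production, and nginx would be launched in the foreground under something like valgrind --leak-check=full):

daemon           off;   # keep nginx in the foreground under valgrind
master_process   off;   # single process, so all allocations stay in one PID
worker_processes 1;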

Vladislavik commented 2 years ago

Yes, after those 7k connections, if I block all traffic again, the memory stays high; it does not go down.