Memory leak in log_write/warnings to stderr not flushing

Jamie0 commented 1 year ago

I've observed that, when streaming to a mount point in HE-AAC format, icecast-kh leaks memory. Roughly every minute, 250-260k of memory is leaked.

The culprit seems to be the call to WARN3 in validate_mpeg on line 944. Commenting out this line and recompiling plugs the leak. Seemingly the log entry is not correctly written to stderr so the entry's memory is never freed.

I've included the result of running valgrind with --leak-check=full --leak-resolution=high --num-callers=20

==947== 
==947== Thread 1:
==947== 83,856 bytes in 3,494 blocks are possibly lost in loss record 4 of 7
==947==    at 0x483DD99: calloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==947==    by 0x14ABC1: create_log_entry (log.c:573)
==947==    by 0x14BBEE: log_write (log.c:661)
==947==    by 0x136C2B: validate_mpeg (format_mp3.c:944)
==947==    by 0x138988: mp3_get_no_meta (format_mp3.c:972)
==947==    by 0x122D57: source_read (source.c:671)
==947==    by 0x123520: source_client_read (source.c:797)
==947==    by 0x12B6E3: worker (client.c:759)
==947==    by 0x148FF0: _start_routine (thread.c:725)
==947==    by 0x4E89608: start_thread (pthread_create.c:477)
==947==    by 0x4FC3132: clone (clone.S:95)
==947== 
==947== 153,120 (24 direct, 153,096 indirect) bytes in 1 blocks are definitely lost in loss record 6 of 7
==947==    at 0x483DD99: calloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==947==    by 0x14ABC1: create_log_entry (log.c:573)
==947==    by 0x14BBEE: log_write (log.c:661)
==947==    by 0x136C2B: validate_mpeg (format_mp3.c:944)
==947==    by 0x138988: mp3_get_no_meta (format_mp3.c:972)
==947==    by 0x122D57: source_read (source.c:671)
==947==    by 0x123520: source_client_read (source.c:797)
==947==    by 0x12B6E3: worker (client.c:759)
==947==    by 0x148FF0: _start_routine (thread.c:725)
==947==    by 0x4E89608: start_thread (pthread_create.c:477)
==947==    by 0x4FC3132: clone (clone.S:95)
==947== 
==947== 377,352 bytes in 3,494 blocks are possibly lost in loss record 7 of 7
==947==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==947==    by 0x14AC14: create_log_entry (log.c:575)
==947==    by 0x14BBEE: log_write (log.c:661)
==947==    by 0x136C2B: validate_mpeg (format_mp3.c:944)
==947==    by 0x138988: mp3_get_no_meta (format_mp3.c:972)
==947==    by 0x122D57: source_read (source.c:671)
==947==    by 0x123520: source_client_read (source.c:797)
==947==    by 0x12B6E3: worker (client.c:759)
==947==    by 0x148FF0: _start_routine (thread.c:725)
==947==    by 0x4E89608: start_thread (pthread_create.c:477)
==947==    by 0x4FC3132: clone (clone.S:95)
==947== 
==947== LEAK SUMMARY:
==947==    definitely lost: 24 bytes in 1 blocks
==947==    indirectly lost: 153,096 bytes in 2,319 blocks
==947==      possibly lost: 461,208 bytes in 6,988 blocks
==947==    still reachable: 128 bytes in 2 blocks
==947==         suppressed: 0 bytes in 0 blocks
==947== Reachable blocks (those to which a pointer was found) are not shown.
==947== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==947== 
==947== Use --track-origins=yes to see where uninitialised values come from
==947== For lists of detected and suppressed errors, rerun with: -s
==947== ERROR SUMMARY: 10 errors from 10 contexts (suppressed: 0 from 0)

Jamie0 commented 1 year ago

Having dug a bit deeper into the code there definitely seems to be a fundamental issue with flushing error/warning logs to stderr.

Jamie0 commented 1 year ago

It seems the issue is caused by the logs never being flushed from memory. A worker thread calls log_write, but as the main thread never logs any type of error, the warnings persist in memory waiting to be flushed indefinitely. We are successfully working around this by reducing the logging level from 2 to 1.

karlheyes commented 1 year ago

just trying to get a handle on this. I'm trying a stderr output and even forcing extra warning messages to cause a leak and it just had the default 100 lines internally. Besides the - error log and level 2.

I'm also interested in the trigger for the message in question. Do you have a stream that shows it?

Jamie0 commented 1 year ago

Closing as unfortuntely I'm not able to reproduce this at all. For context, this issue affected one or two customers at a commercial streaming provider and I could reproduce it before by re-casting their stream (without reencoding) to a test server. Trying again on those same streams no longer seems to cause the leak (and Valgrind doesn't report those leaks) so I think something must have been up with their encoding client generating bad or wrongly padded frames. Nonetheless it's odd that the logs didn't flush and stayed in memory when it happened, so if I spot it again I'll reopen the issue with hopefully more info.

Jamie0 commented 1 day ago

I've managed to run into this again, and indeed the valgrind result from last time was a bit of a red herring.

We run icecast-kh under supervisord (using some code that was forked from AzuraCast before they relicensed), and its configured to write its error logs to /dev/stderr, which is a symlink of /proc/fd/self/2.

If we instead log to a normal file, the issue goes away.

Having a peek in log.c/log_write (in the now slightly old version we're running), the code that frees old logs doesn't run if the log file size doesn't grow.

This code isn't present in the main branch any more, but I can see there is still a call to fstat on the log files, so I'm tentatively re-opening the issue

karlheyes / icecast-kh

Memory leak in log_write/warnings to stderr not flushing #370