profanity-im / profanity

Ncurses based XMPP client
https://profanity-im.github.io/
Other
1.31k stars 183 forks source link

Substantial memory leak on aarch64 and x86_64 architectures #1946

Open kseistrup opened 8 months ago

kseistrup commented 8 months ago

TL;DR: Profanity ate 1 GB swap in less than 15 days

I have profanity running on a RPi4 (aarch64 architecture) that is running ArchLinuxARM. Profanity receives notifications from e.g. cron jobs and casual interpersonal chat, altogether less than a couple of hundred messages daily — some days substantially less.

The RPi4 has 4 GB RAM + 1 GB swap. Usually no swap is used at all, so at first I didn't notice when free -tk reported that used and free swap had swapped places, so to speak: 100% used swap and 0% free. This was when the RPi4 and profanity had been up for slightly less than 15 days.

A ps ux revealed that the profanity process had gobbled up 2.5+ GB RAM, including all of the available 1 GB swap. Taking profanity down released all of the used swap, except ~40 MB.

Expected Behavior

Ideally, profanity should not keep gobbling up RAM.

Current Behavior

Launching profanity again, this is what I noticed (numbers taken from ps ux):

HH:MM     VSZ     RSS
-----··------··------
??:??  188208  163968
18:58  249260  225152
21:42  286228  262016
07:24  346020  321792

HH:MM is wallclock hours and minutes.

I tried the same thing on a x86_64 desktop pc, running ArchLinux:

HH:MM     VSZ     RSS
-----··------··------
??:??   55120   33976
18:13  280880   40264
22:13  316360   77128
07:21  360808  121544

Possible Solution

I haven't got a clue.

Steps to Reproduce (for bugs)

  1. Launch profanity in a terminal emulator
  2. Regularly run e.g. ps ux | rg pro[f]anity to see its increasing memory consumption

Context

N/A

Environment

$ profanity -v
Profanity, version 0.14.0
Copyright (C) 2012 - 2019 James Booth <boothj5web@gmail.com>.
Copyright (C) 2019 - 2023 Michael Vetter <jubalh@iodoru.org>.
License GPLv3+: GNU GPL version 3 or later <https://www.gnu.org/licenses/gpl.html>

This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Build information:
XMPP library: libstrophe
Desktop notification support: Enabled
OTR support: Enabled (libotr 4.1.1)
PGP support: Enabled (libgpgme 1.23.2)
OMEMO support: Enabled
C plugins: Enabled
Python plugins: Enabled (3.11.6)
GTK icons/clipboard: Disabled
GDK Pixbuf: Enabled

Options used when building the package can be seen here:

Cheers.

jubalh commented 3 months ago

I run valgrind like this: G_DEBUG=gc-friendly G_SLICE=always-malloc valgrind --tool=memcheck --track-origins=yes --leak-check=full --leak-resolution=high --num-callers=30 --log-file=profval --show-reachable=yes --suppressions=prof.supp ./profanity

The output: profanity-valgrind-output.txt

UI buffer

Probably the most frequent one:

==23236== 1 bytes in 1 blocks are still reachable in loss record 1 of 3,077
==23236==    at 0x4845794: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==23236==    by 0x530C75D: strdup (in /usr/lib64/libc.so.6)
==23236==    by 0x48A2B2: _create_entry (buffer.c:201)
==23236==    by 0x489E95: _buffer_add (buffer.c:107)
==23236==    by 0x489DDF: buffer_append (buffer.c:95)
==23236==    by 0x468067: win_println_va_internal (window.c:1539)
==23236==    by 0x46826F: win_println (window.c:1565)
==23236==    by 0x47C421: _cons_splash_logo (console.c:2706)
==23236==    by 0x475565: cons_about (console.c:388)
==23236==    by 0x46A622: ui_init (core.c:108)
==23236==    by 0x431BDD: _init (profanity.c:193)
==23236==    by 0x431966: prof_run (profanity.c:98)
==23236==    by 0x4EDAFC: main (main.c:174)

https://github.com/profanity-im/profanity/blob/beeddda5685d401214d08ac65b25df74efaf5c53/src/ui/buffer.c#L201-L205

Should later get freed by buffer_free() in which we call g_slist_free_full(buffer->entries, (GDestroyNotify)_free_entry);. The buffer/window/ui code is a bit hard to read/follow so I'm not yet 100% sure if we call the free function in all scenarios. And if not in one case we are assigning again. But lets not that is using glib data container function.

Maybe we have a mistake here so the more we "write" the more the needed memory will grow because somehow we don't free.

autocomplete

==23236== 3 bytes in 1 blocks are still reachable in loss record 64 of 3,077
==23236==    at 0x4845794: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==23236==    by 0x530C75D: strdup (in /usr/lib64/libc.so.6)
==23236==    by 0x4C5C33: autocomplete_add (autocomplete.c:175)
==23236==    by 0x4B3FD4: cmd_ac_init (cmd_ac.c:529)
==23236==    by 0x4966E4: cmd_init (cmd_defs.c:2851)
==23236==    by 0x431C3D: _init (profanity.c:200)
==23236==    by 0x431966: prof_run (profanity.c:98)
==23236==    by 0x4EDAFC: main (main.c:174)

https://github.com/profanity-im/profanity/blob/beeddda5685d401214d08ac65b25df74efaf5c53/src/tools/autocomplete.c#L163-L178

In line 175 we again use strdup() and add it to a glib container. It is freed via:

https://github.com/profanity-im/profanity/blob/beeddda5685d401214d08ac65b25df74efaf5c53/src/tools/autocomplete.c#L78

When reading the surrounding code it looks to me like the free function will be called in appropriate places. Again its a glib container.

ncurses

==23248== 9 bytes in 1 blocks are still reachable in loss record 729 of 3,099
==23248==    at 0x4845794: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==23248==    by 0x530C75D: strdup (in /usr/lib64/libc.so.6)
==23248==    by 0x5127295: ??? (in /usr/lib64/libtinfo.so.6.5)
==23248==    by 0x512BC4F: _nc_tiparm (in /usr/lib64/libtinfo.so.6.5)
==23248==    by 0x50F55A4: newterm_sp (in /usr/lib64/libncursesw.so.6.5)
==23248==    by 0x50F60A5: newterm (in /usr/lib64/libncursesw.so.6.5)
==23248==    by 0x50ED356: initscr (in /usr/lib64/libncursesw.so.6.5)
==23248==    by 0x46A5BD: ui_init (core.c:95)
==23248==    by 0x431BDD: _init (profanity.c:193)
==23248==    by 0x431966: prof_run (profanity.c:98)
==23248==    by 0x4EDAFC: main (main.c:174)

Which is confusing. We use the same setup/teardown for ncurses like https://gist.github.com/DesmondHayes/0c80af2654fc3ec602f9 for example.

I went on to get the ncurses tarball and see if they ship examples. They have a couple of tests and I ran valgrind on test/firstlast which gave me leaks as well? Still need to follow up on this post.

libstrophe / ssl

==23248== 9 bytes in 1 blocks are still reachable in loss record 735 of 3,099
==23248==    at 0x4845794: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==23248==    by 0x5929EA5: CRYPTO_strdup (in /usr/lib64/glibc-hwcaps/x86-64-v3/libcrypto.so.3.1.4)
==23248==    by 0x589602C: ??? (in /usr/lib64/glibc-hwcaps/x86-64-v3/libcrypto.so.3.1.4)
==23248==    by 0x5896143: CONF_module_add (in /usr/lib64/glibc-hwcaps/x86-64-v3/libcrypto.so.3.1.4)
==23248==    by 0x58961A1: OPENSSL_load_builtin_modules (in /usr/lib64/glibc-hwcaps/x86-64-v3/libcrypto.so.3.1.4)
==23248==    by 0x58961F8: ??? (in /usr/lib64/glibc-hwcaps/x86-64-v3/libcrypto.so.3.1.4)
==23248==    by 0x52F9E7E: __pthread_once_slow (in /usr/lib64/libc.so.6)
==23248==    by 0x59341B8: CRYPTO_THREAD_run_once (in /usr/lib64/glibc-hwcaps/x86-64-v3/libcrypto.so.3.1.4)
==23248==    by 0x5896466: CONF_modules_load (in /usr/lib64/glibc-hwcaps/x86-64-v3/libcrypto.so.3.1.4)
==23248==    by 0x5896BE9: CONF_modules_load_file_ex (in /usr/lib64/glibc-hwcaps/x86-64-v3/libcrypto.so.3.1.4)
==23248==    by 0x592CAB3: ??? (in /usr/lib64/glibc-hwcaps/x86-64-v3/libcrypto.so.3.1.4)
==23248==    by 0x52F9E7E: __pthread_once_slow (in /usr/lib64/libc.so.6)
==23248==    by 0x59341B8: CRYPTO_THREAD_run_once (in /usr/lib64/glibc-hwcaps/x86-64-v3/libcrypto.so.3.1.4)
==23248==    by 0x59301F5: OPENSSL_init_crypto (in /usr/lib64/glibc-hwcaps/x86-64-v3/libcrypto.so.3.1.4)
==23248==    by 0x56C7340: OPENSSL_init_ssl (in /usr/lib64/glibc-hwcaps/x86-64-v3/libssl.so.3.1.4)
==23248==    by 0x5159D4F: xmpp_initialize (in /usr/lib64/libstrophe.so.0.13.1)
==23248==    by 0x439631: connection_init (connection.c:131)
==23248==    by 0x438514: session_init (session.c:106)
==23248==    by 0x431C38: _init (profanity.c:199)
==23248==    by 0x431966: prof_run (profanity.c:98)
==23248==    by 0x4EDAFC: main (main.c:174)
==23248== 16 bytes in 1 blocks are still reachable in loss record 1,075 of 3,099
==23248==    at 0x4845794: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==23248==    by 0x43958A: _xmalloc (connection.c:102)
==23248==    by 0x515DBD9: xmpp_conn_new (in /usr/lib64/libstrophe.so.0.13.1)
==23248==    by 0x43974E: connection_init (connection.c:153)
==23248==    by 0x438514: session_init (session.c:106)
==23248==    by 0x431C38: _init (profanity.c:199)
==23248==    by 0x431966: prof_run (profanity.c:98)
==23248==    by 0x4EDAFC: main (main.c:174)

Something for @sjaeckel ?

theme

==23248== 18 bytes in 1 blocks are still reachable in loss record 1,092 of 3,099
==23248==    at 0x4845794: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==23248==    by 0x530C75D: strdup (in /usr/lib64/libc.so.6)
==23248==    by 0x4D0C0E: theme_init (theme.c:84)
==23248==    by 0x431BCC: _init (profanity.c:190)
==23248==    by 0x431966: prof_run (profanity.c:98)
==23248==    by 0x4EDAFC: main (main.c:174)

https://github.com/profanity-im/profanity/blob/beeddda5685d401214d08ac65b25df74efaf5c53/src/config/theme.c#L80-L84

Again a glib container and we pass the relevant free functions to g_hash_table_new_full()

readline

==23248== 228 bytes in 140 blocks are still reachable in loss record 2,918 of 3,099
==23248==    at 0x4845794: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==23248==    by 0x50BA488: xmalloc (in /usr/lib64/libreadline.so.8.2)
==23248==    by 0x5098534: rl_macro_bind (in /usr/lib64/libreadline.so.8.2)
==23248==    by 0x50A8C3B: rl_parse_and_bind (in /usr/lib64/libreadline.so.8.2)
==23248==    by 0x50A8DDC: ??? (in /usr/lib64/libreadline.so.8.2)
==23248==    by 0x50A9099: ??? (in /usr/lib64/libreadline.so.8.2)
==23248==    by 0x50A8A22: rl_parse_and_bind (in /usr/lib64/libreadline.so.8.2)
==23248==    by 0x50A8DDC: ??? (in /usr/lib64/libreadline.so.8.2)
==23248==    by 0x50A8F88: rl_read_init_file (in /usr/lib64/libreadline.so.8.2)
==23248==    by 0x50A2CA9: rl_initialize (in /usr/lib64/libreadline.so.8.2)
==23248==    by 0x50B5D78: ??? (in /usr/lib64/libreadline.so.8.2)
==23248==    by 0x472625: create_input_window (inputwin.c:163)
==23248==    by 0x46A613: ui_init (core.c:105)
==23248==    by 0x431BDD: _init (profanity.c:193)
==23248==    by 0x431966: prof_run (profanity.c:98)
==23248==    by 0x4EDAFC: main (main.c:174)

Again I can't see where we use readline the wrong way.

omemo

==26071== 10,136 bytes in 1 blocks are still reachable in loss record 765 of 779
==26071==    at 0x4845794: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==26071==    by 0x43958A: _xmalloc (connection.c:102)
==26071==    by 0x515AD01: ??? (in /usr/lib64/libstrophe.so.0.13.1)
==26071==    by 0x515AE93: xmpp_send_raw_string (in /usr/lib64/libstrophe.so.0.13.1)
==26071==    by 0x441F0E: iq_send_stanza (iq.c:2572)
==26071==    by 0x4EBD20: omemo_bundle_publish (omemo.c:155)
==26071==    by 0x4E5C07: omemo_publish_crypto_materials (omemo.c:339)                                                
==26071==    by 0x460407: sv_ev_connection_features_received (server_events.c:202)
==26071==    by 0x43AA87: connection_features_received (connection.c:779)
==26071==    by 0x4418C9: _disco_info_response_id_handler_onconnect (iq.c:2423)
==26071==    by 0x43B9F1: _iq_handler (iq.c:241) 
==26071==    by 0x5163848: ??? (in /usr/lib64/libstrophe.so.0.13.1)
==26071==    by 0x516A224: ??? (in /usr/lib64/libstrophe.so.0.13.1)
==26071==    by 0x5E4FE43: ??? (in /usr/lib64/libxml2.so.2.12.8)
==26071==    by 0x5E54927: xmlParseChunk (in /usr/lib64/libxml2.so.2.12.8)
==26071==    by 0x5163450: xmpp_run_once (in /usr/lib64/libstrophe.so.0.13.1)
==26071==    by 0x439797: connection_check_events (connection.c:162)
==26071==    by 0x43894E: session_process_events (session.c:256)
==26071==    by 0x4319FF: prof_run (profanity.c:128)
==26071==    by 0x4EDAFC: main (main.c:174)

pgp

==26119== 32 bytes in 1 blocks are still reachable in loss record 2,541 of 3,077
==26119==    at 0x484CA2F: calloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==26119==    by 0x4B2D1D5: g_malloc0 (in /usr/lib64/libglib-2.0.so.0.8000.3)
==26119==    by 0x4B10DCE: ??? (in /usr/lib64/libglib-2.0.so.0.8000.3)
==26119==    by 0x4B11C35: g_hash_table_new_full (in /usr/lib64/libglib-2.0.so.0.8000.3)
==26119==    by 0x4E1A1C: p_gpg_init (gpg.c:121)                                                                      
==26119==    by 0x431C60: _init (profanity.c:209)
==26119==    by 0x431966: prof_run (profanity.c:98)
==26119==    by 0x4EDAFC: main (main.c:174)

Allocated: https://github.com/profanity-im/profanity/blob/beeddda5685d401214d08ac65b25df74efaf5c53/src/pgp/gpg.c#L121

And freed: https://github.com/profanity-im/profanity/blob/beeddda5685d401214d08ac65b25df74efaf5c53/src/pgp/gpg.c#L131-L137

Again glib container

muc

==26119== 32 bytes in 1 blocks are still reachable in loss record 2,539 of 3,077
==26119==    at 0x484CA2F: calloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==26119==    by 0x4B2D1D5: g_malloc0 (in /usr/lib64/libglib-2.0.so.0.8000.3)
==26119==    by 0x4B10DCE: ??? (in /usr/lib64/libglib-2.0.so.0.8000.3)
==26119==    by 0x4B11C35: g_hash_table_new_full (in /usr/lib64/libglib-2.0.so.0.8000.3)
==26119==    by 0x4323BE: muc_init (muc.c:100)
==26119==    by 0x431C51: _init (profanity.c:202)
==26119==    by 0x431966: prof_run (profanity.c:98)
==26119==    by 0x4EDAFC: main (main.c:174)

https://github.com/profanity-im/profanity/blob/beeddda5685d401214d08ac65b25df74efaf5c53/src/xmpp/muc.c#L94-L101

Again glib container with g_hash_table_new_full.

That's enough for now. Lets try to fix those and then continue. What I'm wondering: do we really have so many mistakes in the code where we don't call the free functions? Or are we using the glib containers the wrong way and thus those free functions are never called? Does glib have memleaks (lets create a test program)?