troglobit / pimd

PIM-SM/SSM multicast routing for UNIX and Linux
http://troglobit.com/projects/pimd/
BSD 3-Clause "New" or "Revised" License
194 stars 86 forks source link

Memory leak in PIMD on 2.3.2 #137

Closed masaraksh79 closed 5 years ago

masaraksh79 commented 5 years ago

Hi,

After issue #79 was knocked down(?) we have came back to the system after a while and found that it does introduce a tiny memory leak, growing the pimd, which would eventually cause an OOM and a sporadic reboot. We did not yet check if the release before the fix did experience this as well.

Further, we have performed an exercise with unit on a bench and replayed the capture taken from the customer's network (doesn't seem to happen if we just ran a video-cast over a link).

Replayed the capture as: tcpreplay -p 5 --intf1=enp0s8 --loop=0 --enable-file-cache mc.pcap

We have replayed the 6.1MB mc.pcap (can share if asked for) while running the pimd with valgrind. In our few hour ish test, device's free memory has managed to go down by 1MB and that was also the addition to the pimd-valgrind wrapped process size. Suspect the leak is by 16 bytes each time the capture is fed through...

The valgrind's output is down below:

< VALGRIND OUTPUT START>

/bin/valgrind --leak-check=full --show-leak-kinds=all /sbin/pimd -f -c /mmx/conf/pimd.conf ==29172== Memcheck, a memory error detector ==29172== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al. ==29172== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info ==29172== Command: /sbin/pimd -f -c /mmx/conf/pimd.conf ==29172== ==29172== Conditional jump or move depends on uninitialised value(s) ==29172== at 0x400B314: ??? (in /lib/ld-2.12.2.so) ==29172== by 0xFFFFFFFF: ??? ==29172== ==29172== Conditional jump or move depends on uninitialised value(s) ==29172== at 0x400B350: ??? (in /lib/ld-2.12.2.so) ==29172== by 0xFFFFFFFF: ??? ==29172== ==29172== Conditional jump or move depends on uninitialised value(s) ==29172== at 0x400C0D4: ??? (in /lib/ld-2.12.2.so) ==29172== by 0xFFFFFFFF: ??? ==29172== ==29172== Conditional jump or move depends on uninitialised value(s) ==29172== at 0x400B1CC: ??? (in /lib/ld-2.12.2.so) ==29172== by 0xFFFFFFFF: ??? ==29172== ==29172== Conditional jump or move depends on uninitialised value(s) ==29172== at 0x400B1D4: ??? (in /lib/ld-2.12.2.so) ==29172== by 0xFFFFFFFF: ??? ==29172== ==29172== Conditional jump or move depends on uninitialised value(s) ==29172== at 0x40163CC: ??? (in /lib/ld-2.12.2.so) ==29172==

==29172== Invalid read of size 1 ==29172== at 0xBC0C: SwitchVersion (igmp_proto.c:543) ==29172== by 0x17DCF: age_callout_queue (callout.c:94) ==29172== by 0x11353: main (main.c:669) ==29172== Address 0x49f9a68 is 24 bytes inside a block of size 44 free'd ==29172== at 0x4833D30: free (vg_replace_malloc.c:529) ==29172== by 0xC82F: DelVif (igmp_proto.c:795) ==29172== by 0x17DCF: age_callout_queue (callout.c:94) ==29172== by 0x11353: main (main.c:669) ==29172== Block was alloc'd at ==29172== at 0x4832D4C: calloc (vg_replace_malloc.c:710) ==29172== by 0xB40F: accept_group_report (igmp_proto.c:372) ==29172== by 0x9B6B: accept_igmp (igmp.c:225) ==29172== by 0x9857: igmp_read (igmp.c:134) ==29172== by 0x1123F: main (main.c:627) ==29172== ==29172== Invalid read of size 1 ==29172== at 0xBC28: SwitchVersion (igmp_proto.c:544) ==29172== by 0x17DCF: age_callout_queue (callout.c:94) ==29172== by 0x11353: main (main.c:669) ==29172== Address 0x49f9a68 is 24 bytes inside a block of size 44 free'd ==29172== at 0x4833D30: free (vg_replace_malloc.c:529) ==29172== by 0xC82F: DelVif (igmp_proto.c:795) ==29172== by 0x17DCF: age_callout_queue (callout.c:94) ==29172== by 0x11353: main (main.c:669) ==29172== Block was alloc'd at ==29172== at 0x4832D4C: calloc (vg_replace_malloc.c:710) ==29172== by 0xB40F: accept_group_report (igmp_proto.c:372) ==29172== by 0x9B6B: accept_igmp (igmp.c:225) ==29172== by 0x9857: igmp_read (igmp.c:134) ==29172== by 0x1123F: main (main.c:627) ==29172== ==29172== Invalid write of size 1 ==29172== at 0xBC34: SwitchVersion (igmp_proto.c:544) ==29172== by 0x17DCF: age_callout_queue (callout.c:94) ==29172== by 0x11353: main (main.c:669) ==29172== Address 0x49f9a68 is 24 bytes inside a block of size 44 free'd ==29172== at 0x4833D30: free (vg_replace_malloc.c:529) ==29172== by 0xC82F: DelVif (igmp_proto.c:795) ==29172== by 0x17DCF: age_callout_queue (callout.c:94) ==29172== by 0x11353: main (main.c:669) ==29172== Block was alloc'd at ==29172== at 0x4832D4C: calloc (vg_replace_malloc.c:710) ==29172== by 0xB40F: accept_group_report (igmp_proto.c:372) ==29172== by 0x9B6B: accept_igmp (igmp.c:225) ==29172== by 0x9857: igmp_read (igmp.c:134) ==29172== by 0x1123F: main (main.c:627) ==29172== ==29172== Invalid read of size 1 ==29172== at 0xBC40: SwitchVersion (igmp_proto.c:547) ==29172== by 0x17DCF: age_callout_queue (callout.c:94) ==29172== by 0x11353: main (main.c:669) ==29172== Address 0x49f9a68 is 24 bytes inside a block of size 44 free'd ==29172== at 0x4833D30: free (vg_replace_malloc.c:529) ==29172== by 0xC82F: DelVif (igmp_proto.c:795) ==29172== by 0x17DCF: age_callout_queue (callout.c:94) ==29172== by 0x11353: main (main.c:669) ==29172== Block was alloc'd at ==29172== at 0x4832D4C: calloc (vg_replace_malloc.c:710) ==29172== by 0xB40F: accept_group_report (igmp_proto.c:372) ==29172== by 0x9B6B: accept_igmp (igmp.c:225) ==29172== by 0x9857: igmp_read (igmp.c:134) ==29172== by 0x1123F: main (main.c:627) ==29172== ==29172== Invalid read of size 4 ==29172== at 0xBC50: SwitchVersion (igmp_proto.c:546) ==29172== by 0x17DCF: age_callout_queue (callout.c:94) ==29172== by 0x11353: main (main.c:669) ==29172== Address 0x49f9a54 is 4 bytes inside a block of size 44 free'd ==29172== at 0x4833D30: free (vg_replace_malloc.c:529) ==29172== by 0xC82F: DelVif (igmp_proto.c:795) ==29172== by 0x17DCF: age_callout_queue (callout.c:94) ==29172== by 0x11353: main (main.c:669) ==29172== Block was alloc'd at ==29172== at 0x4832D4C: calloc (vg_replace_malloc.c:710) ==29172== by 0xB40F: accept_group_report (igmp_proto.c:372) ==29172== by 0x9B6B: accept_igmp (igmp.c:225) ==29172== by 0x9857: igmp_read (igmp.c:134) ==29172== by 0x1123F: main (main.c:627) ==29172==

< MANUALLY INTERRUPTING EXECUTION>

==29172== HEAP SUMMARY: ==29172== in use at exit: 531,106 bytes in 128 blocks ==29172== total heap usage: 10,353 allocs, 10,225 frees, 764,819 bytes allocated ==29172== ==29172== 16 bytes in 1 blocks are still reachable in loss record 1 of 29 ==29172== at 0x4832D4C: calloc (vg_replace_malloc.c:710) ==29172== by 0x2DB7B: add_cand_rp (rp.c:195) ==29172== by 0x2DFAF: add_rp_grp_entry (rp.c:335) ==29172== by 0x10E7F: main (main.c:575) ==29172== ==29172== 18 bytes in 1 blocks are still reachable in loss record 2 of 29 ==29172== at 0x483522C: malloc (vg_replace_malloc.c:298) ==29172== by 0x48A13B3: vasprintf (in /lib/libc-2.12.2.so) ==29172== ==29172== 20 bytes in 1 blocks are still reachable in loss record 3 of 29 ==29172== at 0x4832D4C: calloc (vg_replace_malloc.c:710) ==29172== by 0x14367: parse_rp_address (config.c:989) ==29172== by 0x15767: config_vifs_from_file (config.c:1539) ==29172== by 0x1AC6B: init_vifs (vif.c:116) ==29172== by 0x1093B: main (main.c:480) ==29172== ==29172== 20 bytes in 1 blocks are still reachable in loss record 4 of 29 ==29172== at 0x4832D4C: calloc (vg_replace_malloc.c:710) ==29172== by 0x17EA7: timer_setTimer (callout.c:136) ==29172== by 0x113BB: timer (main.c:710) ==29172== by 0x17DCF: age_callout_queue (callout.c:94) ==29172== by 0x11353: main (main.c:669) ==29172== ==29172== 20 bytes in 1 blocks are still reachable in loss record 5 of 29 ==29172== at 0x4832D4C: calloc (vg_replace_malloc.c:710) ==29172== by 0x17EA7: timer_setTimer (callout.c:136) ==29172== by 0x116AB: resetlogging (main.c:838) ==29172== by 0x17DCF: age_callout_queue (callout.c:94) ==29172== by 0x11353: main (main.c:669) ==29172== ==29172== 28 bytes in 1 blocks are still reachable in loss record 6 of 29 ==29172== at 0x4832D4C: calloc (vg_replace_malloc.c:710) ==29172== by 0x2DE33: add_grp_mask (rp.c:261) ==29172== by 0x2DF8B: add_rp_grp_entry (rp.c:316) ==29172== by 0x10E7F: main (main.c:575) ==29172== ==29172== 36 bytes in 1 blocks are still reachable in loss record 7 of 29 ==29172== at 0x4832D4C: calloc (vg_replace_malloc.c:710) ==29172== by 0x1DF47: init_pim_mrt (mrt.c:84) ==29172== by 0x1092B: main (main.c:472) ==29172== ==29172== 36 bytes in 1 blocks are still reachable in loss record 8 of 29 ==29172== at 0x4832D4C: calloc (vg_replace_malloc.c:710) ==29172== by 0x2E12B: add_rp_grp_entry (rp.c:387) ==29172== by 0x10E7F: main (main.c:575) ==29172== ==29172== 40 bytes in 1 blocks are still reachable in loss record 9 of 29 ==29172== at 0x4832D4C: calloc (vg_replace_malloc.c:710) ==29172== by 0x1DE6B: init_pim_mrt (mrt.c:67) ==29172== by 0x1092B: main (main.c:472) ==29172== ==29172== 40 bytes in 1 blocks are still reachable in loss record 10 of 29 ==29172== at 0x4832D4C: calloc (vg_replace_malloc.c:710) ==29172== by 0x2DC0F: add_cand_rp (rp.c:208) ==29172== by 0x2DFAF: add_rp_grp_entry (rp.c:335) ==29172== by 0x10E7F: main (main.c:575) ==29172== ==29172== 40 bytes in 2 blocks are still reachable in loss record 11 of 29 ==29172== at 0x4832D4C: calloc (vg_replace_malloc.c:710) ==29172== by 0xC96B: SetTimer (igmp_proto.c:831) ==29172== by 0xB0E7: accept_group_report (igmp_proto.c:315) ==29172== by 0xC06F: accept_membership_report (igmp_proto.c:643) ==29172== by 0x9BDB: accept_igmp (igmp.c:237) ==29172== by 0x9857: igmp_read (igmp.c:134) ==29172== by 0x1123F: main (main.c:627) ==29172== ==29172== 40 bytes in 2 blocks are still reachable in loss record 12 of 29 ==29172== at 0x4832D4C: calloc (vg_replace_malloc.c:710) ==29172== by 0x17EA7: timer_setTimer (callout.c:136) ==29172== by 0xCA1B: SetTimer (igmp_proto.c:844) ==29172== by 0xB0E7: accept_group_report (igmp_proto.c:315) ==29172== by 0xC06F: accept_membership_report (igmp_proto.c:643) ==29172== by 0x9BDB: accept_igmp (igmp.c:237) ==29172== by 0x9857: igmp_read (igmp.c:134) ==29172== by 0x1123F: main (main.c:627) ==29172== ==29172== 44 bytes in 1 blocks are still reachable in loss record 13 of 29 ==29172== at 0x4832D4C: calloc (vg_replace_malloc.c:710) ==29172== by 0xB40F: accept_group_report (igmp_proto.c:372) ==29172== by 0xC0C7: accept_membership_report (igmp_proto.c:658) ==29172== by 0x9BDB: accept_igmp (igmp.c:237) ==29172== by 0x9857: igmp_read (igmp.c:134) ==29172== by 0x1123F: main (main.c:627) ==29172== ==29172== 44 bytes in 1 blocks are still reachable in loss record 14 of 29 ==29172== at 0x4832D4C: calloc (vg_replace_malloc.c:710) ==29172== by 0xAA2F: accept_membership_query (igmp_proto.c:193) ==29172== by 0x9B4F: accept_igmp (igmp.c:220) ==29172== by 0x9857: igmp_read (igmp.c:134) ==29172== by 0x1123F: main (main.c:627) ==29172== ==29172== 80 bytes in 4 blocks are still reachable in loss record 15 of 29 ==29172== at 0x4832D4C: calloc (vg_replace_malloc.c:710) ==29172== by 0xC96B: SetTimer (igmp_proto.c:831) ==29172== by 0xB0E7: accept_group_report (igmp_proto.c:315) ==29172== by 0x9B6B: accept_igmp (igmp.c:225) ==29172== by 0x9857: igmp_read (igmp.c:134) ==29172== by 0x1123F: main (main.c:627) ==29172== ==29172== 80 bytes in 4 blocks are still reachable in loss record 16 of 29 ==29172== at 0x4832D4C: calloc (vg_replace_malloc.c:710) ==29172== by 0x17EA7: timer_setTimer (callout.c:136) ==29172== by 0xCA1B: SetTimer (igmp_proto.c:844) ==29172== by 0xB0E7: accept_group_report (igmp_proto.c:315) ==29172== by 0x9B6B: accept_igmp (igmp.c:225) ==29172== by 0x9857: igmp_read (igmp.c:134) ==29172== by 0x1123F: main (main.c:627) ==29172== ==29172== 88 bytes in 2 blocks are still reachable in loss record 17 of 29 ==29172== at 0x4832D4C: calloc (vg_replace_malloc.c:710) ==29172== by 0xB40F: accept_group_report (igmp_proto.c:372) ==29172== by 0xC06F: accept_membership_report (igmp_proto.c:643) ==29172== by 0x9BDB: accept_igmp (igmp.c:237) ==29172== by 0x9857: igmp_read (igmp.c:134) ==29172== by 0x1123F: main (main.c:627) ==29172== ==29172== 96 bytes in 3 blocks are still reachable in loss record 18 of 29 ==29172== at 0x4832D4C: calloc (vg_replace_malloc.c:710) ==29172== by 0x2241F: receive_pim_hello (pim_proto.c:162) ==29172== by 0x21313: accept_pim (pim.c:178) ==29172== by 0x210DB: pim_read (pim.c:130) ==29172== by 0x1123F: main (main.c:627) ==29172== ==29172== 120 bytes in 6 blocks are still reachable in loss record 19 of 29 ==29172== at 0x4832D4C: calloc (vg_replace_malloc.c:710) ==29172== by 0xC8B7: SetVersionTimer (igmp_proto.c:811) ==29172== by 0xB177: accept_group_report (igmp_proto.c:323) ==29172== by 0x9B6B: accept_igmp (igmp.c:225) ==29172== by 0x9857: igmp_read (igmp.c:134) ==29172== by 0x1123F: main (main.c:627) ==29172== ==29172== 120 bytes in 6 blocks are still reachable in loss record 20 of 29 ==29172== at 0x4832D4C: calloc (vg_replace_malloc.c:710) ==29172== by 0x17EA7: timer_setTimer (callout.c:136) ==29172== by 0xC923: SetVersionTimer (igmp_proto.c:820) ==29172== by 0xB177: accept_group_report (igmp_proto.c:323) ==29172== by 0x9B6B: accept_igmp (igmp.c:225) ==29172== by 0x9857: igmp_read (igmp.c:134) ==29172== by 0x1123F: main (main.c:627) ==29172== ==29172== 132 bytes in 3 blocks are still reachable in loss record 21 of 29 ==29172== at 0x4832D4C: calloc (vg_replace_malloc.c:710) ==29172== by 0xB40F: accept_group_report (igmp_proto.c:372) ==29172== by 0x9B6B: accept_igmp (igmp.c:225) ==29172== by 0x9857: igmp_read (igmp.c:134) ==29172== by 0x1123F: main (main.c:627) ==29172== ==29172== 540 bytes in 27 blocks are definitely lost in loss record 22 of 29 ==29172== at 0x4832D4C: calloc (vg_replace_malloc.c:710) ==29172== by 0xC8B7: SetVersionTimer (igmp_proto.c:811) ==29172== by 0xB68F: accept_group_report (igmp_proto.c:414) ==29172== by 0x9B6B: accept_igmp (igmp.c:225) ==29172== by 0x9857: igmp_read (igmp.c:134) ==29172== by 0x1123F: main (main.c:627) ==29172== ==29172== 1,020 bytes in 51 blocks are definitely lost in loss record 23 of 29 ==29172== at 0x4832D4C: calloc (vg_replace_malloc.c:710) ==29172== by 0xC8B7: SetVersionTimer (igmp_proto.c:811) ==29172== by 0xB177: accept_group_report (igmp_proto.c:323) ==29172== by 0x9B6B: accept_igmp (igmp.c:225) ==29172== by 0x9857: igmp_read (igmp.c:134) ==29172== by 0x1123F: main (main.c:627) ==29172== ==29172== 2,048 bytes in 1 blocks are definitely lost in loss record 24 of 29 ==29172== at 0x4832D4C: calloc (vg_replace_malloc.c:710) ==29172== by 0x11703: config_vifs_from_kernel (config.c:104) ==29172== by 0x1ABDB: init_vifs (vif.c:107) ==29172== by 0x1093B: main (main.c:480) ==29172== ==29172== 2,052 bytes in 1 blocks are still reachable in loss record 25 of 29 ==29172== at 0x4832D4C: calloc (vg_replace_malloc.c:710) ==29172== by 0x1542B: config_vifs_from_file (config.c:1441) ==29172== by 0x1AC6B: init_vifs (vif.c:116) ==29172== by 0x1093B: main (main.c:480) ==29172== ==29172== 131,072 bytes in 1 blocks are still reachable in loss record 26 of 29 ==29172== at 0x4832D4C: calloc (vg_replace_malloc.c:710) ==29172== by 0x9527: init_igmp (igmp.c:75) ==29172== by 0x1091F: main (main.c:469) ==29172== ==29172== 131,072 bytes in 1 blocks are still reachable in loss record 27 of 29 ==29172== at 0x4832D4C: calloc (vg_replace_malloc.c:710) ==29172== by 0x9543: init_igmp (igmp.c:76) ==29172== by 0x1091F: main (main.c:469) ==29172== ==29172== 131,072 bytes in 1 blocks are still reachable in loss record 28 of 29 ==29172== at 0x4832D4C: calloc (vg_replace_malloc.c:710) ==29172== by 0x20E83: init_pim (pim.c:85) ==29172== by 0x10923: main (main.c:470) ==29172== ==29172== 131,072 bytes in 1 blocks are still reachable in loss record 29 of 29 ==29172== at 0x4832D4C: calloc (vg_replace_malloc.c:710) ==29172== by 0x20E9F: init_pim (pim.c:86) ==29172== by 0x10923: main (main.c:470) ==29172== ==29172== LEAK SUMMARY: ==29172== definitely lost: 3,608 bytes in 79 blocks ==29172== indirectly lost: 0 bytes in 0 blocks ==29172== possibly lost: 0 bytes in 0 blocks ==29172== still reachable: 527,498 bytes in 49 blocks ==29172== suppressed: 0 bytes in 0 blocks ==29172== ==29172== For counts of detected and suppressed errors, rerun with: -v ==29172== Use --track-origins=yes to see where uninitialised values come from ==29172== ERROR SUMMARY: 292 errors from 14 contexts (suppressed: 0 from 0)

< VALGRIND OUTPUT END>

the following two are the most relevant bits from out point of view:

540 bytes in 27 blocks are definitely lost in loss record 22 of 29 ==29172== at 0x4832D4C: calloc (vg_replace_malloc.c:710) ==29172== by 0xC8B7: SetVersionTimer (igmp_proto.c:811) ==29172== by 0xB68F: accept_group_report (igmp_proto.c:414) ==29172== by 0x9B6B: accept_igmp (igmp.c:225) ==29172== by 0x9857: igmp_read (igmp.c:134) ==29172== by 0x1123F: main (main.c:627) ==29172== ==29172== 1,020 bytes in 51 blocks are definitely lost in loss record 23 of 29 ==29172== at 0x4832D4C: calloc (vg_replace_malloc.c:710) ==29172== by 0xC8B7: SetVersionTimer (igmp_proto.c:811) ==29172== by 0xB177: accept_group_report (igmp_proto.c:323) ==29172== by 0x9B6B: accept_igmp (igmp.c:225) ==29172== by 0x9857: igmp_read (igmp.c:134) ==29172== by 0x1123F: main (main.c:627)

FYI - @searlernz

troglobit commented 5 years ago

Agreed, the latter two leaks seem to be the most relevant ones. The others look like missing cleanup/free that should be done atexit().

The memory is lost in the SetVersionTimer callback. All other timer callbacks do a free() at the end of their argument. How about this fix?

diff --git a/src/igmp_proto.c b/src/igmp_proto.c
index 5d6dd09..573e6b8 100644
--- a/src/igmp_proto.c
+++ b/src/igmp_proto.c
@@ -549,6 +549,8 @@ static void SwitchVersion(void *arg)

     logit(LOG_INFO, 0, "Switch IGMP compatibility mode back to v%d for group %s",
          cbk->g->al_pv, inet_fmt(cbk->g->al_addr, s1, sizeof(s1)));
+
+    free(cbk);
 }
masaraksh79 commented 5 years ago

We have run a test with suggested patch. Unfortunately the leak prevails. We did an additional debugging on the timer management code, here is the dump and valgrind report,

22:49:58.741 Received IGMP v3 Membership Report from 10.200.64.117 to 224.0.0.22 22:49:58.744 accept_membership_report(): IGMP v3 report, 16 bytes, from 10.200.64.117 to 224.0.0.22 with 1 group record. 22:49:58.746 accept_group_report(): igmp_src 10.200.64.117 ssm_src 0.0.0.0 group 239.0.2.1 report_type 34 22:49:58.750 Set delete timer for group: 239.0.2.1 22:49:58.753 find_route: Not PMBR, return NULL 22:49:58.754 Switch IGMP compatibility mode back to v3 for group 239.0.6.11 22:49:58.761 find_route: Not PMBR, return NULL ==13476== Invalid read of size 1 ==13476== at 0xBC0C: SwitchVersion (igmp_proto.c:543) ==13476== by 0x17DD7: age_callout_queue (callout.c:94) ==13476== by 0x1135B: main (main.c:669) ==13476== Address 0x4a0e3d0 is 24 bytes inside a block of size 44 free'd ==13476== at 0x4833D30: free (vg_replace_malloc.c:529) ==13476== by 0xC837: DelVif (igmp_proto.c:797) ==13476== by 0x17DD7: age_callout_queue (callout.c:94) ==13476== by 0x1135B: main (main.c:669) ==13476== Block was alloc'd at ==13476== at 0x4832D4C: calloc (vg_replace_malloc.c:710) ==13476== by 0xB40F: accept_group_report (igmp_proto.c:372) ==13476== by 0x9B6B: accept_igmp (igmp.c:225) ==13476== by 0x9857: igmp_read (igmp.c:134) ==13476== by 0x11247: main (main.c:627)

The leak is 16 bytes each time, as mentioned before. One suspect is the code where it printed "find_route: Not PMBR, return NULL" which might not be freeing up memory as it should ?

troglobit commented 5 years ago

OK, issue reopened. I'll have a look at it later this week. (Busy the next couple of days due to moving)

masaraksh79 commented 5 years ago

kind follow up @troglobit any new findings?

masaraksh79 commented 5 years ago

Another point ... since we see the leak occurs by 16 byte chunks, I've managed to see two data types of this size the dynamic allocation is made for : cand_rp_t & phaddr...

troglobit commented 5 years ago

Sorry, I've been real busy with moving and celebrations during the holidays.

I'll have a look again, but it would really help if I could reproduce the leak here and not just rely on your valgrind run. Btw, is your system 32 bit?

troglobit commented 5 years ago

I've been running the latest master during the day but cannot reproduce your leak. I may need some more information, and possibly your mc.pcap file, to get any further. I'll check with v2.3.2 tomorrow, but from what I can understand, you're not running the release version but a patched v2.3.2 right?

masaraksh79 commented 5 years ago

@troglobit the system architecture is arm v7 (32bit). Attached the mc.pcap, we have replayed it on a single device, don't recall the PIMD config used but I can get it after the 7.1. Git-wise, we used a patched 2.3.2, checked out few years ago and patched last time on Jan 2018. mc.pcap.zip

masaraksh79 commented 5 years ago

re-your comment on attempts to see this bug on the master: we've been running the same code (our patched 2.3.2) but could not reproduce the leak. We were able to only see the leak when this capture was being replayed through.

troglobit commented 5 years ago

Thanks, I'll give it (pcap) a go! :-)

troglobit commented 5 years ago

@masaraksh79 No luck so far, but I started going through your (very good!) suggestions, and I might have found something related to cand_rp_t ...

In rp.c:add_rp_grp_entry() we call add_cand_rp(), which allocates a cand_rp_t, but just below that call, around line 353, there's a for() loop that may return before the previously returned pointer is stored somewhere:

https://github.com/troglobit/pimd/blob/2.3.2/rp.c#L353

From what I can see, the call to add_cand_rp() (and the code directly following and related to that) could be moved after the for() loop to prevent the leak. This code path has changed a bit on master, but I thought it was worth mentioning.

masaraksh79 commented 5 years ago

@troglobit this makes sense, if I read the code correctly we can move lines 335-344 (inclusive) down, right below the for() loop and try to re-leak the daemon )). Would it be worth, alternatively, to switch using the master? PS. Happy NY !!! It is almost here in NZ )

troglobit commented 5 years ago

@masaraksh79 Exactly! It would be greatly appreciated if you could retry with that :smiley:

This bug report actually started me on fixing all the issues reported by Valgrind. At least the ones I can see here in my limited setup. It's a lot of small things, however, mostly related to freeing all memory at exit. Some might be of interest to you guys as well though, I'll try to provide as useful commit messages as possible.

Happy New Year! :fireworks: :champagne: :clinking_glasses:

masaraksh79 commented 5 years ago

@troglobit looks great I shall facilitate Valgrind for apps as well ))

troglobit commented 5 years ago

Awesome news, Valgrind rocks! Then we can close this issue as I've also fixed this on master in b0a0f91 :+1:

masaraksh79 commented 5 years ago

Sorry @troglobit I reckon in my words "looks great" I managed to confuse you badly. Back then, I'd just looked at your code and those changes had looked sensible. However, we have only come back from leave today and tested it but those last changes haven't solved the leak, unfortunately.

I looked at my previous report and the 16 bytes is actually a conclusion we've made (maybe invalid) based on the implanted debug messages. Next, my messag from 21 days ago,

22:49:58.744 accept_membership_report(): IGMP v3 report, 16 bytes, from 10.200.64.117 to 224.0.0.22 with 1 group record. 22:49:58.746 accept_group_report(): igmp_src 10.200.64.117 ssm_src 0.0.0.0 group 239.0.2.1 report_type 34 22:49:58.750 Set delete timer for group: 239.0.2.1 22:49:58.753 find_route: Not PMBR, return NULL 22:49:58.754 Switch IGMP compatibility mode back to v3 for group 239.0.6.11 22:49:58.761 find_route: Not PMBR, return NULL

but from this I am not allowed to conclude the leak is 16 bytes. We might have mislead ourselves here.

So, I guess we'll try to do more debugging from our end. From your end, I thought of reasons you are not reproducing this leak:

pimd.conf.zip

troglobit commented 5 years ago

OK, let's reopen this then. Thank you for following up.

Hopefully you find something with debugging on your end, otherwise I'd very much appreciate if you could try the latest master. It's got lots of memory leak fixes ... I don't want us to be chasing ghosts.

Note: master is a lot different wrt command line options, this is partly to alert users of #93 ...

masaraksh79 commented 5 years ago

@troglobit for us using the master though make sense long term might not fly in short term. Will depend on if we can point fix it on not. Our acceptance test facilities aren't available and we're facing a customer's system directly without much of local lab equipment to play with. Therefore, I prefer as less changes as possible to the existing code considering master branch as the last resort.

Re: your side debugging, if you like: change to the capture content is needed. I wasn't clear enough, all various source addresses in the capture must be rewritten to have a single source IP of your device IP (and MAC address). We just re-run the test on a single device. Very simple. The command we used is,

tcprewrite --infile=mc.pcap --outfile=mc1.pcap --srcipmap=0.0.0.0/0:10.200.64.242 --enet-smac=06:16:26:36:56:66

troglobit commented 5 years ago

@masaraksh79 Fully understand those practical limitations when you're in this phase of your project. It's just that in my sole capacity as maintainer I cannot support older releases, simply because I don't have the means to do so, so unless it's the same issue on master, you might have to track down this one on your own. Sorry.

Unfortunately I'm back full time at my regular job as of today, but I'll try to get some time over this week for my side projects in general, and pimd in particular to do some more testing, and try to replicate the issue on master with your latest input. I'd really like to close this one since (if it's an issue on master) it's a blocker for the 3.0 release which has been pending for almost three years now.

masaraksh79 commented 5 years ago

Sure. No worries. Thanks ahead

masaraksh79 commented 5 years ago

We were wrong and our test with valgrind was confusing. The patches suggested do fix the issue. We ran it on customer's system for a week and .... no leak, cheers! Thanks again, @troglobit !

troglobit commented 5 years ago

@masaraksh79 Wow, cool! Thank you so much for testing and verifying this! :fireworks: :smiley: