troglobit / pimd

PIM-SM/SSM multicast routing for UNIX and Linux
http://troglobit.com/projects/pimd/
BSD 3-Clause "New" or "Revised" License

Issue #22 does not seem to be resolved #37

Closed. mfspeer closed this issue 9 years ago

mfspeer commented 9 years ago

I downloaded and compiled 2.2.0 and set it up in a triangle topology of three routers, with the top node of the triangle configured as the RP. I start traffic and wait for it to switch to the shortest path tree (the first-hop router and last-hop router are the left-hand and right-hand nodes of the triangle, respectively). I then simulate a link-down event between those two nodes, and I get the same crash and stack trace previously reported for this issue:

#0 0x0805beec in add_jp_entry () 
#1 0x0805739e in age_routes () 
#2 0x0804e670 in timer () 
#3 0x080545d1 in age_callout_queue () 
#4 0x0804ee0f in main () 

mfspeer commented 9 years ago

Here's the stack trace from my crash:

[New process 93495    ]
#0  0x0806d498 in add_jp_entry (pim_nbr=0x808c248, holdtime=210, 
    group=33620448, grp_msklen=32 ' ', source=1685262346, src_msklen=32 ' ', 
    addr_flags=0, join_prune=2 '\002') at pim_proto.c:2138

warning: Source file is more recent than executable.
2138                break;
(gdb) print *pim_nbr
$1 = {next = 0x0, prev = 0x106, address = 134587968, vifi = 0, timer = 0, 
  build_jp_message = 0x1}
Current language:  auto; currently minimal
(gdb) $c
Undefined command: "$c".  Try "help".
(gdb) where
#0  0x0806d498 in add_jp_entry (pim_nbr=0x808c248, holdtime=210, 
    group=33620448, grp_msklen=32 ' ', source=1685262346, src_msklen=32 ' ', 
    addr_flags=0, join_prune=2 '\002') at pim_proto.c:2138
#1  0x080643ba in age_routes () at timer.c:713
#2  0x0805a64d in timer (i=0x0) at main.c:675
#3  0x0806014f in age_callout_queue (elapsed_time=0) at callout.c:94
#4  0x0805a5e7 in main (argc=0, argv=0x8047e50) at main.c:638
(gdb)  

This looks more like memory corruption, but I could be wrong. The prev field of pim_nbr (0x106) seems to be corrupted.
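The raw integers in the gdb output above are easier to judge once decoded. The following is a minimal sketch, not part of pimd. It assumes a little-endian x86 host (suggested by the 0x08... code addresses in the backtrace) and that the address arguments are stored in network byte order. The helper names `decode_in_addr` and `as_hex` are hypothetical, introduced only for illustration.

```python
import socket
import struct


def decode_in_addr(value: int) -> str:
    """Decode a 32-bit integer, as printed by gdb on a little-endian host,
    into a dotted-quad IPv4 address.

    Assumption: the underlying bytes are in network byte order, so the
    integer's little-endian byte sequence is the address itself.
    """
    return socket.inet_ntoa(struct.pack("<I", value))


def as_hex(value: int) -> str:
    """Format a 32-bit value the way gdb prints pointers (0x + 8 hex digits)."""
    return f"{value:#010x}"


# Values copied from frame #0 of the backtrace.
group = decode_in_addr(33620448)
source = decode_in_addr(1685262346)

# Values copied from `print *pim_nbr`.
nbr_address_raw = 134587968
nbr_address_hex = as_hex(nbr_address_raw)

print("group  :", group)
print("source :", source)
print("pim_nbr->address as hex:", nbr_address_hex)
print("pim_nbr->address as IPv4:", decode_in_addr(nbr_address_raw))
```

Under those assumptions the group decodes to 224.1.1.2 and the source to 10.16.115.100, both plausible for a multicast test setup. The neighbor's address field, however, is 0x0805a640, which lies in the same range as the code addresses in the backtrace (for example, timer at 0x0805a64d) and does not look like a neighbor IP. That would be consistent with the corruption guess above, though it does not prove it.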

idismmxiv commented 9 years ago

Interesting. Did you have switches between the pimd routers, or were they directly connected? How did you cause the link down? Was it the sending-side pimd that crashed, or the receiving side? Did the multicast flow recover through the upper part of the triangle (through the RP) before the crash, and how quickly did the crash occur?

mfspeer commented 9 years ago

It's the receiving side, operating with shortest path tree switchover on.

troglobit commented 9 years ago

Really hope 69a5e34 fixes this bug once and for all! (If not, please reopen issue #22.)

Thanks for all the help debugging it!