opensourcerouting / frr

Free Range Routing Protocol Suite
Other
37 stars 12 forks source link

ISIS: Missing routes unless `debug isis snp-packets` is enabled #5

Closed mwinter-osr closed 7 years ago

mwinter-osr commented 7 years ago

This happens for broadcast, FRR=DIS with a simulated broadcast topology

If there are topologies with more than ~5000 routes, then some ISIS LSPs seem to get dropped. Enabling debug isis snp-packets fixes the issue.

Retested on git 9cdce03 (2017-08-08)

Same topology

Output without the debug:

osr-perf3# show ip route summ
Route Source         Routes               FIB  (vrf Default-IP-Routing-Table)
kernel               1                    1                    
connected            5                    5                    
static               1                    1                    
isis                 6970                 6969                 
------
Totals               6977                 6976                 

and with the debug (correct):

osr-perf3# show ip route summ
Route Source         Routes               FIB  (vrf Default-IP-Routing-Table)
kernel               1                    1                    
connected            5                    5                    
static               1                    1                    
isis                 7073                 7072                 
------
Totals               7080                 7079                 

Doing a binary search through the code, it seems to be the debug section in in isis_pdu.c - lines 1780-1792:

if (isis->debugs & DEBUG_SNP_PACKETS) {
        zlog_debug(
                "ISIS-Snp (%s): Sending L%d CSNP on %s, length %zd",
                circuit->area->area_tag, level,
                circuit->interface->name,
                stream_get_endp(circuit->snd_stream));
        log_multiline(LOG_DEBUG, "              ", "%s",
                      isis_format_tlvs(tlvs));
        if (isis->debugs & DEBUG_PACKET_DUMP)
                zlog_dump_data(
                        STREAM_DATA(circuit->snd_stream),
                        stream_get_endp(circuit->snd_stream));
}

Just adding a isis_format_tlvs(tlvs) outside the debug (to be always executed) seems to fix the issue. No idea why... (Didn't look further into it)

cfra commented 7 years ago

As the saying goes, a call to isis_format_tlvs SHOULD not make a difference, but in practice, it seems to do so. Apart from setting up an output buffer, the function only calls format_tlvs. I would suggest to verify whether a call isis_format_tlvs still fixes the issue if the call to format_tlvs is commented out.

I would expect that it doesn't. If this is indeed the case, some part of format_tlvs seems to fix the problem. In that case, reducing format_tlvs step by step to figure out which part fixes the issue would seem the best option.

Could you either look into that or provide me with instructions on how I can test this myself?

mwinter-osr commented 7 years ago

Correct assumption: If I comment out format_tlvs then it doesn't work anymore.

Trying to dig more into details now...

mwinter-osr commented 7 years ago

Digging (read: binary search) into format_tlvs() shows the following call to be responsible to make it all work (5th format_items() call):

format_items(ISIS_CONTEXT_LSP, ISIS_TLV_LSP_ENTRY, &tlvs->lsp_entries,
         buf, indent);

I've commented all other calls out and it's still working

I've then commented JUST this specific format_items() calls out and can confirm that I have the missing routes again.

cfra commented 7 years ago

I am a bit at a loss of why this should make any difference. The TLVs have already been serialized into a CSNP pdu before the call to format_items, so even if the TLVs changed through the call (which they don't/shouldn't), it would not make a difference. You could try commenting out the body of format_item_lsp_entry and check whether the routes are still missing.

Somehow, I have the growing feeling that this issue is somehow related to timing or a memory corruption issue which I did not spot yet. Because from the obvious/intended effects of the code, I cannot explain why it makes a difference. :/

Could you try replacing the isis_format_tlvs call with an usleep(5000)?

Also, it might shed some light on things to compare the CSNP packets which are sent in the working case and in the broken case. Maybe the effect of the isis_format_tlvs call becomes apparent in this comparison.

mwinter-osr commented 7 years ago

ok, a simple usleep(5000) (instead of the isis_format_tlvs) seems to fix the issue.

I'm going to try compiling it with the address sanitizer and see if this finds any memory corruption.

mwinter-osr commented 7 years ago

No luck with Address Sanitizer. Tried it with gcc-5.4.0 (default Ubuntu 16.04 gcc) and gcc-7.1.0 and in both cases no memory corruption reported (but still had missing routes with asan enabled)

mwinter-osr commented 7 years ago

Comparing the output of a good run (using the delay) and a bad run: Good: https://drive.google.com/open?id=0B8W_T0dxQfwxYUg2ZWNQVzZQZnc Bad: https://drive.google.com/open?id=0B8W_T0dxQfwxYXNJWnFUdl9ibDg

The output shows missing entries in the ISIS database for the case with the missing routes

cfra commented 7 years ago

Is there a way I can access to the testbed for running this test? I have different assumptions why this delay could be fixing things, and it is probably easier if I can just asses them on my own instead of us having to go back and forth.

cfra commented 7 years ago

I am off for a week vacation and will continue work on this when I return.

mwinter-osr commented 7 years ago

Moving issue to frrouting issue https://github.com/FRRouting/frr/issues/980 so it gets tracked