rtrlib / bird-rtrlib-cli

CLI that maintains ROA table in BIRD using RTRlib
GNU Lesser General Public License v3.0

bird-rtrlib-cli crashes/stops after few hours #2

Closed job closed 8 years ago

job commented 10 years ago

Hi,

I am running bird-rtrlib-cli on three Ubuntu 14.04 amd64 servers, and on all three the CLI client pointed at the IPv4 BIRD instance just stops working. Here is a full log of both STDOUT and STDERR, from start to finish: http://paste.ubuntu.com/8024561/

I am running the program like this: bird-rtrlib-cli -b /var/run/bird/bird.ctl -r rpki.coloclue.net:8282 2>&1 | tee /tmp/log

Any suggestions on how I can debug this further?

mceyran commented 10 years ago

Hey, thanks for testing :D. I will look at it this evening when I'm home. Your link is not working (maybe some temporary outage), but it sounds like either a memory leak or a bad pointer that sneaked in somewhere.

job commented 10 years ago

Must be a temporary outage, it worked half an hour ago. The log basically shows little, except that at some point the client sends "delete roa" commands to BIRD for all prefixes it received from the RTR cache and then quits.

waehlisch commented 10 years ago

Do you use RIPE Validator or Rcynic?

job commented 10 years ago

RIPE validator, latest version

job commented 10 years ago

http://rpki.coloclue.net:8080/

waehlisch commented 10 years ago

OK. Updated our cache server to RIPE Validator 2.17. If @mceyran doesn't find anything, I think we should tcpdump the network traffic between server and client until the ROAs are deleted, just to exclude interop problems between cache server and client (I don't expect such problems, but you never know).

mceyran commented 10 years ago

It sounds like either a bug in rtrlib or normal behavior. As long as there is no exit command from the command line, the client won't exit. Maybe the RTR socket gets disconnected and rtrlib then posts deletes to the callback function, which issues the delete roa commands. In this case, we should implement an auto-reconnect feature. @job: Do you get back to the prompt cleanly, or is the client idling after all those deletes?

job commented 10 years ago

I get back to the prompt. I can let a tcpdump run alongside and get you the PCAP.

job commented 10 years ago

@mceyran If you do not have an auto-reconnect feature, that probably is it. My RPKI validators reboot automatically when there is a security update, which could very well coincide with bird-rtrlib-cli stopping.

mceyran commented 10 years ago

Sounds like a new feature is coming ;).

waehlisch commented 10 years ago

@job could you verify your assumption by checking syslogs?

job commented 10 years ago

Neither the router nor the RPKI validators rebooted, so that theory doesn't hold. Maybe the connection was lost in some other way and not re-established?

Another good feature: being able to specify the source IP address for the RTR connection. I'd prefer to run the connections from the loopbacks of my routers.

waehlisch commented 10 years ago

OK. If you tcpdump the session and @mceyran monitors our test setup, we should find the error soon.

waehlisch commented 10 years ago

@mceyran can you reproduce the error?

mceyran commented 10 years ago

Not yet, but it seems that getline() returns -1 for some reason, triggering a clean exit of the client. I am going to guard the getline() call and check for errno and see what happens.
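A guard of the kind described above might look like the following. This is a minimal sketch, not the code that was actually pushed; `read_command` is a hypothetical helper name, and logging via syslog is an assumption. It separates a normal end of input from a real read error, so a -1 from getline() no longer leads to a silent exit.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <syslog.h>

/* Hypothetical helper (not from bird-rtrlib-cli): read one line and
 * report why getline() returned -1.
 * Returns 1 when a line was read, 0 on end of input, -1 on a read error. */
static int read_command(FILE *in, char **line, size_t *cap)
{
    errno = 0;
    ssize_t n = getline(line, cap, in);
    if (n >= 0)
        return 1;

    if (feof(in)) {
        /* stdin was closed: a legitimate reason to stop reading. */
        syslog(LOG_INFO, "input closed (EOF)");
        return 0;
    }

    /* A genuine error (for example EINTR or EIO): log it instead of
     * exiting silently. */
    syslog(LOG_ERR, "getline() failed: %s", strerror(errno));
    return -1;
}
```

The caller would loop while the result is 1 and decide separately what to do for 0 and -1, for example retrying on an interrupted read.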

mceyran commented 10 years ago

Running for 5 hours, nothing happened... Trying rpki.coloclue.net:8282 no-ssh now.

mceyran commented 10 years ago

./bird-rtrlib-cli -b /usr/local/var/run/bird.ctl -r rpki.coloclue.net 8282 2>&1 | tee log.txt has been working for 10 hours in a screen session now... I just pushed a version that syslogs the error string if an error has occurred in getline(). @job: Can you try this out?

job commented 10 years ago

Since I started waiting for it to crash, it has not crashed anymore... I will build new Debian packages this weekend with your updated code and hope for the best.

waehlisch commented 10 years ago

Let's wait.

waehlisch commented 10 years ago

Did the error occur again?

job commented 10 years ago

Not yet, the session I was monitoring was running over IPv6 and seems stable.

I've restarted it and now the connection between the cache and bird-rtrlib-cli is over IPv6, let's see if that makes a difference.

job commented 10 years ago

Running bird-rtrlib-cli over IPv6 to the RPKI Validator made it quit again within a few hours. Unfortunately I had a typo in my tcpdump command, so I didn't catch the PCAP. Now starting again...

job commented 10 years ago

OK! Got a crash

http://instituut.net/~job/rtrlib-over-ipv6.pcap

I could only reproduce it with the connection between client and RPKI validator going over IPv6.

waehlisch commented 10 years ago

@job thanks!

As far as I can see, cache and router are out of sync, i.e., the router sends a Serial Query, to which the cache replies with a Cache Reset. The router then needs to send a Reset Query. In your pcap file, there are several such events where RTRlib behaves correctly. However, after the last Cache Reset message, RTRlib does not send a Reset Query but closes the TCP connection.

Strange bug, and it's also confusing that it depends on using IPv6 for the cache-router connection.

waehlisch commented 10 years ago

Update: The cache server sends several Serial Notifies before the End of Data of the last Cache Reset was sent, which leads to continuous full updates. Maybe that's the reason why the router closes the connection.

It seems that the cache server implementation also has a bug. I opened a ticket: RIPE-NCC/rpki-validator#6
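The exchange discussed in these comments can be summarized as a small decision function. This is only a sketch of the router-side behavior expected by RFC 6810, not RTRlib's actual implementation. The PDU type codes are those from the RFC; `next_action`, `router_action`, and the `sync_in_progress` flag are hypothetical names, and ignoring a Serial Notify during a running update is a design assumption of this sketch.

```c
#include <stdbool.h>

/* PDU type codes as defined in RFC 6810. */
enum rtr_pdu_type {
    PDU_SERIAL_NOTIFY  = 0,
    PDU_SERIAL_QUERY   = 1,
    PDU_RESET_QUERY    = 2,
    PDU_CACHE_RESPONSE = 3,
    PDU_IPV4_PREFIX    = 4,
    PDU_IPV6_PREFIX    = 6,
    PDU_END_OF_DATA    = 7,
    PDU_CACHE_RESET    = 8,
    PDU_ERROR_REPORT   = 10
};

/* Hypothetical set of reactions the router side can choose from. */
enum router_action {
    ACT_NONE,
    ACT_SEND_SERIAL_QUERY,
    ACT_SEND_RESET_QUERY,
    ACT_CLOSE
};

/* Decide how the router reacts to a PDU received from the cache. */
static enum router_action next_action(enum rtr_pdu_type received,
                                      bool sync_in_progress)
{
    switch (received) {
    case PDU_SERIAL_NOTIFY:
        /* Assumption: a notify that arrives before End of Data is
         * ignored, so it cannot trigger another full update. */
        return sync_in_progress ? ACT_NONE : ACT_SEND_SERIAL_QUERY;
    case PDU_CACHE_RESET:
        /* The cache cannot serve an incremental update: ask for the
         * full data set instead of closing the connection. */
        return ACT_SEND_RESET_QUERY;
    case PDU_ERROR_REPORT:
        /* Simplification: every Error Report is treated as fatal. */
        return ACT_CLOSE;
    default:
        /* Payload PDUs are handled elsewhere and need no reply. */
        return ACT_NONE;
    }
}
```

With this rule, a Cache Reset always leads to a Reset Query, which is the step missing at the end of the capture.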

smlng commented 8 years ago

Although I thought my PR #8 might fix this one, my recent tests showed that it didn't. With Valgrind I wasn't able to reproduce the behavior, but with GDB I had some success, so to speak. The last lines of output before the crash:

bird-rtrlib-cli[23910]: From BIRD: 0000 
bird-rtrlib-cli[23910]: To BIRD: delete roa 58.65.160.0/19 max 19 as 23674

Program received signal SIGPIPE, Broken pipe.
[Switching to Thread 0x7ffff6996700 (LWP 23914)]
0x00007ffff78e688d in write () at ../sysdeps/unix/syscall-template.S:81
81      ../sysdeps/unix/syscall-template.S: No such file or directory.

So we may have to provide a handler for SIGPIPE and reconnect, if fired? Any suggestions on that?

waehlisch commented 8 years ago

Hmm, why was the socket closed? Anyhow, it is probably not our fault. I agree that we should write a handler that catches the signal so that the program can continue.

smlng commented 8 years ago

I don't know why the socket is (temporarily) gone resulting in SIGPIPE, but as restarting bird-rtrlib-cli succeeds I assume BIRD handles this somehow and recreates the socket immediately.

However, bird-rtrlib-cli.c ignores return values (and hence possible errors) on socket read and write. I'll fix that and make a PR.
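Checking the return value of a socket write could look roughly like this. It is a minimal sketch under stated assumptions, not the code from the eventual PR; `write_all` is a hypothetical helper name. It retries interrupted and partial writes and hands every other error back to the caller through errno.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write the whole buffer or report failure.
 * Returns 0 on success, -1 on error with errno set by write(). */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted by a signal: retry */
            return -1;      /* e.g. EPIPE when the peer is gone */
        }
        /* A short write is not an error: send the remainder. */
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}
```

Note that an EPIPE error only reaches the caller if SIGPIPE is ignored or handled; with the default disposition the process is terminated first, as in the GDB output above.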

smlng commented 8 years ago

I created PR #9, which should fix this issue.

I tested the proposed solution for >12h and it worked like a charm: no crash whatsoever, and communication with BIRD is still intact. I did see several SIGPIPEs in syslog, but #9 introduces a SIGPIPE handler and reconnects the bird_socket if needed.

Please review and test - and finally merge #9.
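For readers who want to see the general idea, here is a minimal sketch of reconnecting after a broken pipe. It is not the code from #9: it ignores SIGPIPE instead of installing a handler, and `ignore_sigpipe`, `connect_unix`, and `send_command` are hypothetical names. It also does not read BIRD's replies or greeting, which a real client would have to do after reconnecting.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
#include <unistd.h>

/* Ignore SIGPIPE so that write() on a closed socket fails with EPIPE
 * instead of terminating the process. */
static int ignore_sigpipe(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = SIG_IGN;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGPIPE, &sa, NULL);
}

/* Hypothetical helper: connect to a Unix stream socket at `path`.
 * Returns the descriptor, or -1 with errno set. */
static int connect_unix(const char *path)
{
    struct sockaddr_un addr;
    if (strlen(path) >= sizeof addr.sun_path) {
        errno = ENAMETOOLONG;
        return -1;
    }
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof addr);
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, path);
    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}

/* Send `cmd`; if the peer has gone away, reconnect once and retry.
 * *fd is replaced by the new descriptor (or -1 if reconnecting fails). */
static int send_command(int *fd, const char *path, const char *cmd)
{
    size_t len = strlen(cmd);
    for (int attempt = 0; attempt < 2; attempt++) {
        size_t off = 0;
        while (off < len) {
            ssize_t n = write(*fd, cmd + off, len - off);
            if (n < 0) {
                if (errno == EINTR)
                    continue;
                break;
            }
            off += (size_t)n;
        }
        if (off == len)
            return 0;
        if (attempt > 0 || (errno != EPIPE && errno != ECONNRESET))
            return -1;
        close(*fd);
        *fd = connect_unix(path);
        if (*fd < 0)
            return -1;
    }
    return -1;
}
```

A caller would run `ignore_sigpipe()` once at startup and then route every command for BIRD through `send_command()`, passing the control socket path given with `-b`.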

waehlisch commented 8 years ago

Fixed in #9