Closed job closed 8 years ago
Hey, thanks for testing :D. I will look at it this evening when I'm home. Your link is not working (maybe some temporary outage), but it sounds like either a memory leak or a bad pointer that sneaked in somewhere.
Must be a temporary outage, it worked half an hour ago. The log basically shows little, except that at some point it will send "delete roa" commands to bird for all prefixes it received from the RTR cache and then quit.
Do you use RIPE Validator or Rcynic?
RIPE validator, latest version
OK. Updated our cache server to RIPE Validator 2.17. If @mceyran doesn't find anything, I think we should tcpdump the network traffic between server and client until the ROAs are deleted. Just to exclude interop problems between cache server and client (I don't expect such problems, but you never know).
It sounds like either a bug in rtrlib or normal behavior. As long as there is no exit command from the command line, the client won't exit. Maybe the RTR socket gets disconnected and rtrlib then posts deletes to the callback function, which issues the delete roa commands. In that case, we should implement an auto reconnect feature. @job: Do you get back to the prompt cleanly or is the client idling after all those deletes?
I get back to the prompt. I can let a tcpdump run alongside and get you the PCAP.
@mceyran If you do not have an autoreconnect feature, that probably is it. My RPKI validators reboot automatically when there is a security update, which could very well coincide with bird-rtrlib-cli stopping.
Sounds like a new feature is coming ;).
@job could you verify your assumption by checking syslogs?
Neither the router nor the RPKI validators rebooted, so that theory doesn't hold. Maybe the connection was lost in some other way and not re-established?
Another good feature: being able to specify the source IP address for the RTR connection. I'd prefer to run them from the loopbacks of my routers
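For reference, binding to a source address before connecting could look roughly like the sketch below. This is not code from bird-rtrlib-cli or rtrlib; connect_from() and its parameters are hypothetical names, and it assumes the source is given as a numeric IPv4 or IPv6 address (for example a loopback address of the router).

```c
#define _POSIX_C_SOURCE 200809L
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open a TCP connection to host:port that originates
 * from the numeric address src_ip. Returns a connected fd, or -1 on failure. */
int connect_from(const char *src_ip, const char *host, const char *port)
{
    struct addrinfo hints, *src = NULL, *dst = NULL, *d;
    int fd = -1;

    /* Parse the source address; AI_NUMERICHOST avoids any DNS lookup. */
    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_flags = AI_NUMERICHOST;
    if (getaddrinfo(src_ip, NULL, &hints, &src) != 0)
        return -1;

    /* Resolve the cache server in the same address family as the source. */
    hints.ai_family = src->ai_family;
    hints.ai_flags = 0;
    if (getaddrinfo(host, port, &hints, &dst) != 0) {
        freeaddrinfo(src);
        return -1;
    }

    for (d = dst; d != NULL; d = d->ai_next) {
        fd = socket(d->ai_family, d->ai_socktype, d->ai_protocol);
        if (fd < 0)
            continue;
        /* bind() with port 0 pins the source address and lets the kernel
         * pick an ephemeral source port. */
        if (bind(fd, src->ai_addr, src->ai_addrlen) == 0 &&
            connect(fd, d->ai_addr, d->ai_addrlen) == 0)
            break;
        close(fd);
        fd = -1;
    }

    freeaddrinfo(src);
    freeaddrinfo(dst);
    return fd;
}
```

A caller would pass the loopback address of the router as src_ip and the cache host and port as the other two arguments, then hand the returned descriptor to whatever speaks RTR.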
OK. If you tcpdump the session and @mceyran will monitor our test setup, we should find the error soon.
@mceyran can you reproduce the error?
Not yet, but it seems that getline() returns -1 for some reason, triggering a clean exit of the client. I am going to guard the getline() call and check for errno and see what happens.
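A guarded getline() call along those lines might look like the sketch below. It is only an illustration of the idea, not the code that was pushed; read_command_line() is a hypothetical helper name. The point is that getline() returns -1 both at end of input and on a real error, so feof() and errno are needed to tell the two apart.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: read one line from `in`.
 * Returns 1 if a line was read, 0 at end of input,
 * and -1 on a real read error (errno describes it). */
int read_command_line(FILE *in, char **line, size_t *cap)
{
    for (;;) {
        errno = 0;                       /* drop stale error codes */
        ssize_t n = getline(line, cap, in);
        if (n >= 0)
            return 1;                    /* got a line */
        if (feof(in))
            return 0;                    /* clean end of input */
        if (errno == EINTR) {            /* interrupted by a signal: retry */
            clearerr(in);
            continue;
        }
        fprintf(stderr, "getline failed: %s\n", strerror(errno));
        return -1;
    }
}
```

With this split, the client could keep running on a transient error and exit only on a real end of input or an explicit quit command.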
Running for 5 hours, nothing happened... Trying rpki.coloclue.net:8282 no-ssh now.
./bird-rtrlib-cli -b /usr/local/var/run/bird.ctl -r rpki.coloclue.net 8282 2>&1 | tee log.txt
has been working for 10 hours in a screen session now... I just pushed a version that syslogs the error string if an error has occurred in getline(). @job: Can you try this out?
Ever since I started waiting for it to crash, it has not crashed anymore... I will build new Debian packages this weekend with your updated code and hope for the best.
Let's wait.
Did the error occur again?
Not yet, the session I was monitoring was running over IPv6 and seems stable.
I've restarted it and now the connection between the cache and bird-rtrlib-cli is over IPv6, let's see if that makes a difference.
Running bird-rtrlib-cli over IPv6 to the RPKI Validator made it quit again within a few hours. Unfortunately I had a typo in my tcpdump command, so I didn't catch the PCAP. Now starting again..
OK! Got a crash
http://instituut.net/~job/rtrlib-over-ipv6.pcap
I could only reproduce it with the connection between client and RPKI validator going over IPv6.
@job thanks!
As far as I can see, cache and router are out of sync, i.e., the router sends a Serial Query, to which the cache replies with a Cache Reset. The router then needs to send a Reset Query. In your pcap file there are several such events in which RTRlib behaves correctly. However, after the latest Cache Reset message RTRlib does not send a Reset Query but closes the TCP connection.
Strange bug, and it's also confusing that this depends on using IPv6 for cache-router connection.
Update: The cache server sends several Serial Notifies before the End of Data of the last Cache Reset was sent, which leads to continuous full updates. Maybe that's the reason why the router closes the connection.
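To make the expected exchange concrete, here is a minimal sketch of how a router could pick its next step from a received PDU. The PDU type codes are the ones from RFC 6810; everything else (the router_action enum, next_action(), the sync_in_progress flag) is made up for illustration and is not how RTRlib is structured. Ignoring a Serial Notify while a sync is still running is a design assumption of this sketch, chosen because it would avoid the continuous full updates seen in the pcap.

```c
/* PDU type codes as assigned in RFC 6810. */
enum rtr_pdu_type {
    RTR_SERIAL_NOTIFY  = 0,
    RTR_SERIAL_QUERY   = 1,
    RTR_RESET_QUERY    = 2,
    RTR_CACHE_RESPONSE = 3,
    RTR_IPV4_PREFIX    = 4,
    RTR_IPV6_PREFIX    = 6,
    RTR_END_OF_DATA    = 7,
    RTR_CACHE_RESET    = 8,
    RTR_ERROR_REPORT   = 10
};

/* Hypothetical set of reactions on the router side. */
enum router_action {
    ACT_NONE,               /* keep reading PDUs */
    ACT_SEND_SERIAL_QUERY,  /* ask for an incremental update */
    ACT_SEND_RESET_QUERY    /* ask for the full data set */
};

/* Decide what the router sends next after receiving `received`.
 * sync_in_progress is nonzero between a query and its End of Data. */
enum router_action next_action(enum rtr_pdu_type received, int sync_in_progress)
{
    switch (received) {
    case RTR_CACHE_RESET:
        /* Cache cannot serve an incremental update: start over. */
        return ACT_SEND_RESET_QUERY;
    case RTR_SERIAL_NOTIFY:
        /* Assumption of this sketch: do not restart a running sync. */
        return sync_in_progress ? ACT_NONE : ACT_SEND_SERIAL_QUERY;
    default:
        return ACT_NONE;
    }
}
```

In the failing case from the pcap, the input would be RTR_CACHE_RESET and the expected reaction ACT_SEND_RESET_QUERY, whereas the client closed the TCP connection instead.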
It seems that the cache server implementation also has a bug. I opened a ticket: RIPE-NCC/rpki-validator#6
I thought my PR #8 might fix this one, but my recent tests showed that it didn't. With Valgrind I wasn't able to reproduce the behavior, but with GDB I had some success, so to speak. The last lines of output before crashing:
bird-rtrlib-cli[23910]: From BIRD: 0000
bird-rtrlib-cli[23910]: To BIRD: delete roa 58.65.160.0/19 max 19 as 23674
Program received signal SIGPIPE, Broken pipe.
[Switching to Thread 0x7ffff6996700 (LWP 23914)]
0x00007ffff78e688d in write () at ../sysdeps/unix/syscall-template.S:81
81 ../sysdeps/unix/syscall-template.S: No such file or directory.
So we may have to provide a handler for SIGPIPE and reconnect if it fires? Any suggestions on that?
Hmm, why was the socket closed? Anyhow, it is probably not our fault. I agree we should write a handler that catches the signal so that the program can continue.
I don't know why the socket is (temporarily) gone resulting in SIGPIPE, but as restarting bird-rtrlib-cli succeeds I assume BIRD handles this somehow and recreates the socket immediately.
However, bird-rtrlib-cli.c ignores return values (and hence possible errors) on socket read and write. I'll fix that and make a PR.
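As a rough sketch of what checking those return values can look like: the code below ignores SIGPIPE so that a write to a closed socket fails with EPIPE instead of killing the process, and reports that failure to the caller. Note that this swaps in SIG_IGN for a real signal handler, so it is not necessarily the approach a PR would take; ignore_sigpipe() and write_all() are hypothetical names, not functions from bird-rtrlib-cli.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Ignore SIGPIPE process-wide; write() then returns -1 with errno == EPIPE
 * when the peer has closed the socket. Returns 0 on success, -1 on error. */
int ignore_sigpipe(void)
{
    struct sigaction sa;

    sa.sa_handler = SIG_IGN;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGPIPE, &sa, NULL);
}

/* Write all len bytes of buf to fd, retrying on short writes and EINTR.
 * Returns 0 on success, -1 on error. errno == EPIPE means the peer is
 * gone and the caller should reconnect before sending again. */
int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted, try again */
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}
```

On a -1 return with errno set to EPIPE, the caller would close the descriptor, reopen the connection to the BIRD control socket, and resend the pending command.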
I created PR #9, which should fix this issue.
I tested the proposed solution for >12h and it worked like a charm, no crash whatsoever, and communication with BIRD is still intact. I did see several SIGPIPEs in syslog, but #9 introduces a SIGPIPE handler and reconnects the bird_socket if needed.
Please review and test - and finally merge #9.
Fixed in #9
Hi,
I am running bird-rtrlib-cli on three Ubuntu 14.04 amd64 servers, and on all three the cli client pointed at the IPv4 BIRD instance just stops working. Here is a full log of both STDOUT + STDERR, from start to finish http://paste.ubuntu.com/8024561/
I am running the program like this:
bird-rtrlib-cli -b /var/run/bird/bird.ctl -r rpki.coloclue.net:8282 2>&1 | tee /tmp/log
Any suggestions on how I can debug this further?