jennifer-richards closed this issue 6 years ago.
I believe this was fixed by #68, but I am not entirely sure; it may be that this is what happens if a deadlock or segfault does not happen to occur immediately. I have not been able to reproduce this with that fix in place, but it was sometimes difficult to reproduce before. Open this back up if it recurs.
The best I know how to test:

a. using `trmon`
b. using `tidc`
c. by stopping and restarting the other trust router

When the trust router has been left alone for at least 30 seconds, running `top`, `ps`, or a system monitor should show one active (or sleeping) `trust_router` process and at most one other `trust_router` process in the zombie state. Running `top` or a system monitor should show that the trust router is using very low CPU.
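As one concrete way to run that check (a hypothetical one-liner, not from the project itself), the following lists any `trust_router` processes with their state codes; `Z` in the STAT column marks a zombie:

```shell
# List trust_router processes with PID and state code.
# STAT "Z" marks a zombie; more than one live trust_router process
# after 30 idle seconds would indicate a failure.
ps -eo pid,stat,comm | awk '$3 == "trust_router"'
```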
A fail is: extra `trust_router` processes left behind, or noticeable CPU usage.
I've run this with:

- two windows which each run `trmon ... show` 100 times each
- `tidc` 100 times each
- `tidc` 250 times, in groups of five parallel processes
- `tidc` in the background 100 times

This last one crashed my `tidc` machine - it caused it to hang. (I should have anticipated that...) However, when the `tidc` machine hung and had to be restarted, the ninety `trust_router` processes that served the `tidc` connections remained on the `trust_router` server. These processes aren't using any CPU; they're just hanging out in the process table.
So, the overall CPU usage is fine, but it does leave processes behind if the other side stops communicating with us.
There is a sweep every 10 seconds (plus after every completed TID request) that should flush any processes that exited, so this probably means something is blocking the processes from exiting.
I'm not sure the (`trust_router`) processes exited. I think they are still waiting for the next communication from the no-longer-existing `tidc` processes.
Yes, that's what I expect. Normally the socket closes and whatever is waiting catches wind of that and the process exits uncleanly (but exits). If we have any processes still stuck, it'd be helpful to attach gdb to one and see where it is. Can you reproduce this with the debug symbols installed?
It's (unsurprisingly) in `ReadBuffer` in the gsscon code:
```
(gdb) bt
#0  0x00007fa338c277fd in read () from /lib64/libpthread.so.0
#1  0x0000000000423e71 in read (__nbytes=4, __buf=0x7ffdcfa9ad64, __fd=21) at /usr/include/bits/unistd.h:44
#2  ReadBuffer (inSocket=inSocket@entry=21, inBufferLength=inBufferLength@entry=4, ioBuffer=ioBuffer@entry=0x7ffdcfa9ad64 "")
    at gsscon_common.c:101
#3  0x0000000000423f2e in gsscon_read_token (inSocket=inSocket@entry=21, outTokenValue=outTokenValue@entry=0x7ffdcfa9ae50,
    outTokenLength=outTokenLength@entry=0x7ffdcfa9ae58) at gsscon_common.c:168
#4  0x0000000000424744 in gsscon_passive_authenticate (inSocket=inSocket@entry=21, inNameBuffer=...,
    outGSSContext=outGSSContext@entry=0x7ffdcfa9af00, clientCb=clientCb@entry=0x408800 <tr_gss_auth_cb>,
    clientCbData=clientCbData@entry=0x1fb5860) at gsscon_passive.c:119
#5  0x00000000004089a5 in tr_gss_auth_connection (auth_cookie=0x1f115c0, auth_cb=0x405310 <tr_tids_gss_handler>, gssctx=0x7ffdcfa9af00,
    acceptor_hostname=<optimized out>, acceptor_service=0x429333 "trustidentity", conn=21) at common/tr_gss.c:126
#6  tr_gss_handle_connection (conn=conn@entry=21, acceptor_service=acceptor_service@entry=0x429333 "trustidentity",
    acceptor_hostname=<optimized out>, auth_cb=0x405310 <tr_tids_gss_handler>, auth_cookie=0x1f115c0,
    req_cb=req_cb@entry=0x409e40 <tids_req_cb>, req_cookie=req_cookie@entry=0x1f0f870) at common/tr_gss.c:237
#7  0x000000000040a7c8 in tids_handle_proc (result_fd=105, conn_fd=21, tids=0x1f0f870) at tid/tids.c:424
#8  tids_accept (tids=0x1f0f870, listen=10) at tid/tids.c:487
#9  0x00007fa339dc9a14 in event_base_loop () from /lib64/libevent-2.0.so.5
#10 0x0000000000404643 in main (argc=<optimized out>, argv=<optimized out>) at tr/tr_main.c:324
```
Ok, I suspect what is happening is that an unclean shutdown of a client is failing to notify the server that it has disconnected.
I have added a poll / timeout to `ReadBuffer()`. It's a bit tricky to set the timeout here because I think there are times when a trust router may legitimately be waiting for a significant period. E.g., I think it sits in this function while waiting for a TID request to go through. I've set it to 60 seconds for now, which I think should be long enough. We may eventually want to make this a parameter, but I'm hoping to get away without refactoring the various internal APIs that would be involved.
I committed this directly to the milestone branch because it's the easiest way to do testing on this right now. It's e42c080113a803fcbc9a18e77e3a48e643cad5d9.
Kicking back to @meadmaker to test the fix
Fix tested, passed.
Found 4-5 `trust_router` processes, each running at 50-100% CPU, after the trust router had been running for a while. Perhaps related to #4. The trust router had made a number of unsuccessful connection attempts, and was peered.