painless-security / trust-router

Moonshot Trust Router

Runaway trust_router processes #64

Closed: jennifer-richards closed this issue 6 years ago

jennifer-richards commented 6 years ago

Found 4-5 trust_router processes, each running at 50-100% of the CPU, after the trust router was running for a while.

Perhaps related to #4.

The trust router had made a number of unsuccessful connection attempts and was peered.

jennifer-richards commented 6 years ago

I believe this was fixed by #68, but I am not entirely sure; it may be that this is what happens if a deadlock or seg fault does not happen to occur immediately. I have not been able to reproduce this with that fix in place, but it was sometimes difficult to reproduce before. Open this back up if it recurs.

jennifer-richards commented 6 years ago

Test

The best way I know how to test this:

Steps

  1. Set up two or more trust router peers
  2. Start the trust routers
  3. Attempt repeated connections to one of the trust routers:
     a. using trmon
     b. using tidc
     c. by stopping and restarting the other trust router
  4. Repeat step 3 with valid and invalid credentials, with local and discovered realms, etc.
  5. Shut down the peer trust router (but not the one you're testing! leave that running!)

Expected results

When the trust router has been left alone for at least 30 seconds, running top, ps, or a system monitor should show one active (or sleeping) trust_router process and at most one other trust_router process that is in the zombie state. Running top or a system monitor should show that the trust router is using very low CPU.

A failure is:

  1. multiple zombie (or worse, active) processes
  2. high CPU load with no activity

meadmaker commented 6 years ago

I've run this with:

This last one crashed my tidc machine; it caused the machine to hang. (I should have anticipated that...) However, when the tidc machine hung and had to be restarted, the ninety trust_router processes that served the tidc connections remained on the trust_router server. These processes aren't using any CPU; they're just hanging out in the process table.

So, the overall CPU usage is fine, but it does leave processes behind if the other side stops communicating with us.

jennifer-richards commented 6 years ago

There is a sweep every 10 seconds (plus after every completed TID request) that should flush any processes that exited, so this probably means something is blocking the processes from exiting.
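
As a rough illustration (not the actual trust_router code; the function name and structure here are assumptions), a periodic sweep like that typically reaps children with a non-blocking waitpid() loop:

```c
#include <sys/types.h>
#include <sys/wait.h>

/* Illustrative sketch: reap any child processes that have already exited.
 * A real implementation would also update its own bookkeeping for each
 * reaped pid; this only shows the non-blocking waitpid() loop. */
static void sweep_exited_children(void)
{
    pid_t pid;
    int status;

    /* WNOHANG returns immediately when no child has exited yet, so a
     * periodic sweep never blocks the main event loop. */
    while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
        /* pid was a zombie and has now been reaped. */
    }
}
```

A sweep like this can only clear children that have actually terminated; a child still blocked in a read() never reaches its exit path and so never shows up here, which fits the stuck processes described below.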

meadmaker commented 6 years ago

I'm not sure the (trust_router) processes exited. I think they are still waiting for the next communication from the no-longer-existing tidc processes.

jennifer-richards commented 6 years ago

Yes, that's what I expect. Normally the socket closes, whatever is waiting notices, and the process exits uncleanly (but it does exit). If any processes are still stuck, it would be helpful to attach gdb to one and see where it is. Can you reproduce this with the debug symbols installed?

jennifer-richards commented 6 years ago

It's (unsurprisingly) in ReadBuffer in the gsscon code:

```
(gdb) bt
#0  0x00007fa338c277fd in read () from /lib64/libpthread.so.0
#1  0x0000000000423e71 in read (__nbytes=4, __buf=0x7ffdcfa9ad64, __fd=21) at /usr/include/bits/unistd.h:44
#2  ReadBuffer (inSocket=inSocket@entry=21, inBufferLength=inBufferLength@entry=4, ioBuffer=ioBuffer@entry=0x7ffdcfa9ad64 "")
    at gsscon_common.c:101
#3  0x0000000000423f2e in gsscon_read_token (inSocket=inSocket@entry=21, outTokenValue=outTokenValue@entry=0x7ffdcfa9ae50, 
    outTokenLength=outTokenLength@entry=0x7ffdcfa9ae58) at gsscon_common.c:168
#4  0x0000000000424744 in gsscon_passive_authenticate (inSocket=inSocket@entry=21, inNameBuffer=..., 
    outGSSContext=outGSSContext@entry=0x7ffdcfa9af00, clientCb=clientCb@entry=0x408800 <tr_gss_auth_cb>, 
    clientCbData=clientCbData@entry=0x1fb5860) at gsscon_passive.c:119
#5  0x00000000004089a5 in tr_gss_auth_connection (auth_cookie=0x1f115c0, auth_cb=0x405310 <tr_tids_gss_handler>, gssctx=0x7ffdcfa9af00, 
    acceptor_hostname=<optimized out>, acceptor_service=0x429333 "trustidentity", conn=21) at common/tr_gss.c:126
#6  tr_gss_handle_connection (conn=conn@entry=21, acceptor_service=acceptor_service@entry=0x429333 "trustidentity", 
    acceptor_hostname=<optimized out>, auth_cb=0x405310 <tr_tids_gss_handler>, auth_cookie=0x1f115c0, 
    req_cb=req_cb@entry=0x409e40 <tids_req_cb>, req_cookie=req_cookie@entry=0x1f0f870) at common/tr_gss.c:237
#7  0x000000000040a7c8 in tids_handle_proc (result_fd=105, conn_fd=21, tids=0x1f0f870) at tid/tids.c:424
#8  tids_accept (tids=0x1f0f870, listen=10) at tid/tids.c:487
#9  0x00007fa339dc9a14 in event_base_loop () from /lib64/libevent-2.0.so.5
#10 0x0000000000404643 in main (argc=<optimized out>, argv=<optimized out>) at tr/tr_main.c:324
```

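For context, here is a minimal sketch (illustrative only, not the actual gsscon_common.c source) of the kind of blocking read loop the backtrace points at: it keeps calling read() until the requested number of bytes has arrived, so a peer that vanishes without closing the socket leaves the process parked here indefinitely.

```c
#include <unistd.h>
#include <errno.h>

/* Illustrative sketch of a blocking read loop: keep calling read() until
 * 'len' bytes have arrived. If the peer's host hangs or disappears without
 * closing the connection, read() never returns and the process is stuck. */
static int read_exact(int fd, char *buf, size_t len)
{
    size_t got = 0;

    while (got < len) {
        ssize_t n = read(fd, buf + got, len - got);
        if (n == 0)
            return -1;       /* peer closed the connection cleanly */
        if (n < 0) {
            if (errno == EINTR)
                continue;    /* interrupted by a signal; retry */
            return -1;       /* real error */
        }
        got += (size_t)n;
    }
    return 0;
}
```
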
jennifer-richards commented 6 years ago

Ok, I suspect what is happening is that an unclean shutdown of a client is failing to notify the server that it has disconnected.

I have added a poll / timeout to ReadBuffer(). It's a little bit tricky to set the timeout here because I think there are some times where a trust router may legitimately be waiting for a significant period of time. E.g., I think it sits in this function while waiting for a TID request to go through. I've set it to 60 seconds for now, which I think should be long enough. We may eventually want to make this a parameter, but I'm hoping to get away without refactoring the various internal APIs that would be involved.
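
As a rough sketch of that approach (not the committed code; the helper name, return convention, and error handling here are assumptions), a poll()-based timeout in front of the blocking read might look like this:

```c
#include <poll.h>
#include <unistd.h>

#define READ_TIMEOUT_MS (60 * 1000)   /* 60 seconds, matching the value above */

/* Wait up to READ_TIMEOUT_MS for the socket to become readable before
 * calling read(). Returns bytes read, 0 on EOF, or -1 on error/timeout,
 * so a vanished peer can no longer block the process forever. */
static ssize_t read_with_timeout(int fd, void *buf, size_t len)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int rc = poll(&pfd, 1, READ_TIMEOUT_MS);

    if (rc == 0)
        return -1;    /* timed out: treat the peer as gone */
    if (rc < 0)
        return -1;    /* poll() itself failed */

    return read(fd, buf, len);
}
```

With a timeout in place, a peer that disappears without closing the socket eventually makes the read path fail, so the per-connection process can exit instead of lingering in the process table.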

I committed this directly to the milestone branch because it's the easiest way to do testing on this right now. It's e42c080113a803fcbc9a18e77e3a48e643cad5d9.

jennifer-richards commented 6 years ago

Kicking back to @meadmaker to test the fix

meadmaker commented 6 years ago

Fix tested, passed.