refraction-networking / conjure

Conjure Refraction Networking station code
https://refraction.network
Apache License 2.0

Crash Debug #190

Closed · jmwample closed this 1 year ago

jmwample commented 1 year ago

One of the main stations crashed overnight, seemingly due to thread starvation in the proxy.halfPipe() function. The crash logs in syslog contain lines like "goroutine 6933001 [semacquire, 4 minutes]:", which makes me think that there may have been ~7 million goroutines running.


Jun 29 00:13:14 : conjure-app.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jun 29 00:13:13 :         /opt/conjure/application/conns.go:66 +0x2fb
Jun 29 00:13:13 : created by main.(*connManager).acceptConnections
Jun 29 00:13:13 :         /opt/conjure/application/conns.go:112 +0x2f2
Jun 29 00:13:13 : main.(*connManager).handleNewConn(0xc025e3da40?, 0x0?, 0xc024efcdd8)
Jun 29 00:13:13 :         /opt/conjure/application/conns.go:327 +0x1bb9
Jun 29 00:13:13 : main.(*connManager).handleNewTCPConn(0xc0001f65b0, 0xc0000be460, {0xa3afb8, 0xc024efcdd8}, {0xc00d49dda0, 0x10, 0x10})
Jun 29 00:13:13 :         /opt/conjure/application/lib/proxies.go:262 +0x7a5
Jun 29 00:13:13 : github.com/refraction-networking/conjure/application/lib.Proxy(0xc008ce2700, {0xa3b0c0, 0xc01f95b240}, 0xc00254fdb0)
Jun 29 00:13:13 :         /usr/local/go/src/sync/waitgroup.go:116 +0x4b
Jun 29 00:13:13 : sync.(*WaitGroup).Wait(0xc008ce2700?)
Jun 29 00:13:13 :         /usr/local/go/src/runtime/sema.go:62 +0x27
Jun 29 00:13:13 : sync.runtime_Semacquire(0xc019ab7b00?)
Jun 29 00:13:13 : goroutine 6933001 [semacquire, 4 minutes]:
Jun 29 00:13:13 :         /opt/conjure/application/conns.go:66 +0x2fb
Jun 29 00:13:13 : created by main.(*connManager).acceptConnections
Jun 29 00:13:13 :         /opt/conjure/application/conns.go:112 +0x2f2
Jun 29 00:13:13 : main.(*connManager).handleNewConn(0x80ffaa?, 0x0?, 0xc00efb2890)
Jun 29 00:13:13 :         /opt/conjure/application/conns.go:327 +0x1bb9
Jun 29 00:13:13 : main.(*connManager).handleNewTCPConn(0xc0001f65b0, 0xc0000be460, {0xa3afb8, 0xc00efb2890}, {0xc016ea42f0, 0x10, 0x10})
Jun 29 00:13:13 :         /opt/conjure/application/lib/proxies.go:262 +0x7a5
Jun 29 00:13:13 : github.com/refraction-networking/conjure/application/lib.Proxy(0xc023754fc0, {0xa3b0c0, 0xc01f1c6a20}, 0xc00a19ac30)
Jun 29 00:13:13 :         /usr/local/go/src/sync/waitgroup.go:116 +0x4b
Jun 29 00:13:13 : sync.(*WaitGroup).Wait(0xc023754fc0?)
Jun 29 00:13:13 :         /usr/local/go/src/runtime/sema.go:62 +0x27
Jun 29 00:13:13 : sync.runtime_Semacquire(0xc005d7b260?)
Jun 29 00:13:13 : goroutine 6716000 [semacquire, 13 minutes]:

- [ ] test solutions

ewust commented 1 year ago

Not sure if we definitely have 7M goroutines, or if Go is just numerically incrementing goroutine IDs over time. E.g. if you do something like

go func() { fmt.Println("A") }()
time.Sleep(time.Second)
go func() { fmt.Println("B") }()

does B get printed from goroutine 1 or goroutine 2?

In any case, forwarding FINs like this seems like a good idea. One thing to be careful of is double closing (e.g. the two halfPipes for (src, dst) and (dst, src) don't each need to close both connections in their defers), but it looks like you handle this by just closing dst in the defer for halfPipe.

jmwample commented 1 year ago

> One thing to be careful of is double closing (e.g. the two halfPipes for (src, dst) and (dst, src) don't each need to close both connections in their defers), but it looks like you handle this by just closing dst in the defer for halfPipe.

I actually try to close both on both sides. We could add a channel that signals the other halfPipe to close its dst, but I think that in Go the worst that happens with a double close is a returned net.ErrClosed error.

> does B get printed from goroutine 1 or goroutine 2?

I don't know the answer to this. It is possible that they increment globally over time and that in total we had used 7M, but that doesn't mean that there were 7M live at one time. I am not sure how to get goroutine IDs. I did add a count of the current number of goroutines in use by the application to the stats logging, so we can watch over time.

jmwample commented 1 year ago

> does B get printed from goroutine 1 or goroutine 2?

Seems they are globally incrementing within the process, according to this SO thread. So it would probably be 2 (or higher, since Go probably runs some goroutines under the hood).
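The runtime deliberately exposes no API for goroutine IDs; the usual debugging hack (the one the SO thread describes) is to parse the ID out of the header line of runtime.Stack, which has the form "goroutine <id> [running]:". A hedged sketch, for debugging only:

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"strconv"
)

// goid parses the current goroutine's ID from the first line of its
// stack trace ("goroutine <id> [running]:"). Debugging use only; the
// runtime intentionally provides no supported way to get this.
func goid() uint64 {
	buf := make([]byte, 64)
	buf = buf[:runtime.Stack(buf, false)]
	buf = bytes.TrimPrefix(buf, []byte("goroutine "))
	buf = buf[:bytes.IndexByte(buf, ' ')]
	id, _ := strconv.ParseUint(string(buf), 10, 64)
	return id
}

func main() {
	fmt.Println("main goroutine id:", goid()) // the main goroutine is always 1

	done := make(chan uint64)
	go func() { done <- goid() }()
	// IDs increment globally over the life of the process, so this is
	// some ID greater than 1, not necessarily 2.
	fmt.Println("child goroutine id:", <-done)
}
```

This is consistent with the crash log: "goroutine 6933001" only tells us how many goroutines had ever been created by that point, not how many were alive.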

jmwample commented 1 year ago

The ultimate issue seems to have been a race condition in connection handling which then caused thread starvation elsewhere. Fixes to the proxy halfPipes are probably still a good thing.