scrapli / scrapligo

scrapli, but in go!
MIT License
258 stars 42 forks source link

fix Goroutine leak due to stuck send on channel on timeouts #204

Closed sfinke0 closed 4 days ago

sfinke0 commented 1 week ago

Hello everyone,

we're using scrapligo to gather information from devices in an API. However we noticed an increasing number of Goroutines due to some devices where the input is running into timeouts.

With pprof we saw the following goroutines hanging:

# DEBUG=1

goroutine profile: total 55
16 @ 0x4731ae 0x407c6d 0x4078d7 0x8d633c 0x47b2c1
#       0x8d633b        [github.com/scrapli/scrapligo/channel](http://github.com/scrapli/scrapligo/channel).(*Channel).SendInputB.func1+0x4fb  /src/vendor/[github.com/scrapli/scrapligo/channel/sendinput.go:80](http://github.com/scrapli/scrapligo/channel/sendinput.go:80)
# DEBUG=2

goroutine 35769 [chan send, 407 minutes]:
[github.com/scrapli/scrapligo/channel](http://github.com/scrapli/scrapligo/channel).(*Channel).SendInputB.func1()
        /src/vendor/[github.com/scrapli/scrapligo/channel/sendinput.go:80](http://github.com/scrapli/scrapligo/channel/sendinput.go:80) +0x4fc
created by [github.com/scrapli/scrapligo/channel](http://github.com/scrapli/scrapligo/channel).(*Channel).SendInputB in goroutine 33303
        /src/vendor/[github.com/scrapli/scrapligo/channel/sendinput.go:35](http://github.com/scrapli/scrapligo/channel/sendinput.go:35) +0x2f4

goroutine 101718 [chan send, 48 minutes]:
[github.com/scrapli/scrapligo/channel](http://github.com/scrapli/scrapligo/channel).(*Channel).SendInputB.func1()
        /src/vendor/[github.com/scrapli/scrapligo/channel/sendinput.go:80](http://github.com/scrapli/scrapligo/channel/sendinput.go:80) +0x4fc
created by [github.com/scrapli/scrapligo/channel](http://github.com/scrapli/scrapligo/channel).(*Channel).SendInputB in goroutine 99320
        /src/vendor/[github.com/scrapli/scrapligo/channel/sendinput.go:35](http://github.com/scrapli/scrapligo/channel/sendinput.go:35) +0x2f4

To fix the issue I decided to ditch the timer for a context with timeout. This makes it possible to completely get rid of the select and make sure that the channel is always read. I also added an additional return statement, so that in rare cases the goroutine does not try to write into the channel twice.

Please let me know if you have any suggestions for improvement.

Thanks!

Sebastian

carlmontanari commented 1 week ago

hey @sfinke0 thanks for this! on a quick look this seems great to me -- ill probably take a closer look this weekend -- I think there are probably a few other places that this could also apply that would maybe be good to update (probably send interactive, maybe also send with callbacks).

cc @netixx for viz too since they did a bunch of other fixes around some leaky goroutines earlier this year!

carlmontanari commented 6 days ago

👋 hey again @sfinke0 -- I just pushed a commit to do how you set this up on get prompt, sendinteractive and netconf caps reading.

for context/for future us to remember what the heck was going on here -- this used to be setup with the timeout handler due to blocking reads in channel. but that went away with the channel read loop that runs in the background of things. so your fix is a much better way to handle things now! I left the netconf npc and send with callbacks stuff alone for now though those should probably also be updated at some point.

if this still looks good to you let me know and ill smack merge! thanks a bunch for the help!

sfinke0 commented 4 days ago

Hi @carlmontanari,

thanks for the improvements, I really like the use of errors.Is because this preserves the original error message. LGTM for merge :blush:

Thanks again!