scrapli / scrapligo

scrapli, but in go!
MIT License
244 stars 35 forks source link

Netconf channel stuck in ReadUntilPrompt #178

Closed netixx closed 2 months ago

netixx commented 2 months ago

Hello,

In my quest to squash all remaining memory leaks, I found one more: image

I am now using the Custom Transport (which in my case is more or less the same as standard transport), and I am seeing some goroutines stuck in ReadUntilPrompt, which in turn is leaving some memory unreleased.

It looks like it's stuck in an infinite loop here:

// ReadUntilPrompt reads bytes out of the channel Q object until the channel PromptPattern regex
// pattern is seen in the output. Once that pattern is seen, all read bytes are returned.
func (c *Channel) ReadUntilPrompt() ([]byte, error) {
    var rb []byte

    for {
        nb, err := c.Read()
        if err != nil {
            return nil, err
        }

        if nb == nil {
            time.Sleep(c.ReadDelay)

            continue
        }

        rb = append(rb, nb...)

        if c.PromptPattern.Match(processReadBuf(rb, c.PromptSearchDepth)) {
            return rb, nil
        }
    }
}

I was expecting the ReadUntilPrompt to respect options.WithTimeoutOps(120 * time.Second), parameter, but looking at the code it doesn't seem to actually apply to the Open phase (which is causing the issue in my case).

I was thinking we could start a timer (or a context) like in GetPrompt() ? And maybe this should apply for all Read call (so that no call to Read can last more than timeout seconds).

This made me think about contexts as well, shouldn't all the network bound operations of scrapligo accept a context as a first argument ? That way we can control as a user with context.WithTimeout what duration we all for each phase/call ?

carlmontanari commented 2 months ago

bam, another one :D haha we'll get them all cleaned up one day 🤣

makes sense and I will tackle this this week and report back! regarding the context thing... yea... it probably "should" be done like that to be idiomatic. it isn't because it was more or less 1:1 translated from python when I first wrote things. I also had some notion that I wanted it to stay as similar/same as python things.

so.. for now I think I wont do anything about this context business, but it is on my mind/radar for sure.

netixx commented 2 months ago

okay :) Honestly I guessed that you already thought about contexts, and had a reason not to have done it yet, but I figured I should mention it anyways for completeness.

My guess is that there are probably other Channel operations that could also use a timer to prevent indefinite execution time, but I am not sure which ones...

If it helps I could write a test for that (emulating a verrrry slow ssh peer) ?

carlmontanari commented 2 months ago

yeah, tests are always welcome of course!

I think in general there is just timers on the get prompt and send input methods and nothing/not much else (maybe auth I guess). which I guess leads to the question: are you calling read until prompt directly or via something else?

I guess (thinking quickly so this is maybe very wrong) we need to send a done channel into readuntilprompt/any/etc. so that when we time them out from something like "send input" we can send the done signal to them. lemme know if that makes sense/sounds reasonable :)

netixx commented 2 months ago

I am not calling it directly, the call graph is Open > processServerCapabilities > ReadUntilPrompt.

func (d *Driver) processServerCapabilities() error {
    b, err := d.Channel.ReadUntilPrompt()
    if err != nil {
        return err
    }

It's actually possible that it's the router that never finishes it's capabilities (we've seen that before: https://github.com/carlmontanari/scrapli/issues/233), but nevertheless, we should terminate open before the defined timeout.

netixx commented 2 months ago

I pushed a test and a couple of test-fixtures to https://github.com/netixx/scrapligo/blob/fix/open-timeout/driver/netconf/open_test.go

I set timeoutOps to 1 second, and give the fake driver 5s to complete the open/close sequence.

I added one test for fail (truncated capabilities) and one test for success using the start of one of the other test-fixture.

carlmontanari commented 2 months ago

sorry for the delay! lets follow up in #179 -- I cherry picked your tests and tweaked a bit, but it passes nicely now and I think I got all the other read until X stuff properly closing out now too