scrapli / scrapligo

scrapli, but in go!

Memory not released in channel read (maybe :) ) #174

Closed: netixx closed this issue 3 months ago

netixx commented 3 months ago

Hello,

I am running a REST API that runs netconf in the backend.

I am investigating an issue where my app does not release a big chunk of its memory: [screenshot]

I used profiling to identify what is holding the memory, and I got the following: [screenshot]

It seems that the read function is holding a lot of memory, even though there are no more queries being processed. I am not sure why it doesn't get cleaned up by the garbage collector though. Profiling suggests that it's somewhere around here: https://github.com/scrapli/scrapligo/blob/ed0178113d9d227f9cef86dc42001f4f2da48d4c/channel/read.go#L43.

My first guess is that b should be wiped after c.Q.Enqueue(b)?

Do you have any idea why? Would you have some time to troubleshoot together?

netixx commented 3 months ago

Here is the bit of code I am using to call the driver:

// query makes sure we don't do multiple simultaneous queries to the same host.
func (p *NetconfProxy) query(q Query) (string, error) {
    conn, err := p.driverFactory(q.hostname)
    if err != nil {
        return "", fmt.Errorf("error during netconf connection to %s: %w", q.hostname, err)
    }
    if err := conn.Open(); err != nil {
        return "", fmt.Errorf("cannot open connection to %s: %w", q.hostname, err)
    }
    defer conn.Close()

    if !conn.Transport.IsAlive() {
        return "", fmt.Errorf("transport is dead after driver creation for %s", q.hostname)
    }

    var result *response.NetconfResponse
    switch q.queryType {
    case NetconfQueryTypeGet:
        result, err = conn.Get(q.query)
    case NetconfQueryTypeGetConfig:
        // config is long, increase timeout
        result, err = conn.GetConfig(q.query)
    case NetconfQueryTypeRPC:
        result, err = conn.RPC(opoptions.WithFilter(q.query))
    default:
        return "", fmt.Errorf("unknown query type %s", q.queryType)
    }

    if err != nil {
        return "", fmt.Errorf("error during netconf query call: %w", err)
    }

    if result.Failed != nil {
        return result.Result, fmt.Errorf("error during netconf query: %w", result.Failed)
    }

    return result.Result, nil
}

// driverFactory returns an opened netconf driver connection, ready to be queried.
func (p *NetconfProxy) driverFactory(hostname string) (*netconf.Driver, error) {
    // f, _ := os.Create("scrapli.log")
    logger, _ := logging.NewInstance(
        logging.WithLevel(logging.Critical),
        logging.WithLogger(p.log.Warn),
    )
    opts := []util.Option{
        // reduce CPU usage because of polling loop
        options.WithReadDelay(10 * time.Millisecond),
        options.WithPort(22),
        options.WithAuthNoStrictKey(),
        options.WithAuthUsername(p.bastionUsername),
        options.WithAuthPrivateKey(p.bastionPrivateKeyPath, ""),
        options.WithTransportType("system"),
        options.WithSystemTransportOpenArgs([]string{
            "-t", "--", "-P",
            hostname,
            "--netconf", "-p", "830",
        }),
        options.WithLogger(logger),
        options.WithTimeoutOps(120 * time.Second),
        // options.WithChannelLog(f),
    }
    // juniper has a bug which make it mandatory to use self closing tags
    // see https://github.com/scrapli/scrapligo/issues/73 for details
    if strings.HasSuffix(hostname, "-ptx") {
        opts = append(
            opts,
            options.WithNetconfForceSelfClosingTags(),
        )
    }
    return netconf.NewDriver(
        p.bastionFQDN,
        opts...,
    // options.WithChannelLog(f),
    // options.WithLogger(logger),
    )
}

Note that I always call conn.Close(); the only case where I don't is when conn.Open() fails, but that seemed OK when I tested it.

carlmontanari commented 3 months ago

woof. that's not good :)

yea, lemme try to see if I can reproduce and see what's up. maybe you can try to just b = nil after the enqueue? that sounds reasonable, and a quick search seems to agree w/ you that the gc can't know whether it can reap the slice until it's "nil'd" out/goes all the way out of scope.

netixx commented 3 months ago

I tried setting b to nil in channel.read, but it didn't solve the issue.

[screenshot]

I think that because bytes.ReplaceAll returns a new slice, we may need to assign its result to a new variable and nil the input after calling bytes.ReplaceAll?

(For the sake of completeness, I think rb should also be set to nil in the driver: https://github.com/scrapli/scrapligo/compare/main...netixx:scrapligo:fix/b-memory-usage?expand=1)
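
For illustration, the kind of change being discussed would look roughly like this (a sketch only, with hypothetical names, not the actual channel/read.go code; as noted below, it did not end up fixing the issue):

// sketch: approximate shape of the read-loop change being proposed,
// not the real scrapligo read.go.
func processRead(c *Channel, raw []byte) {
    // bytes.ReplaceAll allocates and returns a new slice, so keep only the
    // cleaned copy and drop the reference to the raw read buffer.
    cleaned := bytes.ReplaceAll(raw, []byte("\r"), []byte(""))
    raw = nil // let the GC reclaim the raw buffer once nothing else holds it

    c.Q.Enqueue(cleaned)
}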

carlmontanari commented 3 months ago

you've got good timing - I was just going to report that the nil didn't change anything in my testing either! but I also didn't really seem to reproduce things. running for 10m, getting config every 1s, I ended up w/ ~0.3MB usage -- when I tested a shorter duration it was about the same, which seems generally "good" but confusing in this context I think.

will check out the rb/replace all stuff now as well.

in both the graphs there is like > 10 minutes where memory is totally chill. is there nothing going on in that part? what happens at that 10 minute mark? is there anything else in this container?

Also, seems like in my case the memory is getting chewed up in the regex bits (which feels reasonable) but only in the NETCONF driver read not in the channel bits.

[screenshot]

netixx commented 3 months ago

There are indeed other things in the container, mainly a web server (gin).

The app receives queries every 3 hours; once the queries are handled, it more or less sits there waiting for the next batch (I have another app that does a refresh every 3 hours for inventory purposes). When I focus the memory usage on the waiting time (when there are NO queries being processed), I find the following usage: [screenshot]

This means that a major part of the memory is being held by the read function (really by the bytes.Replace call), but I expect it to be released once the queries are done, especially since I call conn.Close() after each query.

I built scrapligo with -gcflags "-m" and got the following for channel/read.go; maybe it can help:

channel/read.go:15:6: can inline getProcessReadBufSearchDepth
channel/read.go:27:21: leaking param: rb to result ~r0 level=0
channel/read.go:34:42: ([]byte)("\n") does not escape
channel/read.go:43:7: leaking param content: c
channel/read.go:51:21: inlining call to transport.(*Transport).Read
channel/read.go:55:17: ... argument does not escape
channel/read.go:80:23: inlining call to bytes.ReplaceAll
channel/read.go:80:34: ([]byte)("\r") does not escape
channel/read.go:80:48: ([]byte)("") does not escape
channel/read.go:80:48: zero-copy string->[]byte conversion
channel/read.go:81:20: inlining call to bytes.Contains
channel/read.go:81:31: ([]byte)("\x1b") does not escape
channel/read.go:90:18: ... argument does not escape
channel/read.go:102:7: leaking param content: c
channel/read.go:115:12: ... argument does not escape
channel/read.go:115:39: string(b) escapes to heap
channel/read.go:115:40: string(b) escapes to heap
channel/read.go:125:7: leaking param content: c
channel/read.go:138:12: ... argument does not escape
channel/read.go:138:39: string(b) escapes to heap
channel/read.go:138:40: string(b) escapes to heap
channel/read.go:144:7: leaking param content: c
channel/read.go:144:34: b does not escape
channel/read.go:167:51: inlining call to getProcessReadBufSearchDepth
channel/read.go:176:7: leaking param content: c
channel/read.go:176:37: b does not escape
channel/read.go:193:20: inlining call to bytes.Contains
channel/read.go:194:51: inlining call to getProcessReadBufSearchDepth
channel/read.go:204:7: leaking param content: c
channel/read.go:221:27: inlining call to regexp.(*Regexp).Match
channel/read.go:221:27: inlining call to regexp.(*Regexp).doMatch
channel/read.go:229:7: leaking param content: c
channel/read.go:229:38: leaking param content: prompts
channel/read.go:249:14: inlining call to regexp.(*Regexp).Match
channel/read.go:249:14: inlining call to regexp.(*Regexp).doMatch

This one caught my attention:

channel/read.go:80:48: zero-copy string->[]byte conversion

I don't see any string to start with...

I was also thinking that this could be an issue: https://stackoverflow.com/questions/34436863/can-this-type-of-golang-string-slicing-leak-memory-in-underlying-byte-array. But since we expect to totally clear b when we close the connection, it should get released?
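
For context, the slicing pitfall from that StackOverflow answer, and the usual copy-based workaround, look something like this (generic illustration, not scrapligo code):

// firstN demonstrates the pitfall: sub pins the entire backing array of big
// for as long as it is referenced, whereas the copied slice does not.
func firstN(big []byte, n int) []byte {
    sub := big[:n]         // still references big's whole backing array
    out := make([]byte, n) // fresh, right-sized backing array
    copy(out, sub)
    return out // big's backing array can now be collected by the caller
}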

By the way, it does run OOM:

    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Tue, 02 Apr 2024 03:07:52 +0200
      Finished:     Tue, 02 Apr 2024 09:13:54 +0200
    Ready:          True
    Restart Count:  2
    Limits:
      cpu:                400m
      ephemeral-storage:  200M
      memory:             256Mi
    Requests:
      cpu:                100m
      ephemeral-storage:  100M
      memory:             128Mi
    Environment:
      GIN_MODE:                            release
      GOMEMLIMIT:                          268435456 (limits.memory)

even though GOMEMLIMIT is defined according to the memory limit of the pod, which suggests that the memory is indeed held by the program (it's not that the GC hasn't run yet).

I made a small test case here: https://github.com/netixx/scrapligo/blob/fix/b-memory-usage/channel/read_test.go

Running tool: go test -timeout 5m -run ^TestRead$ github.com/scrapli/scrapligo/channel

=== RUN   TestRead
=== RUN   TestRead/get-prompt-c
   scrapligo/channel/read_test.go:12: get-prompt-c: starting
--- PASS: TestRead/get-prompt-c (0.00s)
    scrapligo/channel/read_test.go:54: alloc: before=329336B, after=422944B
=== RUN   TestRead/get-prompt-simple
    scrapligo/channel/read_test.go:12: get-prompt-simple: starting
--- PASS: TestRead/get-prompt-simple (0.06s)
    scrapligo/channel/read_test.go:54: alloc: before=346248B, after=463584B
=== RUN   TestRead/get-prompt-a
    scrapligo/channel/read_test.go:12: get-prompt-a: starting
--- PASS: TestRead/get-prompt-a (0.07s)
    scrapligo/channel/read_test.go:54: alloc: before=349816B, after=391584B
=== RUN   TestRead/get-prompt-b
    scrapligo/channel/read_test.go:12: get-prompt-b: starting
--- PASS: TestRead/get-prompt-b (0.00s)
    scrapligo/channel/read_test.go:54: alloc: before=351896B, after=430712B
--- PASS: TestRead (0.14s)
PASS
ok      github.com/scrapli/scrapligo/channel    0.425s

We can see that the number of allocated objects is increasing at each call (not 100% sure what it shows yet; maybe a Benchmark with -benchmem would be better, but it shows allocs and not actual memory in use).
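
Something along these lines would turn that into per-operation allocation numbers with -benchmem (runSingleRead is a hypothetical helper standing in for one iteration of the TestRead cases above):

func BenchmarkChannelRead(b *testing.B) {
    b.ReportAllocs() // report allocs/op and B/op even without -benchmem
    for i := 0; i < b.N; i++ {
        runSingleRead(b) // hypothetical: performs one read cycle like the subtests above
    }
}

// run with: go test -bench=BenchmarkChannelRead -benchmem ./channel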

hellt commented 3 months ago

@netixx what Go version are you building this with?

netixx commented 3 months ago
❯ go version
go version go1.22.1 darwin/amd64
hellt commented 3 months ago

I am just throwing things at the problem now, but maybe also run https://github.com/uber-go/goleak to see if there are goroutines that are leaked?
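
For reference, wiring goleak into a test package is usually just a TestMain; this is goleak's standard usage, nothing scrapligo-specific:

package channel_test

import (
    "testing"

    "go.uber.org/goleak"
)

// TestMain fails the package's tests if goroutines are still running when
// they finish, which would surface leaked reader goroutines.
func TestMain(m *testing.M) {
    goleak.VerifyTestMain(m)
}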

netixx commented 3 months ago

I tried; there does not seem to be any. I added heap profile dumps before and after running the test, and used go tool pprof --http=localhost:8081 --base %s-base.heap %s.heap to compare them.
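
A sketch of that before/after heap-dump approach using runtime/pprof (file naming is illustrative; requires "os", "runtime", "runtime/pprof" and "testing"):

// writeHeapProfile dumps a heap profile to path; calling it before and after
// a test gives the two files compared with `go tool pprof --base`.
func writeHeapProfile(t *testing.T, path string) {
    t.Helper()
    f, err := os.Create(path)
    if err != nil {
        t.Fatal(err)
    }
    defer f.Close()
    runtime.GC() // make the heap statistics current before dumping
    if err := pprof.WriteHeapProfile(f); err != nil {
        t.Fatal(err)
    }
}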

Actually, I couldn't find anything wrong with channel/read: I ran heap profile dumps before and after and compared the two, and there was no diff...

So, I tested on driver/netconf/getconfig, and there I could see additional heap data allocated after the test was finished.

However, once I added the defer clause here:

d, fileTransportObj := prepareDriver(t, testName, testCase.PayloadFile)
defer d.Close()

I couldn't see any inuse space difference anymore.

So it leads me to think that my code maybe doesn't always call Close when necessary. However

if err := conn.Open(); err != nil {
    return "", fmt.Errorf("cannot open connection to %s: %w", q.hostname, err)
}
defer conn.Close()

looks good to me. Or should I call conn.Close() even when Open returns an error?

hellt commented 3 months ago

The deferred Close is never registered if Open returns an error, since you return before reaching the defer. So when you need to close even on error, do it explicitly before returning instead of relying on the defer.
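
Applied to the query function above, that would look something like this (a sketch; whether Close is safe to call after a failed Open is an assumption worth verifying):

conn, err := p.driverFactory(q.hostname)
if err != nil {
    return "", fmt.Errorf("error during netconf connection to %s: %w", q.hostname, err)
}
if err := conn.Open(); err != nil {
    // the deferred Close below is never registered on this path, so clean up
    // explicitly to avoid leaving a half-opened transport behind
    _ = conn.Close()
    return "", fmt.Errorf("cannot open connection to %s: %w", q.hostname, err)
}
defer conn.Close()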

carlmontanari commented 3 months ago

on this last part (the err/close bits) I think there is defo a possibility (at the moment) of the transport being left open there. easy fix, and in fact we sorta handled this in channel but not well enough I think :)... and it needs to happen in all the driver places too, as there is extra stuff that happens after opening the channel/transport (in the on-open funcs mostly) where we can fail and a defer'd close call would probably never get hit, like in this case.

do we know if this last bit is related to the main issue though? does the app just fail to open a few times and then leave dangling connections or is this unrelated/just a new fun thing we found here :) ?

netixx commented 3 months ago

It's my assumption that defer will always be called; however, in my code I don't check the return value of conn.Close(), which may hide useful info. I will add it and report back :)

Regarding the origin of the issue: I know there are a couple of devices I try to connect to that are not working. Since I cannot reproduce anything with the unit tests (when calling Close), I am thinking that the tests are probably not covering the part of the code that has the issue, so either:

A couple of errors I get (from the logs) are errTimeoutError: channel timeout sending input to device, read /dev/ptmx: input/output error and errConnectionError: encountered error output during in channel ssh authentication, error: 'permission denied'. Maybe this can help pinpoint the correct code path?

If it can help, I captured goroutine usage while the app was sleeping (in between queries). I have goroutines running these bits:

2 @ 0x43da2e 0x4726d5 0x9795b5 0x4755c1
#   0x4726d4    time.Sleep+0x114                                /usr/local/go/src/runtime/time.go:195
#   0x9795b4    github.com/scrapli/scrapligo/driver/netconf.(*Driver).sendRPC.func1+0x34    scrapligo@v1.2.1-0.20240328143024-74b8c9369fde/driver/netconf/rpc.go:75

but it's only a small number,

and those:

goroutine profile: total 105
38 @ 0x43da2e 0x40688d 0x4064f7 0x974f89 0x4755c1
#   0x974f88    github.com/scrapli/scrapligo/channel.(*Channel).Close.func1+0x28    /scrapligo@v1.2.1-0.20240328143024-74b8c9369fde/channel/channel.go:187

38 @ 0x488d45 0x4872d8 0x4da4ce 0x4da4b6 0x4da351 0x4dfc12 0x4dfc0a 0x97124c 0x972167 0x975145 0x975120 0x4755c1
#   0x488d44    syscall.Syscall+0x24                        /usr/local/go/src/syscall/syscall_linux.go:69
#   0x4872d7    syscall.read+0x37                       /usr/local/go/src/syscall/zsyscall_linux_amd64.go:736
#   0x4da4cd    syscall.Read+0x2ad                      /usr/local/go/src/syscall/syscall_unix.go:181
#   0x4da4b5    internal/poll.ignoringEINTRIO+0x295             /usr/local/go/src/internal/poll/fd_unix.go:736
#   0x4da350    internal/poll.(*FD).Read+0x130                  /usr/local/go/src/internal/poll/fd_unix.go:160
#   0x4dfc11    os.(*File).read+0x51                        /usr/local/go/src/os/file_posix.go:29
#   0x4dfc09    os.(*File).Read+0x49                        /usr/local/go/src/os/file.go:118
#   0x97124b    github.com/scrapli/scrapligo/transport.(*System).Read+0x4b  scrapligo@v1.2.1-0.20240328143024-74b8c9369fde/transport/system.go:219
#   0x972166    github.com/scrapli/scrapligo/transport.(*Transport).read+0xa6   scrapligo@v1.2.1-0.20240328143024-74b8c9369fde/transport/transport.go:183
#   0x975144    github.com/scrapli/scrapligo/transport.(*Transport).Read+0x64   scrapligo@v1.2.1-0.20240328143024-74b8c9369fde/transport/transport.go:188
#   0x97511f    github.com/scrapli/scrapligo/channel.(*Channel).read+0x3f   scrapligo@v1.2.1-0.20240328143024-74b8c9369fde/channel/read.go:51

(that's 38/105 goroutines, so it seems more relevant). In that case, it seems the Close call is made, but it's actually waiting indefinitely? And alongside that, it's stuck in the Read of the transport (system in my case).
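
(For reference, a dump like the one above can be produced with the standard goroutine profile; this is the generic way to capture it, not necessarily how it was captured here:)

// dumpGoroutines writes an aggregated text goroutine profile (the format
// quoted above) to w; debug=1 groups identical stacks with a count.
func dumpGoroutines(w io.Writer) error {
    return pprof.Lookup("goroutine").WriteTo(w, 1)
}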

hellt commented 3 months ago

ah, sorry, sorry @netixx, I meant to say that if Close() errors, then that error is indeed never checked when defer is used...

carlmontanari commented 3 months ago

ok nice. thank you so much for all this info! this is pretty interesting stuff :D (though I wish it "just worked" instead!)

I just pushed this, which I think won't "fix" things but cleans up some other stuff that all this brought to light.

a possible fix that I hate, but may be the easiest/best, is to use the standard transport. system transport blocks forever on reads (fd.Read()) -- so we have a force close that ignores the lock around it and closes the fd, but I don't think that actually stops the read, because I think the file descriptor doesn't support "deadlines" (docs say that when you close an os.File, deadlines are set if supported and the act of closing will auto-stop any in-flight reads, but that doesn't seem to happen for us). so, my hope/thought is/was that once the connection and transport go out of scope they'd get gc'd, but I'm not sure that actually happens.
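
The deadline point can be probed directly: os.File.SetReadDeadline returns an error matching os.ErrNoDeadline when the file does not support deadlines, which would explain why closing the fd does not interrupt the in-flight read (a generic check, not scrapligo code):

// supportsReadDeadline reports whether reads on f can be interrupted via
// deadlines; if not (os.ErrNoDeadline), closing f may be the only way to
// unblock a pending Read, and as described above even that may not help here.
func supportsReadDeadline(f *os.File) bool {
    err := f.SetReadDeadline(time.Now().Add(time.Second))
    return !errors.Is(err, os.ErrNoDeadline)
}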

so... testing w/ standard would be good if possible. if that "just works" it would confirm that system is for sure the problem, and maybe give you a good path forward if that works for you.

longer term it would be nice to fix system of course, if this is really an issue -- though I'm not sure it is? not ideal certainly, but is one (or even many) reads that have nothing to read a problem? I think the answer is obviously "yes, it's not great", and if we can fix it of course we should, but I'm just unsure if it's actually impactful/meaningful.

netixx commented 3 months ago

I will test your fix right away and report back. At least if we kill the transport (even if the channel remains stuck), we resolve the memory issue (but we still have a goroutine leak).

The reason we use the system transport is that we need to go through a custom jumphost (i.e. not just SSH forwarding): https://github.com/ovh/the-bastion. It can definitely work with crypto/ssh, because I have done so in one of my apps, but we would need a custom transport.

Can we bring our own transport to scrapligo? I was thinking of something along the lines of WithTransport(&myTransport{}), where myTransport satisfies a defined interface. I could also contribute the "bastion" transport (once through the connection, it should be more or less the same as the standard transport), but that seems a little too niche to be included for everyone.
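
A rough sketch of the kind of pluggable-transport hook being asked for; both the interface and WithTransport below are hypothetical, not existing scrapligo API:

// Implementation sketches the minimal contract a user-supplied transport
// (for example a bastion-aware one) might need to satisfy.
type Implementation interface {
    Open() error
    Close() error
    Read(n int) ([]byte, error)
    Write(b []byte) error
    IsAlive() bool
}

// usage (hypothetical):
//   d, err := netconf.NewDriver(host, WithTransport(&bastionTransport{...}))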

Side note: I am not sure I understand everything, especially since there's a pty behind it, but I don't see a "syscall.SetNonblock" anywhere in the code (this issue looks similar: https://github.com/golang/go/issues/29277) - not in scrapligo and not in pty.

netixx commented 3 months ago

The changes you proposed didn't fix my issue, I still have goroutines running away with their RAM :)

I did a little more digging; it looks like the pty lib has identified this issue: https://github.com/creack/pty/pull/167. However, they have reverted the fix for now (see https://github.com/creack/pty/issues/174), so I don't know if we'll find a solution there...

But I implemented a test to reproduce the issue (and I think I succeeded), and following that I tried the following fix: https://github.com/netixx/scrapligo/blob/4020535678b0f0158d8b1f149020c28bd5f9ec42/transport/system.go#L211, that is, killing the SSH command when closing the transport (my assumption is that this in turn will shut down everything).
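
The shape of that change is roughly the following (an approximation of the linked commit; the sshCmd/ptyFile field names are placeholders, not the real System transport fields):

// Close kills the underlying ssh process so that blocked reads on the pty
// return and the reader goroutine can exit, then closes the pty file.
func (t *System) Close() error {
    if t.sshCmd != nil && t.sshCmd.Process != nil {
        _ = t.sshCmd.Process.Kill() // returns an error if already exited, which we ignore
    }
    return t.ptyFile.Close()
}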

The sad thing, though, is that I still have a lot of goroutines stuck on the read (and memory not released)... [screenshot]

carlmontanari commented 3 months ago

heh, yeah, didn't figure that branch would fix this, but still, hopefully some fixes came from this! :D

killing it violently like this seems like an obvious thing to do now that you say it! funny how that works eh? hah. ok cool, lemme cherrypick that and dump it on the other branch if that works for ya!?

then to recap where we are at this point:

we will now nicely (with your fix!) close the session, so that is good. but we still have goroutines hanging somehow/somewhere. I would assume/hope/expect that would free the reads of the transport but... guess not?! I'll read those creack issues too.

thanks so much @netixx you are the best (sorry roman, new champ in town)

netixx commented 3 months ago

That was bothering me: my fix fixed the test but not the actual app... It turns out that I had "improved" my fix by checking ProcessState before issuing a kill... and using the debugger I found out that ProcessState is always nil... so I rolled back that change and I am deploying this new version now to test in the full environment.

I got rid of the goroutines stuck in the syscall read (and also in transport.read); however, there still seems to be an issue with goroutines stuck in the Channel.read call (and the RAM with it): [screenshot]

So to summarize:

carlmontanari commented 3 months ago

copy! will check this out more tomorrow probably and check out updated branch!

netixx commented 3 months ago

My guess is that the goroutine is stuck here:

if err != nil {
    // we got a transport error, put it into the error channel for processing during
    // the next read activity, log it, sleep and then try again...
    c.l.Criticalf(
        "encountered error reading from transport during channel read loop. error: %s", err,
    )

    c.Errs <- err

which is consistent with the log message I am getting.

because the error channel needs to be drained by Read or ReadAll, but at this point neither is ever called again, so the send on c.Errs blocks forever.
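
One general way to avoid that kind of blocking send is to select against a done channel that gets closed when the channel/driver is closed (a generic pattern with a hypothetical c.done, not the actual scrapligo change):

select {
case c.Errs <- err:
case <-c.done: // hypothetical: closed by Channel.Close, lets the read loop exit
    return
}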

I just realised that I didn't base my fix on your branch (must have fumbled my git commands).

I will try a branch with both fixes (that should solve this IMO).

netixx commented 3 months ago

Ok, so I was able to test a combination of my "fix" and your branch, and I can report that I no longer have goroutines stuck in read (neither transport nor channel). https://github.com/netixx/scrapligo/tree/test/system-transport-close-plus

I still have a couple of goroutines stuck here ([screenshot]), and thus still some memory (though much less than before) held there.

That's this piece of code:

go func() {
    var data []byte

    for {
        data = d.getMessage(m.MessageID)
        if data != nil {
            break
        }

        time.Sleep(5 * time.Microsecond) //nolint: gomnd
    }

    done <- data
}()

I guess it needs to stop immediately when the driver is closed (and not block on done <- data)?
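
One way to do that would be for the polling goroutine to select on a driver-done channel instead of sleeping and then blocking on the send (sketch only; d.done is hypothetical):

go func() {
    for {
        if data := d.getMessage(m.MessageID); data != nil {
            select {
            case done <- data:
            case <-d.done: // hypothetical: closed when the driver is closed
            }
            return
        }

        select {
        case <-d.done: // stop polling once the driver is closed
            return
        case <-time.After(5 * time.Microsecond):
        }
    }
}()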

carlmontanari commented 3 months ago

nice!! ok great -- I think I can get that one sorted today/tomorrow too. is that branch I cherry-picked your previous fix onto up to date with all your fixes? or can ya make a pr to that to add whatever else it's missing maybe? then I'll fix this message id one and we should be good to go!

netixx commented 3 months ago

This is just my test branch for reference.

I tested the Kill stuff, and it's safe even if the SSH process is already dead. I added a test for that.

Here is the PR on top of your existing fixes: https://github.com/scrapli/scrapligo/pull/177

There is an issue with the crypto/ssh version in go.mod. I think it's because of the lib I am using for testing... not sure how to fix it yet...

carlmontanari commented 3 months ago

awesome! yeah there is some reason I pinned back to old crypto... I don't remember what it is though :D oh, look at my elaborate notes in go.mod haha:

    // v0.7.0 is the latest we can be on if we want to support 1.16
    // and v0.6.0 if we want to not piss off linters it seems

helpful past carl...

will check all this out and get stuff merged today/tomorrow and will cut a new release once we are happy too. thanks so much for all the work!!

netixx commented 3 months ago

You're welcome, glad to do it to support the best GitHub network project ❤️

carlmontanari commented 3 months ago

I merged all our combined stuff into main, so I think we solved it!! huge thanks to y'all for this work!