Closed: netixx closed this issue 3 months ago
Here is the bit of code I am using to call the driver:

// query makes sure we don't do multiple simultaneous queries to the same host.
func (p *NetconfProxy) query(q Query) (string, error) {
	conn, err := p.driverFactory(q.hostname)
	if err != nil {
		return "", fmt.Errorf("error during netconf connection to %s: %w", q.hostname, err)
	}
	if err := conn.Open(); err != nil {
		return "", fmt.Errorf("cannot open connection to %s: %w", q.hostname, err)
	}
	defer conn.Close()
	if !conn.Transport.IsAlive() {
		return "", fmt.Errorf("transport is dead after driver creation for %s", q.hostname)
	}
	var result *response.NetconfResponse
	switch q.queryType {
	case NetconfQueryTypeGet:
		result, err = conn.Get(q.query)
	case NetconfQueryTypeGetConfig:
		// config is long, increase timeout
		result, err = conn.GetConfig(q.query)
	case NetconfQueryTypeRPC:
		result, err = conn.RPC(opoptions.WithFilter(q.query))
	default:
		return "", fmt.Errorf("unknown query type %s", q.queryType)
	}
	if err != nil {
		return "", fmt.Errorf("error during netconf query call: %w", err)
	}
	if result.Failed != nil {
		return result.Result, fmt.Errorf("error during netconf query: %w", result.Failed)
	}
	return result.Result, nil
}
// driverFactory returns an opened netconf driver connection, ready to be queried.
func (p *NetconfProxy) driverFactory(hostname string) (*netconf.Driver, error) {
	// f, _ := os.Create("scrapli.log")
	logger, _ := logging.NewInstance(
		logging.WithLevel(logging.Critical),
		logging.WithLogger(p.log.Warn),
	)
	opts := []util.Option{
		// reduce CPU usage because of polling loop
		options.WithReadDelay(10 * time.Millisecond),
		options.WithPort(22),
		options.WithAuthNoStrictKey(),
		options.WithAuthUsername(p.bastionUsername),
		options.WithAuthPrivateKey(p.bastionPrivateKeyPath, ""),
		options.WithTransportType("system"),
		options.WithSystemTransportOpenArgs([]string{
			"-t", "--", "-P",
			hostname,
			"--netconf", "-p", "830",
		}),
		options.WithLogger(logger),
		options.WithTimeoutOps(120 * time.Second),
		// options.WithChannelLog(f),
	}
	// juniper has a bug which makes it mandatory to use self closing tags
	// see https://github.com/scrapli/scrapligo/issues/73 for details
	if strings.HasSuffix(hostname, "-ptx") {
		opts = append(
			opts,
			options.WithNetconfForceSelfClosingTags(),
		)
	}
	return netconf.NewDriver(
		p.bastionFQDN,
		opts...,
		// options.WithChannelLog(f),
		// options.WithLogger(logger),
	)
}
Note that I always call conn.Close(); the only case where I don't is when conn.Open() fails, but that seemed OK when I tested it.
woof. that's not good :)
yea lemme try to see if I can reproduce and see what's up. maybe you can try to just b = nil after the enqueue as that sounds reasonable, and a quick search seems to agree w/ you that the gc can't know whether it can reap the slice until it's "nil'd" out/goes all the way out of scope.
I tried setting b to nil in channel.read, but it didn't solve the issue. I think maybe because bytes.ReplaceAll returns a new slice, we need to assign its result to a new variable and nil the input after calling bytes.ReplaceAll?
(For sake of completeness, I think rb should also be set to nil in the driver: https://github.com/scrapli/scrapligo/compare/main...netixx:scrapligo:fix/b-memory-usage?expand=1)
you've got good timing - was just going to report that the nil didn't change anything for my testing too! but I also didn't seem to really reproduce things either. running for 10m getting config every 1s I ended up w/ ~ 0.3MB usage -- when I tested shorter duration it was about the same as well which seems generally "good" but confusing in this context I think.
will check out the rb/replace all stuff now as well.
in both the graphs there is like > 10 minutes where memory is totally chill. is there nothing going on in that part? what happens at that 10 minute mark? is there anything else in this container?
Also, seems like in my case the memory is getting chewed up in the regex bits (which feels reasonable) but only in the NETCONF driver read not in the channel bits.
There are indeed other things in the container, mainly a web server (gin).
The app receives queries every 3 hours; once the queries are handled, it more or less sits there waiting for the next few hours (I have another app that does a refresh every 3 hours for inventory purposes).
When I focus the memory usage on the waiting time (when there are NO queries being processed), I find the following usage:
Which means that a major part of the memory is being used by the read function (but really by the bytes.Replace function). I expect those buffers to be released once the queries are done, especially since I do a conn.Close() after each query.
I built scrapligo with -gcflags "-m" and got the following regarding channel/read.go, maybe it can help:
channel/read.go:15:6: can inline getProcessReadBufSearchDepth
channel/read.go:27:21: leaking param: rb to result ~r0 level=0
channel/read.go:34:42: ([]byte)("\n") does not escape
channel/read.go:43:7: leaking param content: c
channel/read.go:51:21: inlining call to transport.(*Transport).Read
channel/read.go:55:17: ... argument does not escape
channel/read.go:80:23: inlining call to bytes.ReplaceAll
channel/read.go:80:34: ([]byte)("\r") does not escape
channel/read.go:80:48: ([]byte)("") does not escape
channel/read.go:80:48: zero-copy string->[]byte conversion
channel/read.go:81:20: inlining call to bytes.Contains
channel/read.go:81:31: ([]byte)("\x1b") does not escape
channel/read.go:90:18: ... argument does not escape
channel/read.go:102:7: leaking param content: c
channel/read.go:115:12: ... argument does not escape
channel/read.go:115:39: string(b) escapes to heap
channel/read.go:115:40: string(b) escapes to heap
channel/read.go:125:7: leaking param content: c
channel/read.go:138:12: ... argument does not escape
channel/read.go:138:39: string(b) escapes to heap
channel/read.go:138:40: string(b) escapes to heap
channel/read.go:144:7: leaking param content: c
channel/read.go:144:34: b does not escape
channel/read.go:167:51: inlining call to getProcessReadBufSearchDepth
channel/read.go:176:7: leaking param content: c
channel/read.go:176:37: b does not escape
channel/read.go:193:20: inlining call to bytes.Contains
channel/read.go:194:51: inlining call to getProcessReadBufSearchDepth
channel/read.go:204:7: leaking param content: c
channel/read.go:221:27: inlining call to regexp.(*Regexp).Match
channel/read.go:221:27: inlining call to regexp.(*Regexp).doMatch
channel/read.go:229:7: leaking param content: c
channel/read.go:229:38: leaking param content: prompts
channel/read.go:249:14: inlining call to regexp.(*Regexp).Match
channel/read.go:249:14: inlining call to regexp.(*Regexp).doMatch
This one caught my attention:
channel/read.go:80:48: zero-copy string->[]byte conversion
I don't see any string to start with...
I also was thinking that this could be an issue: https://stackoverflow.com/questions/34436863/can-this-type-of-golang-string-slicing-leak-memory-in-underlying-byte-array. But since we expect to totally clear b when we close the connection, it should get released?
By the way, it does run OOM:
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Tue, 02 Apr 2024 03:07:52 +0200
Finished: Tue, 02 Apr 2024 09:13:54 +0200
Ready: True
Restart Count: 2
Limits:
cpu: 400m
ephemeral-storage: 200M
memory: 256Mi
Requests:
cpu: 100m
ephemeral-storage: 100M
memory: 128Mi
Environment:
GIN_MODE: release
GOMEMLIMIT: 268435456 (limits.memory)
even though GOMEMLIMIT is defined according to the memory limit of the pod, which suggests that the memory is indeed held by the program (it's not that the GC hasn't run yet).
I made a small test case here: https://github.com/netixx/scrapligo/blob/fix/b-memory-usage/channel/read_test.go
Running tool: go test -timeout 5m -run ^TestRead$ github.com/scrapli/scrapligo/channel
=== RUN TestRead
=== RUN TestRead/get-prompt-c
scrapligo/channel/read_test.go:12: get-prompt-c: starting
--- PASS: TestRead/get-prompt-c (0.00s)
scrapligo/channel/read_test.go:54: alloc: before=329336B, after=422944B
=== RUN TestRead/get-prompt-simple
scrapligo/channel/read_test.go:12: get-prompt-simple: starting
--- PASS: TestRead/get-prompt-simple (0.06s)
scrapligo/channel/read_test.go:54: alloc: before=346248B, after=463584B
=== RUN TestRead/get-prompt-a
scrapligo/channel/read_test.go:12: get-prompt-a: starting
--- PASS: TestRead/get-prompt-a (0.07s)
scrapligo/channel/read_test.go:54: alloc: before=349816B, after=391584B
=== RUN TestRead/get-prompt-b
scrapligo/channel/read_test.go:12: get-prompt-b: starting
--- PASS: TestRead/get-prompt-b (0.00s)
scrapligo/channel/read_test.go:54: alloc: before=351896B, after=430712B
--- PASS: TestRead (0.14s)
PASS
ok github.com/scrapli/scrapligo/channel 0.425s
We can see that the number of allocated objects is increasing at each call (not 100% sure what it shows yet; maybe a Benchmark with -benchmem would be better, since this shows allocs and not actual memory in use).
@netixx what Go version are you building this with?
❯ go version
go version go1.22.1 darwin/amd64
I am just throwing things at the problem now, but maybe also run https://github.com/uber-go/goleak to see if there are goroutines that are leaked?
I tried, and there does not seem to be any. I added heap profile dumps before and after running the test, and used go tool pprof --http=localhost:8081 --base %s-base.heap %s.heap to compare them.
Actually, I couldn't find anything wrong with channel/read: I ran heap profile dumps before and after, compared the two, and there was no diff...
So I tested on driver/netconf/getconfig, and there I could see additional heap data allocated after the test was finished.
However, once I added the defer clause here:
d, fileTransportObj := prepareDriver(t, testName, testCase.PayloadFile)
defer d.Close()
I couldn't see any inuse space difference anymore.
So it leads me to think that my code maybe doesn't always call Close when necessary. However,

if err := conn.Open(); err != nil {
	return "", fmt.Errorf("cannot open connection to %s: %w", q.hostname, err)
}
defer conn.Close()

looks good to me. Or should I call conn.Close() even when Open returns an error??
The defer won't be registered if an error is returned before it is reached, so the deferred Close never runs on that path. You can use an explicit Close before the return when you have to close even though Open errored.
on this last part (the err/close bits) I think there is defo a possibility (at the moment) of the transport being left open there. easy fix, and in fact we sorta handled this in channel but not well enough I think :)... and it needs to happen in all the driver places too as there is extra stuff that happens after opening channel/transport (on open funcs mostly) where we can fail and a defer'd close call would probably never get hit like in this case.
do we know if this last bit is related to the main issue though? does the app just fail to open a few times and then leave dangling connections or is this unrelated/just a new fun thing we found here :) ?
It's my assumption that defer will always be called; however, in my code I don't check the return value of conn.Close(), which may hide useful info. I will add it and report back :)
Regarding the origin of the issue: I know that there are a couple of devices that I try to connect to that are not working. Since I cannot reproduce anything with the unit tests (when calling Close), I am thinking that the tests are probably not covering the part of the code that has the issue, so either Close is not called, or it gets stuck (when Open() returns an error, there is nothing to close).
A couple of errors I get from the logs are errTimeoutError: channel timeout sending input to device, read /dev/ptmx: input/output error, and errConnectionError: encountered error output during in channel ssh authentication, error: 'permission denied'. Maybe this can help pinpoint the correct code path?
If it can help, I captured goroutine usage while the app was sleeping (in between queries). I have goroutines running these bits:
2 @ 0x43da2e 0x4726d5 0x9795b5 0x4755c1
# 0x4726d4 time.Sleep+0x114 /usr/local/go/src/runtime/time.go:195
# 0x9795b4 github.com/scrapli/scrapligo/driver/netconf.(*Driver).sendRPC.func1+0x34 scrapligo@v1.2.1-0.20240328143024-74b8c9369fde/driver/netconf/rpc.go:75
but it's only a small number,
and those:
goroutine profile: total 105
38 @ 0x43da2e 0x40688d 0x4064f7 0x974f89 0x4755c1
# 0x974f88 github.com/scrapli/scrapligo/channel.(*Channel).Close.func1+0x28 /scrapligo@v1.2.1-0.20240328143024-74b8c9369fde/channel/channel.go:187
38 @ 0x488d45 0x4872d8 0x4da4ce 0x4da4b6 0x4da351 0x4dfc12 0x4dfc0a 0x97124c 0x972167 0x975145 0x975120 0x4755c1
# 0x488d44 syscall.Syscall+0x24 /usr/local/go/src/syscall/syscall_linux.go:69
# 0x4872d7 syscall.read+0x37 /usr/local/go/src/syscall/zsyscall_linux_amd64.go:736
# 0x4da4cd syscall.Read+0x2ad /usr/local/go/src/syscall/syscall_unix.go:181
# 0x4da4b5 internal/poll.ignoringEINTRIO+0x295 /usr/local/go/src/internal/poll/fd_unix.go:736
# 0x4da350 internal/poll.(*FD).Read+0x130 /usr/local/go/src/internal/poll/fd_unix.go:160
# 0x4dfc11 os.(*File).read+0x51 /usr/local/go/src/os/file_posix.go:29
# 0x4dfc09 os.(*File).Read+0x49 /usr/local/go/src/os/file.go:118
# 0x97124b github.com/scrapli/scrapligo/transport.(*System).Read+0x4b scrapligo@v1.2.1-0.20240328143024-74b8c9369fde/transport/system.go:219
# 0x972166 github.com/scrapli/scrapligo/transport.(*Transport).read+0xa6 scrapligo@v1.2.1-0.20240328143024-74b8c9369fde/transport/transport.go:183
# 0x975144 github.com/scrapli/scrapligo/transport.(*Transport).Read+0x64 scrapligo@v1.2.1-0.20240328143024-74b8c9369fde/transport/transport.go:188
# 0x97511f github.com/scrapli/scrapligo/channel.(*Channel).read+0x3f scrapligo@v1.2.1-0.20240328143024-74b8c9369fde/channel/read.go:51
(that's 38/105 goroutines, so it seems more relevant). In that case, it seems the Close call is done, but it's actually waiting indefinitely? And alongside that, it's stuck in the Read of the transport (system in my case).
ah, sorry, sorry, @netixx I meant to say if the Close() errors then this error is never checked indeed when defer is used...
ok nice. thank you so much for all this info! this is pretty interesting stuff :D (though I wish it "just worked" instead!)
I just pushed this which I think wont "fix" things but cleans up some other stuff that all this brought to light.
a possible fix that I hate but may be the easiest/best fix is to use standard transport. system transport blocks forever on reads (fd.Read()) -- so we have a force close that ignores the lock around it and closes the fd but I dont think that actually stops the read because I think that the file descriptor doesn't support "deadlines" (docs say when you close a os.File that deadlines are set if supported and that the act of closing will auto stop any in flight reads, but doesn't seem that that happens for us). so, my hope/thought is/was that once the connection and transport go out of scope they'd get gc'd but im not sure if that actually happens.
so... testing w/ standard would be good if possible since that could "just work" then that would confirm that system is for sure the problem which and maybe give you a good path forward if that works for you.
longer term it would be nice to fix system of course if this is really an issue -- though I'm not sure it is? not ideal certainly, but is one (or even many) reads that have nothing to read a problem? I think the answer is obviously "yes" its not great, and if we can fix it of course we should but im just unsure if its actually impactful/meaningful.
I will test your fix right away and report back. At least if we kill the transport (even if the channel remains stuck), then we resolve the memory issue (but we have a goroutine leak still).
The reason we use the system transport is because we need to go through a custom jumphost (i.e. not only ssh forward): https://github.com/ovh/the-bastion. It can definitely work with crypto/ssh because I have done so in one of my apps, but we would need a custom transport.
Can we bring our own transport to scrapligo? I was thinking something along the lines of WithTransport(&myTransport{}) where myTransport fits a defined interface?
I could also contribute the "bastion" transport (once I go through the connection, it should be more-or-less the same as standard transport), but that seems a little bit too niche to be included for everyone.
Side note: I am not sure I understand everything, especially since it's a pty behind, but I don't see a "syscall.SetBlocking" anywhere in the code (this issue looks similar https://github.com/golang/go/issues/29277) - not in scrapligo and not in pty.
The change you proposed didn't fix my issue; I still have goroutines running away with their RAM :)
I did a little bit more digging, it looks like the pty lib has identified this issue: https://github.com/creack/pty/pull/167, however they reverted the fix as of now (see https://github.com/creack/pty/issues/174), so I don't know if we'll find a solution there...
But I implemented a test to reproduce the issue (and I think I succeeded), and following that I tried the following fix: https://github.com/netixx/scrapligo/blob/4020535678b0f0158d8b1f149020c28bd5f9ec42/transport/system.go#L211, that is, killing the SSH command when closing the transport (my assumption is that this in turn will shutdown everything).
Sad thing though is that I still have a lot of goroutines stuck on the read (and memory not released)...
heh, yeah didn't figure that branch would fix this, but still, some hopefully fixes that came from this! :D
killing it violently like this seems like an obvious thing to do now that you say it! funny how that works eh? hah. ok cool, lemme cherrypick that and dump it on the other branch if that works for ya!?
then to recap where we are at this point:
we nicely will now (with your fix!) close the session so that is good. but we still have goroutines hanging somehow/somewhere. I guess I would assume/hope/expect that would free the reads of the transport but... guess not?! I'll read those creack issues too.
thanks so much @netixx you are the best (sorry roman, new champ in town)
That was bothering me, that my fix fixed the test and not the actual app...
But it turns out that I "improved" my fix by checking for the ProcessState before issuing a kill... and using the debugger I found out that ProcessState is always nil. So I rolled back that change and I am deploying this new version now to test in the full environment.
I got rid of the goroutines stuck in the syscall read (and also transport.read); however, there now seems to be an issue with goroutines that are stuck in the Channel.read call (and the RAM with them).
So to summarize: Kill is a good and safe fix. I would really like to issue the kill only when everything is stuck and the process is not dead yet, but I didn't find a way to check for that, especially since in a normal scenario the SSH will die by itself... so I think some more fixes need to be done around that. (Side effects include \u001b[31mCRITICAL\u001b[0m encountered error reading from transport during channel read loop. error: read /dev/ptmx: input/output error.)
copy! will check this out more tomorrow probably and check out updated branch!
My guess is that the goroutine is stuck here:

if err != nil {
	// we got a transport error, put it into the error channel for processing during
	// the next read activity, log it, sleep and then try again...
	c.l.Criticalf(
		"encountered error reading from transport during channel read loop. error: %s", err,
	)
	c.Errs <- err

which is consistent with the log message I am getting, because the channel needs to be read by Read or ReadAll, but at this point it's never called anymore.
I just realised that I didn't base my fix on your branch (must have fumbled my git commands).
I will try a branch with both fixes (that should solve this IMO).
Ok, so I was able to test a combination of my "fix" and your branch, and I can report that I no longer have goroutine stuck in read (neither transport, nor channel). https://github.com/netixx/scrapligo/tree/test/system-transport-close-plus
I still have a couple of goroutines stuck here, and thus still some memory (but much less than before) held there.
That's this piece of code:
go func() {
	var data []byte
	for {
		data = d.getMessage(m.MessageID)
		if data != nil {
			break
		}
		time.Sleep(5 * time.Microsecond) //nolint: gomnd
	}
	done <- data
}()
I guess it needs to stop immediately when the driver is closed (and not block on done <- data)?
nice!! ok great -- I think I can get that one sorted today/tomorrow too. is that branch I cherry picked your previous fix on up to date with all your fixes? or can ya make a pr to that to add whatever else its missing maybe? then ill fix this message id one and we should be good to go!
This is just my test branch for reference.
I tested the Kill stuff, and it's safe even if the SSH process is already dead. I added a test for that.
Here is the PR on top of your existing fixes: https://github.com/scrapli/scrapligo/pull/177
There is an issue with the crypto/ssh version in go.mod. I think maybe because of the lib I am using for testing... not sure how to fix yet...
awesome! yeah there is some reason I pinned back to old crypto... I dont remember what it is though :D oh look at my elaborate notes in go.mod haha:
// v0.7.0 is the latest we can be on if we want to support 1.16
// and v0.6.0 if we want to not piss off linters it seems
helpful past carl...
will check all this out and get stuff merged today/tomorrow and will cut new release once we are happy too. thanks so much for all the work!!
You're welcome, glad to do it to support the best github network project ❤️
I merged all our combined stuff into main so I think we solved it!! huge thanks to ya'll for this work!
Hello,
I am running a REST API that runs netconf in the backend.
I am investigating an issue where my app does not release a big chunk of its memory.
![image](https://github.com/scrapli/scrapligo/assets/1488603/3521e50a-e531-4315-9514-fbb198845bab)
I used profiling to identify what is holding the memory, and I got the following:
![image](https://github.com/scrapli/scrapligo/assets/1488603/f6b4f1be-49df-45a5-a6be-602dd698f28c)
It seems that the read function is holding a lot of memory, even though there are no more queries being processed. I am not sure why it doesn't get cleaned up by the garbage collector though. Profiling suggests that it's somewhere around here: https://github.com/scrapli/scrapligo/blob/ed0178113d9d227f9cef86dc42001f4f2da48d4c/channel/read.go#L43.
My first guess is that after c.Q.Enqueue(b), b should be wiped?
Do you have any idea why? Would you have some time to troubleshoot together?