lima-vm / lima

Linux virtual machines, with a focus on running containers
https://lima-vm.io/
Apache License 2.0
15.39k stars 604 forks source link

gRPC port forwarder seems to occasionally hang (with `template://docker`); also panics #2859

Closed AkihiroSuda closed 1 week ago

AkihiroSuda commented 1 week ago

The gRPC port forwarder introduced in Lima v1.0 seems to have a stability issue with template://docker:

https://github.com/lima-vm/lima/actions/runs/11638179464/job/32412581211

+ limactl shell docker docker pull --quiet ghcr.io/stargz-containers/nginx:1.19-alpine-org
time="2024-11-02T00:34:15Z" level=warning msg="treating lima version \"5329fd4\" from \"/home/runner/.lima/docker/lima-version\" as very latest release"
ghcr.io/stargz-containers/nginx:1.19-alpine-org
+ limactl shell docker docker run -d --name nginx -p 127.0.0.1:8080:80 ghcr.io/stargz-containers/nginx:1.19-alpine-org
time="2024-11-02T00:34:29Z" level=warning msg="treating lima version \"5329fd4\" from \"/home/runner/.lima/docker/lima-version\" as very latest release"
c8c19bbaf7fb13352c2521819495e1ed61f1e3d799d3de42b8d9bd70ee0ea265
+ timeout 3m bash -euxc 'until curl -f --retry 30 --retry-connrefused http://127.0.0.1:8080; do sleep 3; done'
+ curl -f --retry 30 --retry-connrefused http://127.0.0.1:8080/
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:02 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:03 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:04 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:05 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:06 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:07 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:08 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:09 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:10 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:11 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:12 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:13 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:14 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:15 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:16 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:17 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:18 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:19 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:20 --:--:--     0
[...]
  0     0    0     0    0     0      0      0 --:--:--  0:02:59 --:--:--     0

LIMA_SSH_PORT_FORWARDER=true may work as a workaround.

jandubois commented 1 week ago

I'm not sure if this is related, and it is from a build from last week and not 1.0.0, but I just saw this crash (when shutting down Lima):

{"level":"info","msg":"Shutting down QEMU with the power button","time":"2024-11-06T18:49:38-07:00"}
{"level":"info","msg":"Sending QMP system_powerdown command","time":"2024-11-06T18:49:38-07:00"}
{"level":"error","msg":"could not open tcp tunnel for id: tcp-127.0.0.1:6443-127.0.0.1:52705 error:rpc error: code = Canceled desc = context canceled","time":"2024-11-06T18:49:41-07:00"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x2 addr=0x30 pc=0x102d45500]

goroutine 1077 [running]:
github.com/lima-vm/lima/pkg/portfwd.GrpcClientRW.Read({{0x140016fd9e0, 0x22}, {0x140008dab94, 0x9}, {0x0, 0x0}}, {0x14000aa0000, 0x8000, 0x0?})
        /Users/runner/work/rancher-desktop-lima/rancher-desktop-lima/lima/pkg/portfwd/client.go:134 +0x40
io.copyBuffer({0x10329e0c0, 0x1400049c4b8}, {0x10329f6c0, 0x140005ff260}, {0x0, 0x0, 0x0})
        /Users/runner/hostedtoolcache/go/1.23.2/x64/src/io/io.go:429 +0x18c
io.Copy(...)
        /Users/runner/hostedtoolcache/go/1.23.2/x64/src/io/io.go:388
net.genericReadFrom(0x14000984b00?, {0x10329f6c0, 0x140005ff260})
        /Users/runner/hostedtoolcache/go/1.23.2/x64/src/net/net.go:762 +0x58
net.(*TCPConn).readFrom(0x1400049c470, {0x10329f6c0, 0x140005ff260})
        /Users/runner/hostedtoolcache/go/1.23.2/x64/src/net/tcpsock_posix.go:54 +0x50
net.(*TCPConn).ReadFrom(0x1400049c470, {0x10329f6c0?, 0x140005ff260?})
        /Users/runner/hostedtoolcache/go/1.23.2/x64/src/net/tcpsock.go:165 +0x30
io.copyBuffer({0x12b650ef0, 0x1400049c470}, {0x10329f6c0, 0x140005ff260}, {0x0, 0x0, 0x0})
        /Users/runner/hostedtoolcache/go/1.23.2/x64/src/io/io.go:415 +0x14c
io.Copy(...)
        /Users/runner/hostedtoolcache/go/1.23.2/x64/src/io/io.go:388
github.com/lima-vm/lima/pkg/portfwd.HandleTCPConnection.func2()
        /Users/runner/work/rancher-desktop-lima/rancher-desktop-lima/lima/pkg/portfwd/client.go:37 +0x60
golang.org/x/sync/errgroup.(*Group).Go.func1()
        /Users/runner/go/pkg/mod/golang.org/x/sync@v0.8.0/errgroup/errgroup.go:78 +0x54
created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 1075
        /Users/runner/go/pkg/mod/golang.org/x/sync@v0.8.0/errgroup/errgroup.go:75 +0x98
AkihiroSuda commented 1 week ago

Saw another panic

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x10bf5c444]

goroutine 1459 [running]:
github.com/lima-vm/lima/pkg/portfwd.HandleTCPConnection({0x10c4c8780, 0xc0006d6d20}, 0xc0006162b0, {0x0, 0x0}, {0xc0005a4090, 0xa})
    /private/tmp/lima-20241106-4977-k63hh8/lima-1.0.0/pkg/portfwd/client.go:17 +0x64
created by github.com/lima-vm/lima/pkg/portfwd.(*ClosableListeners).forwardTCP in goroutine 1401
    /private/tmp/lima-20241106-4977-k63hh8/lima-1.0.0/pkg/portfwd/listener.go:90 +0x48
balajiv113 commented 1 week ago

@AkihiroSuda This panic occurred during stop or while running itself ??

AkihiroSuda commented 1 week ago

Panicked while running itself, when playing around with docker run -p 8080:80 nginx:alpine with various ports

AkihiroSuda commented 1 week ago

Looks like -p 80:80 is enough to trigger the panic in portfwd.HandleTCPConnection.

Probably the panic in portfwd.GrpcClientRW.Read reported by @jandubois is irrelevant.

jandubois commented 1 week ago

I also see the following errors earlier in ha.stderr.log (sometimes a lot more); not sure if they are expected:

{"level":"debug","msg":"error in tcp tunnel for id: tcp-127.0.0.1:6443-127.0.0.1:52484 error:readfrom tcp 127.0.0.1:6443-\u003e127.0.0.1:52484: rpc error: code = Unavailable desc = error reading from server: EOF","time":"2024-11-06T18:49:35-07:00"}
{"level":"debug","msg":"error in tcp tunnel for id: tcp-127.0.0.1:6443-127.0.0.1:52521 error:readfrom tcp 127.0.0.1:6443-\u003e127.0.0.1:52521: rpc error: code = Unavailable desc = error reading from server: EOF","time":"2024-11-06T18:49:35-07:00"}
{"level":"debug","msg":"error in tcp tunnel for id: tcp-127.0.0.1:6443-127.0.0.1:52554 error:readfrom tcp 127.0.0.1:6443-\u003e127.0.0.1:52554: rpc error: code = Unavailable desc = error reading from server: EOF","time":"2024-11-06T18:49:35-07:00"}
{"level":"debug","msg":"error in tcp tunnel for id: tcp-127.0.0.1:6443-127.0.0.1:52555 error:readfrom tcp 127.0.0.1:6443-\u003e127.0.0.1:52555: rpc error: code = Unavailable desc = error reading from server: EOF","time":"2024-11-06T18:49:35-07:00"}
{"level":"debug","msg":"error in tcp tunnel for id: tcp-[::1]:80-[::1]:52697 error:readfrom tcp [::1]:80-\u003e[::1]:52697: rpc error: code = Unavailable desc = error reading from server: EOF","time":"2024-11-06T18:49:35-07:00"}

Note that I'm not using an Ubuntu image, but Alpine. But it is a complicated test case, using k3s with spinkube operator running a wasm test... Trying to narrow it down.

It does however fail every single time (at least on aarch64, still trying to repro on Intel), so would be a good test to validate a fix...

jandubois commented 1 week ago

Probably the panic in portfwd.GrpcClientRW.Read reported by @jandubois is irrelevant.

Should I create a separate issue for this?

BTW, I can also repro this on Intel; crash happens on every shutdown.

AkihiroSuda commented 1 week ago

Yes, probably it should be a separate issue

BTW if we can't fix the issues immediately we should probably switch back the default to SSH forwarder and release v1.0.1 right now.

jandubois commented 1 week ago

I think it is unrelated, but this looks suspicious: https://github.com/lima-vm/lima/blob/6cd896f0d6196f75ef2af39d25e73f42284b78af/pkg/portfwd/client.go#L21-L24

We continue on, even when stream is not set, and later dereference it as g.stream.Recv or g.stream.Send.

Should there be a return after logging the error?

jandubois commented 1 week ago

Yes, probably it should be a separate issue

I created #2860 to fix it.

rfay commented 1 week ago

We are seeing 3 things with Lima 1.0.0 with docker, probably two are the same thing, so I'll mention them in this issue:

  1. Host-bound ports suddenly stop being bound. All of them, from all containers. At the same time. This happens after a few http requests. It's reproducible in modest time, but definitely intermittent.
  2. Docker actions hang permanently. This prevented us from getting any successful runs of lima or colima automated tests today, until this PR setting LIMA_SSH_PORT_FORWARDER=true

That PR seems to have solved it, and we have good runs now, no more hangs.

The third thing is not yet reproducible, but it's where the lima instance just stops of its own accord. limactl list shows it as stopped.

jandubois commented 1 week ago

Looks like -p 80:80 is enough to trigger the panic in portfwd.HandleTCPConnection.

Can you post exact repro steps; I can't repro this with 1.0.0

AkihiroSuda commented 1 week ago

Looks like -p 80:80 is enough to trigger the panic in portfwd.HandleTCPConnection.

Can you post exact repro steps; I can't repro this with 1.0.0

limactl start template://docker
limactl shell docker docker run -d --name nginx -p 80:80 nginx:alpine
curl localhost
cat ~/.lima/docker/ha.stderr.log

(macOS 15.1, Intel, vz, current master 4d56dab41bf9d0ad6bc3288cd49b818627e378b7)

This panic is likely causing "1. Host-bound ports suddenly stop being bound."