microsoft / retina

eBPF distributed networking observability tool for Kubernetes
https://retina.sh
MIT License

retina-agent pod failed when enabled tcpretrans plugins: invalid memory address or nil pointer dereference #311

Closed wenhuwang closed 4 months ago

wenhuwang commented 4 months ago

Describe the bug

retina-agent runs normally when the tcpretrans plugin is disabled, but panics when the plugin is enabled. The error log is as follows:

ts=2024-04-25T07:14:26.112Z level=warn caller=tcpretrans/tcpretrans_linux.go:91 msg="tcpretrans plugin does not have a gadget context" goversion=go1.21.9 os=linux arch=amd64 numcores=48 hostname=node4 podname=retina-agent-ckhc4 version=v0.0.8 apiserver=https://10.68.0.1:443 plugins=packetforward,linuxutil,dns,tcpretrans
ts=2024-04-25T07:14:26.112Z level=info caller=tcpretrans/tcpretrans_linux.go:59 msg="Initialized tcpretrans plugin" goversion=go1.21.9 os=linux arch=amd64 numcores=48 hostname=node4 podname=retina-agent-ckhc4 version=v0.0.8 apiserver=https://10.68.0.1:443 plugins=packetforward,linuxutil,dns,tcpretrans
ts=2024-04-25T07:14:26.112Z level=info caller=pluginmanager/pluginmanager.go:122 msg="Reconciled plugin" goversion=go1.21.9 os=linux arch=amd64 numcores=48 hostname=node4 podname=retina-agent-ckhc4 version=v0.0.8 apiserver=https://10.68.0.1:443 plugins=packetforward,linuxutil,dns,tcpretrans name=tcpretrans
ts=2024-04-25T07:14:26.112Z level=info caller=pluginmanager/pluginmanager.go:173 msg="starting plugin tcpretrans" goversion=go1.21.9 os=linux arch=amd64 numcores=48 hostname=node4 podname=retina-agent-ckhc4 version=v0.0.8 apiserver=https://10.68.0.1:443 plugins=packetforward,linuxutil,dns,tcpretrans
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1a84d40]

goroutine 725 [running]:
github.com/cilium/ebpf.(*MapSpec).Compatible(0x25bcf60?, 0x0)
    /go/pkg/mod/github.com/cilium/ebpf@v0.14.0/map.go:196 +0x40
github.com/cilium/ebpf.newCollectionLoader(0xc002aac960, 0xc002acbad8?)
    /go/pkg/mod/github.com/cilium/ebpf@v0.14.0/collection.go:425 +0xff
github.com/cilium/ebpf.(*CollectionSpec).LoadAndAssign(0xc002aac960, {0x25a13c0, 0xc002cd1f50}, 0xe?)
    /go/pkg/mod/github.com/cilium/ebpf@v0.14.0/collection.go:280 +0x52
github.com/inspektor-gadget/inspektor-gadget/pkg/gadgets/trace/tcpretrans/tracer.(*Tracer).install(0xc002cd1f40)
    /go/pkg/mod/github.com/inspektor-gadget/inspektor-gadget@v0.25.1-0.20240223044605-4ac24c3e3b7f/pkg/gadgets/trace/tcpretrans/tracer/tracer.go:108 +0x17c
github.com/inspektor-gadget/inspektor-gadget/pkg/gadgets/trace/tcpretrans/tracer.(*Tracer).Run(0xc002cd1f40, {0x2f201d0, 0xc002bfc000})
    /go/pkg/mod/github.com/inspektor-gadget/inspektor-gadget@v0.25.1-0.20240223044605-4ac24c3e3b7f/pkg/gadgets/trace/tcpretrans/tracer/tracer.go:56 +0x65
github.com/microsoft/retina/pkg/plugin/tcpretrans.(*tcpretrans).Start(0xc0017120c0, {0x2f15b50?, 0xc000b88320?})
    /go/src/github.com/microsoft/retina/pkg/plugin/tcpretrans/tcpretrans_linux.go:77 +0x112
github.com/microsoft/retina/pkg/managers/pluginmanager.(*PluginManager).Start.func1()
    /go/src/github.com/microsoft/retina/pkg/managers/pluginmanager/pluginmanager.go:174 +0xc2
golang.org/x/sync/errgroup.(*Group).Go.func1()
    /go/pkg/mod/golang.org/x/sync@v0.7.0/errgroup/errgroup.go:78 +0x56
created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 547
    /go/pkg/mod/golang.org/x/sync@v0.7.0/errgroup/errgroup.go:75 +0x96
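
A note on reading the trace: the top frame is `(*MapSpec).Compatible(0x25bcf60?, 0x0)`, where the value after the receiver is `0x0`, and the fault address `addr=0x18` is what a field read through a nil pointer typically looks like. That suggests a nil map reached the collection loader from the tracer's `install` step, possibly related to the "does not have a gadget context" warning above, though the trace alone does not confirm this. The following is a minimal, stdlib-only Go sketch of that failure shape and of a guard against it. `Map`, `MapSpec`, `checkReplacements`, and the map name `sockets` are hypothetical stand-ins for illustration, not the real cilium/ebpf or Inspektor Gadget code.

```go
package main

import (
	"errors"
	"fmt"
)

// Map and MapSpec are hypothetical stand-ins, NOT the real cilium/ebpf types.
type Map struct {
	typ string
}

type MapSpec struct {
	Type string
}

// Compatible mirrors the shape of the failing call: it reads a field of m
// without checking m first, so a nil m panics with a nil pointer dereference.
func (ms *MapSpec) Compatible(m *Map) error {
	if m.typ != ms.Type {
		return fmt.Errorf("expected type %s, got %s", ms.Type, m.typ)
	}
	return nil
}

var errNilReplacement = errors.New("nil map replacement")

// checkReplacements rejects nil replacements before Compatible is called,
// turning the would-be panic into an ordinary error.
func checkReplacements(specs map[string]*MapSpec, repl map[string]*Map) error {
	for name, m := range repl {
		if m == nil {
			return fmt.Errorf("%s: %w", name, errNilReplacement)
		}
		spec, ok := specs[name]
		if !ok {
			return fmt.Errorf("%s: no such map in spec", name)
		}
		if err := spec.Compatible(m); err != nil {
			return fmt.Errorf("%s: %w", name, err)
		}
	}
	return nil
}

func main() {
	specs := map[string]*MapSpec{"sockets": {Type: "hash"}}
	// Assumption for illustration: the caller never set its map, so it is nil.
	repl := map[string]*Map{"sockets": nil}
	fmt.Println(checkReplacements(specs, repl)) // prints "sockets: nil map replacement"
}
```

With a guard of this kind, the caller gets an error it can log and return instead of a panic that takes down every plugin running in the same process.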

To Reproduce

Enable the tcpretrans plugin.

Expected behavior

The retina-agent pod runs normally.

Platform (please complete the following information):

wenhuwang commented 4 months ago

@rbtr Please feel free to assign it to me. Thank you so much.

timraymond commented 4 months ago

@wenhuwang Can you check to see if v0.0.9 incidentally fixes this issue? I've been attempting to repro this issue to review your associated PR, but I haven't been successful.

boniek83 commented 4 months ago

I'm running the 0ed933e container image tag and still get it.

rbtr commented 4 months ago

I am able to repro this. I'm testing whether the proposed fix in #322 works; if it does, I will merge it and tag a new release.

rbtr commented 4 months ago

#322 still has problems.

wenhuwang commented 4 months ago

> @wenhuwang Can you check to see if v0.0.9 incidentally fixes this issue? I've been attempting to repro this issue to review your associated PR, but I haven't been successful.

@timraymond Is the error you encountered the same as this problem?

rbtr commented 4 months ago

Fixed in #322 and v0.0.11.