Closed by bert2002 3 years ago
Thank you for the bug report!
We had another report of this in #9482. I've closed that issue so we can track it here. It sounds like this bug exists in 1.9.0 as well.
We don't have much of a lead on this yet; we'll need to do some more investigation.
Can you tell me more about how you use Consul (e.g. for KV, service discovery, and/or Connect service mesh)? Do you know if you might have multiple service instances with the same name on a single node? That shouldn't be a problem, but I wonder if it might be a factor in triggering this bug.
Greetings 👋 ! We've been seeing these panics over the last couple of days. We upgraded from 1.8.4 to 1.9.1 and our cluster has already crashed a few times:
Jan 17 18:37:39 consul-server-node consul[18915]: panic: runtime error: invalid memory address or nil pointer dereference
Jan 17 18:37:39 consul-server-node consul[18915]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x7ad532]
Jan 17 18:37:39 consul-server-node consul[18915]: goroutine 36 [running]:
Jan 17 18:37:39 consul-server-node consul[18915]: github.com/hashicorp/go-immutable-radix.(*Iterator).Next(0xc001b91120, 0x0, 0xc001b91240, 0x0, 0xc0013b2c00, 0x0, 0xffffffffffffffff)
Jan 17 18:37:39 consul-server-node consul[18915]: /go/pkg/mod/github.com/hashicorp/go-immutable-radix@v1.3.0/iter.go:178 +0xb2
Jan 17 18:37:39 consul-server-node consul[18915]: github.com/hashicorp/go-memdb.(*radixIterator).Next(0xc0013b2be0, 0xc001b51260, 0x5944f)
Jan 17 18:37:39 consul-server-node consul[18915]: /go/pkg/mod/github.com/hashicorp/go-memdb@v1.3.0/txn.go:895 +0x2e
Jan 17 18:37:39 consul-server-node consul[18915]: github.com/hashicorp/consul/agent/consul/state.cleanupGatewayWildcards(0x38f5800, 0xc001b51260, 0x5944f, 0xc00141a300, 0x0, 0x0)
Jan 17 18:37:39 consul-server-node consul[18915]: /home/circleci/project/consul/agent/consul/state/catalog.go:2783 +0xe8
Jan 17 18:37:39 consul-server-node consul[18915]: github.com/hashicorp/consul/agent/consul/state.(*Store).deleteServiceTxn(0xc00113a1b0, 0x38f5800, 0xc001b51260, 0x5944f, 0xc0020e5ce0, 0x10, 0xc001420500, 0x79, 0xc00141a470, 0x0, ...)
Jan 17 18:37:39 consul-server-node consul[18915]: /home/circleci/project/consul/agent/consul/state/catalog.go:1565 +0xcb0
Jan 17 18:37:39 consul-server-node consul[18915]: github.com/hashicorp/consul/agent/consul/state.(*Store).deleteNodeTxn(0xc00113a1b0, 0x38f5800, 0xc001b51260, 0x5944f, 0xc0020e5ce0, 0x10, 0xb25ddc, 0xc0020cd500)
Jan 17 18:37:39 consul-server-node consul[18915]: /home/circleci/project/consul/agent/consul/state/catalog.go:715 +0x62d
Jan 17 18:37:39 consul-server-node consul[18915]: github.com/hashicorp/consul/agent/consul/state.(*Store).DeleteNode(0xc00113a1b0, 0x5944f, 0xc0020e5ce0, 0x10, 0x0, 0x0)
Jan 17 18:37:39 consul-server-node consul[18915]: /home/circleci/project/consul/agent/consul/state/catalog.go:648 +0xbb
Jan 17 18:37:39 consul-server-node consul[18915]: github.com/hashicorp/consul/agent/consul/fsm.(*FSM).applyDeregister(0xc00053c240, 0xc001e0c0a1, 0x4b, 0x4b, 0x5944f, 0x0, 0x0)
Jan 17 18:37:39 consul-server-node consul[18915]: /home/circleci/project/consul/agent/consul/fsm/commands_oss.go:171 +0x41a
Jan 17 18:37:39 consul-server-node consul[18915]: github.com/hashicorp/consul/agent/consul/fsm.NewFromDeps.func1(0xc001e0c0a1, 0x4b, 0x4b, 0x5944f, 0xc00059e100, 0xc0020d96c0)
Jan 17 18:37:39 consul-server-node consul[18915]: /home/circleci/project/consul/agent/consul/fsm/fsm.go:99 +0x56
Jan 17 18:37:39 consul-server-node consul[18915]: github.com/hashicorp/consul/agent/consul/fsm.(*FSM).Apply(0xc00053c240, 0xc00130bea0, 0x0, 0x0)
Jan 17 18:37:39 consul-server-node consul[18915]: /home/circleci/project/consul/agent/consul/fsm/fsm.go:133 +0x1b6
Jan 17 18:37:39 consul-server-node consul[18915]: github.com/hashicorp/go-raftchunking.(*ChunkingFSM).Apply(0xc0010570b0, 0xc00130bea0, 0x5191aa0, 0xbff93b58c077682e)
Jan 17 18:37:39 consul-server-node consul[18915]: /go/pkg/mod/github.com/hashicorp/go-raftchunking@v0.6.1/fsm.go:66 +0x5b
Jan 17 18:37:39 consul-server-node consul[18915]: github.com/hashicorp/raft.(*Raft).runFSM.func1(0xc001570320)
Jan 17 18:37:39 consul-server-node consul[18915]: /go/pkg/mod/github.com/hashicorp/raft@v1.2.0/fsm.go:90 +0x2c2
Jan 17 18:37:39 consul-server-node consul[18915]: github.com/hashicorp/raft.(*Raft).runFSM.func2(0xc0015e5a00, 0x40, 0x40)
Jan 17 18:37:39 consul-server-node consul[18915]: /go/pkg/mod/github.com/hashicorp/raft@v1.2.0/fsm.go:113 +0x75
Jan 17 18:37:39 consul-server-node consul[18915]: github.com/hashicorp/raft.(*Raft).runFSM(0xc0002f3500)
Jan 17 18:37:39 consul-server-node consul[18915]: /go/pkg/mod/github.com/hashicorp/raft@v1.2.0/fsm.go:219 +0x3c4
Jan 17 18:37:39 consul-server-node consul[18915]: github.com/hashicorp/raft.(*raftState).goFunc.func1(0xc0002f3500, 0xc00116a950)
Jan 17 18:37:39 consul-server-node consul[18915]: /go/pkg/mod/github.com/hashicorp/raft@v1.2.0/state.go:146 +0x55
Jan 17 18:37:39 consul-server-node consul[18915]: created by github.com/hashicorp/raft.(*raftState).goFunc
Jan 17 18:37:39 consul-server-node consul[18915]: /go/pkg/mod/github.com/hashicorp/raft@v1.2.0/state.go:144 +0x66
Jan 17 18:37:39 consul-server-node systemd[1]: consul.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
We're using the cluster for a mix of service discovery and Connect service mesh, with around 60 nodes.
Let me know if we can help you with more debugging information 🙏 ! Thanks a lot.
Can you tell me more about how you use Consul (e.g. for KV, service discovery, and/or Connect service mesh)? Do you know if you might have multiple service instances with the same name on a single node? That shouldn't be a problem, but I wonder if it might be a factor in triggering this bug.
Running in a three-node cluster with Connect service mesh and service discovery; ACLs disabled, TLS enabled.
Thank you to everyone who has reported and provided information about this panic! We have identified the problem and have a couple of patches to fix it. There should be a 1.9.2 release very soon which will include the fix.
Unfortunately we haven't found any workarounds yet. The bug is triggered when a node is deleted, but it is probably hard to avoid that. Any time an agent is restarted it will perform a node delete.
Overview of the Issue
I want to upgrade from 1.8.6 to 1.9.1 and am running into this nil pointer dereference.
Consul info for both Client and Server
OS: Debian 10
Consul: 1.9.1
Log Fragments
Any idea which data it freaks out on?
Cheers, bert