hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

fatal error: unexpected signal during runtime execution #8558

Open fanatl opened 4 years ago

fanatl commented 4 years ago

Overview of the Issue

After upgrading from version 1.4.1 to 1.7.2, the Consul agent periodically restarts or hangs.

Reproduction Steps

Consul v1.7.2, 3 servers, 254 agents.

Consul info for both Client and Server

Client info:

```
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 2
    services = 2
build:
    prerelease =
    revision = 9ea1a204
    version = 1.7.2
consul:
    acl = disabled
    known_servers = 3
    server = false
runtime:
    arch = amd64
    cpu_count = 72
    goroutines = 96
    max_procs = 72
    os = linux
    version = go1.13.7
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 295
    failed = 2
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 914365
    members = 256
    query_queue = 0
    query_time = 607
```
Server info:

```
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 6
    services = 6
build:
    prerelease =
    revision = 9ea1a204
    version = 1.7.2
consul:
    acl = disabled
    bootstrap = false
    known_datacenters = 6
    leader = false
    leader_addr = 172.16.200.32:8300
    server = true
raft:
    applied_index = 364750949
    commit_index = 364750949
    fsm_pending = 0
    last_contact = 15.676487ms
    last_log_index = 364750949
    last_log_term = 16
    last_snapshot_index = 364740740
    last_snapshot_term = 16
    latest_configuration = [{Suffrage:Voter ID:19c90ce8-ed90-ec59-bcb5-f3c2373fe6d2 Address:172.16.200.53:8300} {Suffrage:Voter ID:609cd8f2-b630-1b49-dc2f-db5889c72d42 Address:172.16.200.32:8300} {Suffrage:Voter ID:1b8a5854-e5e9-5072-e855-90c0758973aa Address:172.16.200.11:8300}]
    latest_configuration_index = 0
    num_peers = 2
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Follower
    term = 16
runtime:
    arch = amd64
    cpu_count = 48
    goroutines = 784
    max_procs = 48
    os = linux
    version = go1.13.7
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 295
    failed = 3
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 914366
    members = 257
    query_queue = 0
    query_time = 607
serf_wan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 3418
    members = 20
    query_queue = 0
    query_time = 34
```

Operating system and Environment details

OS: Oracle Linux Server release 7.6

Architecture: x86_64

Procinfo:

```
processor       : 71
vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
stepping        : 4
microcode       : 0x2000065
cpu MHz         : 2999.876
cache size      : 25344 KB
physical id     : 1
siblings        : 36
core id         : 27
cpu cores       : 18
apicid          : 119
initial apicid  : 119
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti ssbd mba ibrs ibpb stibp tpr_shadow flexpriority ept fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts flush_l1d
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips        : 4617.46
clflush size    : 64
cache_alignment : 64
address sizes   : 47 bits physical, 48 bits virtual
power management:
```
Meminfo:

```
MemTotal:       1053580972 kB
MemFree:        10921816 kB
MemAvailable:   994938308 kB
Buffers:        67404 kB
Cached:         990592488 kB
SwapCached:     0 kB
Active:         338375648 kB
Inactive:       671238784 kB
Active(anon):   20888036 kB
Inactive(anon): 2275220 kB
Active(file):   317487612 kB
Inactive(file): 668963564 kB
Unevictable:    13740 kB
Mlocked:        13740 kB
SwapTotal:      16777212 kB
SwapFree:       16777212 kB
Dirty:          2711660 kB
Writeback:      0 kB
AnonPages:      18685476 kB
Mapped:         1926308 kB
Shmem:          4209344 kB
Slab:           30215320 kB
SReclaimable:   29884528 kB
SUnreclaim:     330792 kB
KernelStack:    24944 kB
PageTables:     100816 kB
NFS_Unstable:   0 kB
Bounce:         0 kB
WritebackTmp:   0 kB
CommitLimit:    543567696 kB
Committed_AS:   25208224 kB
VmallocTotal:   34359738367 kB
VmallocUsed:    0 kB
VmallocChunk:   0 kB
HardwareCorrupted: 0 kB
AnonHugePages:  17924096 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
CmaTotal:       0 kB
CmaFree:        0 kB
HugePages_Total: 0
HugePages_Free:  0
HugePages_Rsvd:  0
HugePages_Surp:  0
Hugepagesize:   2048 kB
DirectMap4k:    7847596 kB
DirectMap2M:    584421376 kB
DirectMap1G:    480247808 kB
```

Log Fragments

```
goroutine 19 [running]:
runtime.throw(0x30c01b6, 0x2a)
/usr/local/go/src/runtime/panic.go:774 +0x72 fp=0xc0001baf30 sp=0xc0001baf00 pc=0x42f482
runtime.sigpanic()
/usr/local/go/src/runtime/signal_unix.go:378 +0x47c fp=0xc0001baf60 sp=0xc0001baf30 pc=0x444f6c
runtime.timerproc(0x50bd2c0)
/usr/local/go/src/runtime/time.go:260 +0xa2 fp=0xc0001bafd8 sp=0xc0001baf60 pc=0x44e172
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1357 +0x1 fp=0xc0001bafe0 sp=0xc0001bafd8 pc=0x45f3f1
created by runtime.(*timersBucket).addtimerLocked
/usr/local/go/src/runtime/time.go:169 +0x10e
goroutine 1 [select, 1482 minutes]:
github.com/hashicorp/consul/command/agent.(*cmd).run(0xc0001cf500, 0xc000174140, 0x4, 0x4, 0x0)
/home/circleci/project/consul/command/agent/agent.go:331 +0x13eb
github.com/hashicorp/consul/command/agent.(*cmd).Run(0xc0001cf500, 0xc000174140, 0x4, 0x4, 0xc00000cec0)
/home/circleci/project/consul/command/agent/agent.go:78 +0x4d
github.com/mitchellh/cli.(*CLI).Run(0xc00019a780, 0xc00019a780, 0x80, 0xc00000d200)
/go/pkg/mod/github.com/mitchellh/cli@v1.0.0/cli.go:255 +0x1da
```
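For context on the trace above: `runtime.timerproc` is the runtime-owned goroutine that fires pending timers in go1.13.x (it is created via `runtime.(*timersBucket).addtimerLocked`, as the "created by" line shows), and go1.14 replaced that timer implementation. Every timer created through the standard library is serviced there, so a fault in that goroutine points at the Go runtime rather than at a specific Consul code path. A minimal sketch, not Consul code, only illustrating how a timer is handed off to the runtime:

```go
// Minimal sketch (not Consul code): standard-library timers are owned and
// fired by the Go runtime itself.
package main

import (
	"fmt"
	"time"
)

func main() {
	done := make(chan struct{})

	// time.AfterFunc hands a timer to the runtime. Under go1.13.x the timer is
	// queued via runtime.(*timersBucket).addtimerLocked and fired by the
	// runtime.timerproc goroutine seen in the trace above; the callback then
	// runs in its own goroutine. go1.14 reworked this mechanism.
	time.AfterFunc(100*time.Millisecond, func() {
		fmt.Println("timer fired by the runtime's timer machinery")
		close(done)
	})

	<-done
}
```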
stepanovmm1992 commented 4 years ago

Yes, I have the same problem.

dnephin commented 4 years ago

Thank you for the report! This sounds like it may be an issue with the go runtime.

For anyone who has hit this problem, which Linux kernel version are you using (`uname -a`)?

Release v1.7.2 was built with go1.13.7. This Go issue seems like it might be related: https://github.com/golang/go/issues/35777

I believe this was fixed in go1.14, which we use to build the v1.8.x releases. Upgrading to 1.8.x may resolve the problem.

Later v1.7.x releases (e.g. 1.7.7) were also built with a newer version of Go, which may also include the fix.
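To double-check which Go toolchain a particular consul binary was built with (alongside the kernel version asked about above), `consul info` reports it under `runtime.version`, and `go version /path/to/consul` prints the same information. The sketch below is a generic helper, not part of Consul; it assumes a Linux host (for `/proc/version`) and Go 1.18+ (for `debug/buildinfo`):

```go
// Sketch (not part of Consul): print the kernel version and the Go toolchain
// a given binary was built with. Assumes Linux and Go 1.18+ (debug/buildinfo).
package main

import (
	"debug/buildinfo"
	"fmt"
	"os"
)

func main() {
	// Kernel version: roughly the same information as `uname -a`.
	if v, err := os.ReadFile("/proc/version"); err == nil {
		fmt.Printf("kernel: %s", v)
	}

	// Go toolchain embedded in an arbitrary binary, e.g. the consul executable.
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: checkbuild <path-to-binary>")
		os.Exit(2)
	}
	info, err := buildinfo.ReadFile(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, "read build info:", err)
		os.Exit(1)
	}
	fmt.Printf("%s was built with %s\n", os.Args[1], info.GoVersion)
}
```

Run it as, for example, `go run checkbuild.go /usr/local/bin/consul` (the file name and install path here are only illustrative).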

fanatl commented 4 years ago

```
uname -a
Linux 4.14.35-1818.3.3.el7uek.x86_64 #2 SMP Mon Sep 24 14:45:01 PDT 2018 x86_64 x86_64 x86_64 GNU/Linux
```

Version 1.7.7 is installed. The agent restarts are still happening.

Detailed logs are attached.

consul-server-2_2020.08.28.log consul-agent-kn-0033_2020.08.28.log consul-agent-kn_0030_2020.08.28.log

dnephin commented 4 years ago

Thank you for the report and the logs! I'm not sure what is happening here, but from what I can tell it is an issue with the Go runtime. I've opened an issue on the Go issue tracker (https://github.com/golang/go/issues/41099) to see if they can help.

If you are able to test with the latest 1.8.x release (which was built with go1.14.x) that might help as well.

dnephin commented 4 years ago

It sounds like we will need to try to reproduce with go1.14.x or go1.15, since go1.13.x is no longer supported with the release of go1.15.

I built a version of Consul 1.7.7 using go1.14.7. You can find those binaries built in CI here: https://app.circleci.com/pipelines/github/hashicorp/consul/12178/workflows/c0691c42-089a-4e26-b966-8d9ae1dcd8c9/jobs/229429/artifacts

Note that these are not official release binaries, but the only change from the official release is the change in Go version.

fanatl commented 4 years ago

Thanks for the help.

We installed the Consul build from your link and are now monitoring how it behaves.

fanatl commented 4 years ago

Unfortunately, the restarts are still happening.

We found a pattern: the service restarts occur only on hosts with Intel Optane memory running in RAM (Memory) mode.

Probably this is a Go runtime issue.

dnephin commented 4 years ago

Ah, good find!

If you can provide logs from the binary built with go1.14.7 I will update the issue I opened on the golang issue tracker (https://github.com/golang/go/issues/41099). They may be able to help find the problem.

fanatl commented 4 years ago

Sure, log attached.

consul-agent-kn-0030_2020.09.02_trace.log

fanatl commented 4 years ago

I have attached the logs in a previous post. Could you please update the issue (golang/go#41099)?

fanatl commented 3 years ago

@dnephin Any news on this issue?