Open fanatl opened 4 years ago
Yes, I have the same problem.
Thank you for the report! This sounds like it may be an issue with the Go runtime.
For anyone who has hit this problem, which Linux kernel version are you using (`uname -a`)?
Release v1.7.2 was built with go1.13.7. This Go issue seems like it might be related: https://github.com/golang/go/issues/35777
I believe this was fixed in go1.14, which we use to build the v1.8.x releases. Upgrading to 1.8.x may resolve the problem.
Later v1.7.x releases (e.g. 1.7.7) were also built with a newer version of Go, which may also include the fix.
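Since the suggested fix depends on which Go version the binary was built with, it can help to confirm that from the agent itself. The `runtime` section of `consul info` reports it (for example `version = go1.13.7` in the output posted in this issue). Below is a minimal sketch, assuming that output format; the helper names `go_version` and `built_with_at_least` are hypothetical and not part of Consul.

```python
import re

# Matches the Go toolchain version as it appears in the `runtime` section of
# `consul info`, e.g. "version = go1.13.7". The Consul version in the `build`
# section ("version = 1.7.2") is skipped because it lacks the "go" prefix.
GO_VERSION_RE = re.compile(r"version\s*=\s*go(\d+)\.(\d+)(?:\.(\d+))?")


def go_version(consul_info):
    """Return the Go version as a (major, minor, patch) tuple, or None if absent."""
    match = GO_VERSION_RE.search(consul_info)
    if match is None:
        return None
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def built_with_at_least(consul_info, minimum=(1, 14, 0)):
    """Return True if the reported Go version is at or above `minimum`."""
    version = go_version(consul_info)
    return version is not None and version >= minimum


if __name__ == "__main__":
    # Sample text taken from the client info in this issue; in practice,
    # pass in the captured output of `consul info`.
    sample = "runtime:\n    arch = amd64\n    os = linux\n    version = go1.13.7\n"
    print(go_version(sample), built_with_at_least(sample))
```

This only reports the toolchain version; it does not tell you whether a given runtime fix is actually present in that build.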
uname -a
Linux 4.14.35-1818.3.3.el7uek.x86_64 #2 SMP Mon Sep 24 14:45:01 PDT 2018 x86_64 x86_64 x86_64 GNU/Linux
Version 1.7.7 is installed. Reboots are still happening.
Detailed logs are attached.
consul-server-2_2020.08.28.log consul-agent-kn-0033_2020.08.28.log consul-agent-kn_0030_2020.08.28.log
Thank you for the report and the logs! I'm not sure what is happening here, but from what I can tell it is an issue with the Go runtime. I've opened an issue on the Go issue tracker (https://github.com/golang/go/issues/41099) to see if they can help.
If you are able to test with the latest 1.8.x release (which was built with go1.14.x), that might help as well.
It sounds like we will need to try to reproduce with go1.14.x or go1.15, since go1.13.x is no longer supported with the release of go1.15.
I built a version of Consul 1.7.7 using go1.14.7. You can find those binaries built in CI here: https://app.circleci.com/pipelines/github/hashicorp/consul/12178/workflows/c0691c42-089a-4e26-b966-8d9ae1dcd8c9/jobs/229429/artifacts
Note that these are not official release binaries, but the only change from the official release is the change in Go version.
Thanks for the help.
I installed the Consul build from your link, and we are monitoring how it behaves.
Unfortunately the reboots are still going on.
Found a pattern: the service reboots occur only on hosts with Intel Optane connected in RAM mode.
Probably this is a Go runtime issue.
Ah, good find!
If you can provide logs from the binary built with go1.14.7 I will update the issue I opened on the golang issue tracker (https://github.com/golang/go/issues/41099). They may be able to help find the problem.
Sure, log attached.
I attached the logs in a previous post. Could you please update the issue (golang/go#41099)?
@dnephin Any news on this issue?
Overview of the Issue
After upgrading from version 1.4.1 to 1.7.2, the Consul agent periodically restarts or hangs.
Reproduction Steps
Consul v1.7.2, 3 servers, 254 agents
Consul info for both Client and Server
Client info
```
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 2
    services = 2
build:
    prerelease =
    revision = 9ea1a204
    version = 1.7.2
consul:
    acl = disabled
    known_servers = 3
    server = false
runtime:
    arch = amd64
    cpu_count = 72
    goroutines = 96
    max_procs = 72
    os = linux
    version = go1.13.7
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 295
    failed = 2
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 914365
    members = 256
    query_queue = 0
    query_time = 607
```
Server info
```
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 6
    services = 6
build:
    prerelease =
    revision = 9ea1a204
    version = 1.7.2
consul:
    acl = disabled
    bootstrap = false
    known_datacenters = 6
    leader = false
    leader_addr = 172.16.200.32:8300
    server = true
raft:
    applied_index = 364750949
    commit_index = 364750949
    fsm_pending = 0
    last_contact = 15.676487ms
    last_log_index = 364750949
    last_log_term = 16
    last_snapshot_index = 364740740
    last_snapshot_term = 16
    latest_configuration = [{Suffrage:Voter ID:19c90ce8-ed90-ec59-bcb5-f3c2373fe6d2 Address:172.16.200.53:8300} {Suffrage:Voter ID:609cd8f2-b630-1b49-dc2f-db5889c72d42 Address:172.16.200.32:8300} {Suffrage:Voter ID:1b8a5854-e5e9-5072-e855-90c0758973aa Address:172.16.200.11:8300}]
    latest_configuration_index = 0
    num_peers = 2
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Follower
    term = 16
runtime:
    arch = amd64
    cpu_count = 48
    goroutines = 784
    max_procs = 48
    os = linux
    version = go1.13.7
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 295
    failed = 3
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 914366
    members = 257
    query_queue = 0
    query_time = 607
serf_wan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 3418
    members = 20
    query_queue = 0
    query_time = 34
```
Operating system and Environment details
OS: Oracle Linux Server release 7.6
Architecture: x86_64
Procinfo
```
processor : 71
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
stepping : 4
microcode : 0x2000065
cpu MHz : 2999.876
cache size : 25344 KB
physical id : 1
siblings : 36
core id : 27
cpu cores : 18
apicid : 119
initial apicid : 119
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti ssbd mba ibrs ibpb stibp tpr_shadow flexpriority ept fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts flush_l1d
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips : 4617.46
clflush size : 64
cache_alignment : 64
address sizes : 47 bits physical, 48 bits virtual
power management:
```
Meminfo
```
MemTotal: 1053580972 kB
MemFree: 10921816 kB
MemAvailable: 994938308 kB
Buffers: 67404 kB
Cached: 990592488 kB
SwapCached: 0 kB
Active: 338375648 kB
Inactive: 671238784 kB
Active(anon): 20888036 kB
Inactive(anon): 2275220 kB
Active(file): 317487612 kB
Inactive(file): 668963564 kB
Unevictable: 13740 kB
Mlocked: 13740 kB
SwapTotal: 16777212 kB
SwapFree: 16777212 kB
Dirty: 2711660 kB
Writeback: 0 kB
AnonPages: 18685476 kB
Mapped: 1926308 kB
Shmem: 4209344 kB
Slab: 30215320 kB
SReclaimable: 29884528 kB
SUnreclaim: 330792 kB
KernelStack: 24944 kB
PageTables: 100816 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 543567696 kB
Committed_AS: 25208224 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
HardwareCorrupted: 0 kB
AnonHugePages: 17924096 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
CmaTotal: 0 kB
CmaFree: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 7847596 kB
DirectMap2M: 584421376 kB
DirectMap1G: 480247808 kB
```
Log Fragments