Closed: wchrisjohnson closed this issue 8 years ago.
@wchrisjohnson The first "Stopping database" log entry should happen at keeper stop, but since there aren't other info entries between them I'm not sure where the second one comes from.
Can you provide the keeper and sentinel logs with --debug enabled? Have you changed any other options in the cluster config (besides init_with_multiple_keepers)?
I can and will provide the logs, @sgotti - init_with_multiple_keepers is the only option we have changed:
{
"init_with_multiple_keepers": true
}
It's not much more info, @sgotti, but here it is:
Keeper log
2016-10-06 21:42:58.094763 [pkg_logger.go:114] W | keeper: password file /etc/secrets/replication permissions 01000000777 are too open. This file should only be readable to the user executing stolon! Continuing...
2016-10-06 21:42:58.094837 [pkg_logger.go:114] W | keeper: password file /etc/secrets/stolon permissions 01000000777 are too open. This file should only be readable to the user executing stolon! Continuing...
2016-10-06 21:42:58.094891 [pkg_logger.go:134] I | keeper: id: 31966b92
2016-10-06 21:42:58.094907 [pkg_logger.go:134] I | keeper: running under kubernetes.
2016-10-06 21:42:58.119022 [pkg_logger.go:147] D | keeper: clusterView: (*cluster.ClusterView){Version:(int)4 Master:(string)417cadde KeepersRole:(cluster.KeepersRole)map[31966b92:<*>{ID:(string)31966b92 Follow:(string)417cadde} 417cadde:<*>{ID:(string)417cadde Follow:(string)} 5f2b7590:<*>{ID:(string)5f2b7590 Follow:(string)417cadde}] ProxyConf:(*cluster.ProxyConf){Host:(string)172.18.70.21 Port:(string)5432} Config:(*cluster.NilConfig){RequestTimeout:(*cluster.Duration)<nil> SleepInterval:(*cluster.Duration)<nil> KeeperFailInterval:(*cluster.Duration)<nil> MaxStandbysPerSender:(*uint)<nil> SynchronousReplication:(*bool)<nil> InitWithMultipleKeepers:(*bool)true UsePGRewind:(*bool)<nil> PGParameters:(*map[string]string)<nil>} ChangeTime:(time.Time)2016-10-06 21:03:54.811603438 +0000 UTC}
2016-10-06 21:42:58.119060 [pkg_logger.go:147] D | keeper: clusterConfig: (*cluster.Config){RequestTimeout:(time.Duration)10s SleepInterval:(time.Duration)5s KeeperFailInterval:(time.Duration)20s MaxStandbysPerSender:(uint)3 SynchronousReplication:(bool)false InitWithMultipleKeepers:(bool)true UsePGRewind:(bool)false PGParameters:(map[string]string)map[]}
2016-10-06 21:42:58.119215 [pkg_logger.go:134] I | postgresql: Stopping database
2016-10-06 21:42:58.121831 [pkg_logger.go:147] D | keeper: got signal: interrupt
2016-10-06 21:43:58.188227 [pkg_logger.go:147] D | keeper: stopping stolon keeper
2016-10-06 21:43:58.188338 [pkg_logger.go:134] I | postgresql: Stopping database
Sentinel Log
2016-10-06 21:51:44.040097 [pkg_logger.go:147] D | sentinel: newcv: (*cluster.ClusterView){Version:(int)4 Master:(string)417cadde KeepersRole:(cluster.KeepersRole)map[31966b92:<*>{ID:(string)31966b92 Follow:(string)417cadde} 417cadde:<*>{ID:(string)417cadde Follow:(string)} 5f2b7590:<*>{ID:(string)5f2b7590 Follow:(string)417cadde}] ProxyConf:(*cluster.ProxyConf){Host:(string)172.18.70.21 Port:(string)5432} Config:(*cluster.NilConfig){RequestTimeout:(*cluster.Duration)<nil> SleepInterval:(*cluster.Duration)<nil> KeeperFailInterval:(*cluster.Duration)<nil> MaxStandbysPerSender:(*uint)<nil> SynchronousReplication:(*bool)<nil> InitWithMultipleKeepers:(*bool)true UsePGRewind:(*bool)<nil> PGParameters:(*map[string]string)<nil>} ChangeTime:(time.Time)2016-10-06 21:03:54.811603438 +0000 UTC}
2016-10-06 21:51:49.048418 [pkg_logger.go:147] D | sentinel: keepersState: (cluster.KeepersState)map[5f2b7590:<*>{ID:(string)5f2b7590 ErrorStartTime:(time.Time)0001-01-01 00:00:00 +0000 UTC Healthy:(bool)true ClusterViewVersion:(int)4 ListenAddress:(string)172.18.10.20 Port:(string)5431 PGListenAddress:(string)172.18.10.20 PGPort:(string)5432 PGState:(*cluster.PostgresState){Initialized:(bool)true Role:(common.Role)master SystemID:(string)6338449658295033882 TimelineID:(uint64)1 XLogPos:(uint64)23757144 TimelinesHistory:(cluster.PostgresTimeLinesHistory)[]}} 31966b92:<*>{ID:(string)31966b92 ErrorStartTime:(time.Time)2016-10-06 21:03:29.698090049 +0000 UTC Healthy:(bool)false ClusterViewVersion:(int)3 ListenAddress:(string)172.18.10.22 Port:(string)5431 PGListenAddress:(string)172.18.10.22 PGPort:(string)5432 PGState:(*cluster.PostgresState){Initialized:(bool)true Role:(common.Role)master SystemID:(string)6338449635358691354 TimelineID:(uint64)1 XLogPos:(uint64)50341680 TimelinesHistory:(cluster.PostgresTimeLinesHistory)[]}} 417cadde:<*>{ID:(string)417cadde ErrorStartTime:(time.Time)0001-01-01 00:00:00 +0000 UTC Healthy:(bool)true ClusterViewVersion:(int)4 ListenAddress:(string)172.18.70.21 Port:(string)5431 PGListenAddress:(string)172.18.70.21 PGPort:(string)5432 PGState:(*cluster.PostgresState){Initialized:(bool)true Role:(common.Role)master SystemID:(string)6338449635358691354 TimelineID:(uint64)1 XLogPos:(uint64)50343784 TimelinesHistory:(cluster.PostgresTimeLinesHistory)[]}}]
2016-10-06 21:51:49.048500 [pkg_logger.go:147] D | sentinel: clusterView: (*cluster.ClusterView){Version:(int)4 Master:(string)417cadde KeepersRole:(cluster.KeepersRole)map[31966b92:<*>{ID:(string)31966b92 Follow:(string)417cadde} 417cadde:<*>{ID:(string)417cadde Follow:(string)} 5f2b7590:<*>{ID:(string)5f2b7590 Follow:(string)417cadde}] ProxyConf:(*cluster.ProxyConf){Host:(string)172.18.70.21 Port:(string)5432} Config:(*cluster.NilConfig){RequestTimeout:(*cluster.Duration)<nil> SleepInterval:(*cluster.Duration)<nil> KeeperFailInterval:(*cluster.Duration)<nil> MaxStandbysPerSender:(*uint)<nil> SynchronousReplication:(*bool)<nil> InitWithMultipleKeepers:(*bool)true UsePGRewind:(*bool)<nil> PGParameters:(*map[string]string)<nil>} ChangeTime:(time.Time)2016-10-06 21:03:54.811603438 +0000 UTC}
2016-10-06 21:51:49.048551 [pkg_logger.go:147] D | sentinel: sentinelInfo: (*cluster.SentinelInfo){ID:(string)ce03d3cf ListenAddress:(string)172.18.10.25 Port:(string)6431}
2016-10-06 21:51:49.052108 [pkg_logger.go:147] D | sentinel: using store discovery
logger: write failed: Connection refused
2016-10-06 21:51:49.055872 [pkg_logger.go:147] D | sentinel: keepersDiscoveryInfo: (cluster.KeepersDiscoveryInfo)[<*>{ListenAddress:(string)172.18.70.21 Port:(string)5431} <*>{ListenAddress:(string)172.18.10.20 Port:(string)5431}]
2016-10-06 21:51:49.056867 [pkg_logger.go:147] D | sentinel: keepersInfo: (cluster.KeepersInfo)map[5f2b7590:<*>{ID:(string)5f2b7590 ClusterViewVersion:(int)4 ListenAddress:(string)172.18.10.20 Port:(string)5431 PGListenAddress:(string)172.18.10.20 PGPort:(string)5432} 417cadde:<*>{ID:(string)417cadde ClusterViewVersion:(int)4 ListenAddress:(string)172.18.70.21 Port:(string)5431 PGListenAddress:(string)172.18.70.21 PGPort:(string)5432}]
2016-10-06 21:51:49.057593 [pkg_logger.go:147] D | sentinel: keepersPGState: (map[string]*cluster.PostgresState)map[5f2b7590:<*>{Initialized:(bool)true Role:(common.Role)master SystemID:(string)6338449658295033882 TimelineID:(uint64)1 XLogPos:(uint64)23757144 TimelinesHistory:(cluster.PostgresTimeLinesHistory)[]} 417cadde:<*>{Initialized:(bool)true Role:(common.Role)master SystemID:(string)6338449635358691354 TimelineID:(uint64)1 XLogPos:(uint64)50343784 TimelinesHistory:(cluster.PostgresTimeLinesHistory)[]}]
2016-10-06 21:51:49.057693 [pkg_logger.go:147] D | sentinel: newKeepersState: (cluster.KeepersState)map[5f2b7590:<*>{ID:(string)5f2b7590 ErrorStartTime:(time.Time)0001-01-01 00:00:00 +0000 UTC Healthy:(bool)true ClusterViewVersion:(int)4 ListenAddress:(string)172.18.10.20 Port:(string)5431 PGListenAddress:(string)172.18.10.20 PGPort:(string)5432 PGState:(*cluster.PostgresState){Initialized:(bool)true Role:(common.Role)master SystemID:(string)6338449658295033882 TimelineID:(uint64)1 XLogPos:(uint64)23757144 TimelinesHistory:(cluster.PostgresTimeLinesHistory)[]}} 31966b92:<*>{ID:(string)31966b92 ErrorStartTime:(time.Time)2016-10-06 21:03:29.698090049 +0000 UTC Healthy:(bool)false ClusterViewVersion:(int)3 ListenAddress:(string)172.18.10.22 Port:(string)5431 PGListenAddress:(string)172.18.10.22 PGPort:(string)5432 PGState:(*cluster.PostgresState){Initialized:(bool)true Role:(common.Role)master SystemID:(string)6338449635358691354 TimelineID:(uint64)1 XLogPos:(uint64)50341680 TimelinesHistory:(cluster.PostgresTimeLinesHistory)[]}} 417cadde:<*>{ID:(string)417cadde ErrorStartTime:(time.Time)0001-01-01 00:00:00 +0000 UTC Healthy:(bool)true ClusterViewVersion:(int)4 ListenAddress:(string)172.18.70.21 Port:(string)5431 PGListenAddress:(string)172.18.70.21 PGPort:(string)5432 PGState:(*cluster.PostgresState){Initialized:(bool)true Role:(common.Role)master SystemID:(string)6338449635358691354 TimelineID:(uint64)1 XLogPos:(uint64)50343784 TimelinesHistory:(cluster.PostgresTimeLinesHistory)[]}}]
2016-10-06 21:51:49.057738 [pkg_logger.go:147] D | sentinel: masterState: (*cluster.KeeperState){ID:(string)417cadde ErrorStartTime:(time.Time)0001-01-01 00:00:00 +0000 UTC Healthy:(bool)true ClusterViewVersion:(int)4 ListenAddress:(string)172.18.70.21 Port:(string)5431 PGListenAddress:(string)172.18.70.21 PGPort:(string)5432 PGState:(*cluster.PostgresState){Initialized:(bool)true Role:(common.Role)master SystemID:(string)6338449635358691354 TimelineID:(uint64)1 XLogPos:(uint64)50343784 TimelinesHistory:(cluster.PostgresTimeLinesHistory)[]}}
2016-10-06 21:51:49.057837 [pkg_logger.go:147] D | sentinel: newcv: (*cluster.ClusterView){Version:(int)4 Master:(string)417cadde KeepersRole:(cluster.KeepersRole)map[417cadde:<*>{ID:(string)417cadde Follow:(string)} 5f2b7590:<*>{ID:(string)5f2b7590 Follow:(string)417cadde} 31966b92:<*>{ID:(string)31966b92 Follow:(string)417cadde}] ProxyConf:(*cluster.ProxyConf){Host:(string)172.18.70.21 Port:(string)5432} Config:(*cluster.NilConfig){RequestTimeout:(*cluster.Duration)<nil> SleepInterval:(*cluster.Duration)<nil> KeeperFailInterval:(*cluster.Duration)<nil> MaxStandbysPerSender:(*uint)<nil> SynchronousReplication:(*bool)<nil> InitWithMultipleKeepers:(*bool)true UsePGRewind:(*bool)<nil> PGParameters:(*map[string]string)<nil>} ChangeTime:(time.Time)2016-10-06 21:03:54.811603438 +0000 UTC}
2016-10-06 21:51:54.065882 [pkg_logger.go:147] D | sentinel: keepersState: (cluster.KeepersState)map[31966b92:<*>{ID:(string)31966b92 ErrorStartTime:(time.Time)2016-10-06 21:03:29.698090049 +0000 UTC Healthy:(bool)false ClusterViewVersion:(int)3 ListenAddress:(string)172.18.10.22 Port:(string)5431 PGListenAddress:(string)172.18.10.22 PGPort:(string)5432 PGState:(*cluster.PostgresState){Initialized:(bool)true Role:(common.Role)master SystemID:(string)6338449635358691354 TimelineID:(uint64)1 XLogPos:(uint64)50341680 TimelinesHistory:(cluster.PostgresTimeLinesHistory)[]}} 417cadde:<*>{ID:(string)417cadde ErrorStartTime:(time.Time)0001-01-01 00:00:00 +0000 UTC Healthy:(bool)true ClusterViewVersion:(int)4 ListenAddress:(string)172.18.70.21 Port:(string)5431 PGListenAddress:(string)172.18.70.21 PGPort:(string)5432 PGState:(*cluster.PostgresState){Initialized:(bool)true Role:(common.Role)master SystemID:(string)6338449635358691354 TimelineID:(uint64)1 XLogPos:(uint64)50343784 TimelinesHistory:(cluster.PostgresTimeLinesHistory)[]}} 5f2b7590:<*>{ID:(string)5f2b7590 ErrorStartTime:(time.Time)0001-01-01 00:00:00 +0000 UTC Healthy:(bool)true ClusterViewVersion:(int)4 ListenAddress:(string)172.18.10.20 Port:(string)5431 PGListenAddress:(string)172.18.10.20 PGPort:(string)5432 PGState:(*cluster.PostgresState){Initialized:(bool)true Role:(common.Role)master SystemID:(string)6338449658295033882 TimelineID:(uint64)1 XLogPos:(uint64)23757144 TimelinesHistory:(cluster.PostgresTimeLinesHistory)[]}}]
2016-10-06 21:51:54.065960 [pkg_logger.go:147] D | sentinel: clusterView: (*cluster.ClusterView){Version:(int)4 Master:(string)417cadde KeepersRole:(cluster.KeepersRole)map[31966b92:<*>{ID:(string)31966b92 Follow:(string)417cadde} 417cadde:<*>{ID:(string)417cadde Follow:(string)} 5f2b7590:<*>{ID:(string)5f2b7590 Follow:(string)417cadde}] ProxyConf:(*cluster.ProxyConf){Host:(string)172.18.70.21 Port:(string)5432} Config:(*cluster.NilConfig){RequestTimeout:(*cluster.Duration)<nil> SleepInterval:(*cluster.Duration)<nil> KeeperFailInterval:(*cluster.Duration)<nil> MaxStandbysPerSender:(*uint)<nil> SynchronousReplication:(*bool)<nil> InitWithMultipleKeepers:(*bool)true UsePGRewind:(*bool)<nil> PGParameters:(*map[string]string)<nil>} ChangeTime:(time.Time)2016-10-06 21:03:54.811603438 +0000 UTC}
2016-10-06 21:51:54.066000 [pkg_logger.go:147] D | sentinel: sentinelInfo: (*cluster.SentinelInfo){ID:(string)ce03d3cf ListenAddress:(string)172.18.10.25 Port:(string)6431}
2016-10-06 21:51:54.069850 [pkg_logger.go:147] D | sentinel: using store discovery
logger: write failed: Connection refused
2016-10-06 21:51:54.073252 [pkg_logger.go:147] D | sentinel: keepersDiscoveryInfo: (cluster.KeepersDiscoveryInfo)[<*>{ListenAddress:(string)172.18.70.21 Port:(string)5431} <*>{ListenAddress:(string)172.18.10.20 Port:(string)5431}]
2016-10-06 21:51:54.074331 [pkg_logger.go:147] D | sentinel: keepersInfo: (cluster.KeepersInfo)map[5f2b7590:<*>{ID:(string)5f2b7590 ClusterViewVersion:(int)4 ListenAddress:(string)172.18.10.20 Port:(string)5431 PGListenAddress:(string)172.18.10.20 PGPort:(string)5432} 417cadde:<*>{ID:(string)417cadde ClusterViewVersion:(int)4 ListenAddress:(string)172.18.70.21 Port:(string)5431 PGListenAddress:(string)172.18.70.21 PGPort:(string)5432}]
2016-10-06 21:51:54.075296 [pkg_logger.go:147] D | sentinel: keepersPGState: (map[string]*cluster.PostgresState)map[5f2b7590:<*>{Initialized:(bool)true Role:(common.Role)master SystemID:(string)6338449658295033882 TimelineID:(uint64)1 XLogPos:(uint64)23757144 TimelinesHistory:(cluster.PostgresTimeLinesHistory)[]} 417cadde:<*>{Initialized:(bool)true Role:(common.Role)master SystemID:(string)6338449635358691354 TimelineID:(uint64)1 XLogPos:(uint64)50343784 TimelinesHistory:(cluster.PostgresTimeLinesHistory)[]}]
2016-10-06 21:51:54.075460 [pkg_logger.go:147] D | sentinel: newKeepersState: (cluster.KeepersState)map[5f2b7590:<*>{ID:(string)5f2b7590 ErrorStartTime:(time.Time)0001-01-01 00:00:00 +0000 UTC Healthy:(bool)true ClusterViewVersion:(int)4 ListenAddress:(string)172.18.10.20 Port:(string)5431 PGListenAddress:(string)172.18.10.20 PGPort:(string)5432 PGState:(*cluster.PostgresState){Initialized:(bool)true Role:(common.Role)master SystemID:(string)6338449658295033882 TimelineID:(uint64)1 XLogPos:(uint64)23757144 TimelinesHistory:(cluster.PostgresTimeLinesHistory)[]}} 31966b92:<*>{ID:(string)31966b92 ErrorStartTime:(time.Time)2016-10-06 21:03:29.698090049 +0000 UTC Healthy:(bool)false ClusterViewVersion:(int)3 ListenAddress:(string)172.18.10.22 Port:(string)5431 PGListenAddress:(string)172.18.10.22 PGPort:(string)5432 PGState:(*cluster.PostgresState){Initialized:(bool)true Role:(common.Role)master SystemID:(string)6338449635358691354 TimelineID:(uint64)1 XLogPos:(uint64)50341680 TimelinesHistory:(cluster.PostgresTimeLinesHistory)[]}} 417cadde:<*>{ID:(string)417cadde ErrorStartTime:(time.Time)0001-01-01 00:00:00 +0000 UTC Healthy:(bool)true ClusterViewVersion:(int)4 ListenAddress...
@sgotti I should add that my stolon cluster has 1 proxy node, 3 keeper nodes and 1 sentinel node.
One other comment: this FREQUENTLY happens when we perform an upgrade, and our existing cluster gets replaced with new nodes - the underlying volumes (which contain the database files) are reattached when the new nodes come up.
2016-10-06 21:42:58.121831 [pkg_logger.go:147] D | keeper: got signal: interrupt
Well, this entry (which should be moved to info level) means that something sent an interrupt signal (SIGINT) to the keeper.
Since you said this is running inside k8s, the pod/container is probably being restarted by the kubelet. I don't know how your k8s deployment is defined, but you should look at the k8s logs/events to understand why it's restarting the container (e.g. do you have probes on the keeper pods?).
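To illustrate the probe scenario: a liveness probe that trips too early can cause exactly this restart loop. The fragment below is a hypothetical pod spec (container name, port, and thresholds are placeholders, not taken from this deployment); if the probe fails, for instance because Postgres is not yet accepting connections within the allowed window, the kubelet kills and restarts the container:

```yaml
# Hypothetical fragment of a keeper pod spec.
containers:
  - name: stolon-keeper
    livenessProbe:
      tcpSocket:
        port: 5432            # placeholder: the keeper's Postgres port
      initialDelaySeconds: 10 # may be too short for a recovering keeper
      periodSeconds: 10
      failureThreshold: 3     # after 3 failures the container is restarted
```

`kubectl describe pod` on the affected keeper would show the restart count and the reason for the last termination.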
@wchrisjohnson I don't think this is a stolon problem, since the logs show that something is sending a termination signal to the keeper. Closing for the moment. Feel free to reopen if you have additional information.
We have a stolon cluster that gets deployed within our environment, which sits on top of Kubernetes. Quite frequently, when we deploy our cluster, two of the three keepers come up, join the cluster, and work fine. On the third node, the keeper keeps rebooting over and over ...