sorintlab / stolon

PostgreSQL cloud native High Availability and more.
https://talk.stolon.io
Apache License 2.0

3 keeper cluster, one dead keeper not restarted #153

Closed - wchrisjohnson closed this issue 8 years ago

wchrisjohnson commented 8 years ago

We have a stolon cluster that gets deployed within our environment, which sits on top of Kubernetes. Quite frequently, when we deploy our cluster, two of the three keepers come up, join the cluster, and work fine. The third node:

2016-10-06 16:14:09.818702 [keeper.go:1049] I | keeper: id: f2530f38
2016-10-06 16:14:09.818729 [keeper.go:1052] I | keeper: running under kubernetes.
2016-10-06 16:14:09.845721 [postgresql.go:177] I | postgresql: Stopping database
2016-10-06 16:15:09.915870 [postgresql.go:177] I | postgresql: Stopping database

The keeper keeps rebooting over and over ...

sgotti commented 8 years ago

@wchrisjohnson The first Stopping database log entry is expected at keeper start, but since there are no other info entries between them I'm not sure where the second one comes from.

Can you provide the keeper and sentinel logs with --debug enabled? Have you changed any options in the cluster config (besides init_with_multiple_keepers)?
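
A minimal sketch of how the requested logs could be collected under Kubernetes, assuming the keeper and sentinel run as pods; the pod names are placeholders, and --debug is simply appended to each component's existing container args:

# add --debug to the stolon-keeper / stolon-sentinel container args, redeploy, then:
kubectl logs <stolon-keeper-pod> > keeper-debug.log
kubectl logs <stolon-sentinel-pod> > sentinel-debug.log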

wchrisjohnson commented 8 years ago

I can and will provide the logs, @sgotti. init_with_multiple_keepers is the only option we have changed:

{
    "init_with_multiple_keepers": true
}
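
A sketch of how such a config change was typically applied with stolonctl on releases of that era; the cluster name and store backend are placeholders, and the config patch subcommand/flags are an assumption based on the docs of the time, so check them against your stolon version:

stolonctl --cluster-name <your-cluster> --store-backend etcd \
    config patch '{ "init_with_multiple_keepers": true }'
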
wchrisjohnson commented 8 years ago

It's not much more info, @sgotti, but here it is:

keeper log

2016-10-06 21:42:58.094763 [pkg_logger.go:114] W | keeper: password file /etc/secrets/replication permissions 01000000777 are too open. This file should only be readable to the user executing stolon! Continuing...
2016-10-06 21:42:58.094837 [pkg_logger.go:114] W | keeper: password file /etc/secrets/stolon permissions 01000000777 are too open. This file should only be readable to the user executing stolon! Continuing...
2016-10-06 21:42:58.094891 [pkg_logger.go:134] I | keeper: id: 31966b92
2016-10-06 21:42:58.094907 [pkg_logger.go:134] I | keeper: running under kubernetes.
2016-10-06 21:42:58.119022 [pkg_logger.go:147] D | keeper: clusterView: (*cluster.ClusterView){Version:(int)4 Master:(string)417cadde KeepersRole:(cluster.KeepersRole)map[31966b92:<*>{ID:(string)31966b92 Follow:(string)417cadde} 417cadde:<*>{ID:(string)417cadde Follow:(string)} 5f2b7590:<*>{ID:(string)5f2b7590 Follow:(string)417cadde}] ProxyConf:(*cluster.ProxyConf){Host:(string)172.18.70.21 Port:(string)5432} Config:(*cluster.NilConfig){RequestTimeout:(*cluster.Duration)<nil> SleepInterval:(*cluster.Duration)<nil> KeeperFailInterval:(*cluster.Duration)<nil> MaxStandbysPerSender:(*uint)<nil> SynchronousReplication:(*bool)<nil> InitWithMultipleKeepers:(*bool)true UsePGRewind:(*bool)<nil> PGParameters:(*map[string]string)<nil>} ChangeTime:(time.Time)2016-10-06 21:03:54.811603438 +0000 UTC}
2016-10-06 21:42:58.119060 [pkg_logger.go:147] D | keeper: clusterConfig: (*cluster.Config){RequestTimeout:(time.Duration)10s SleepInterval:(time.Duration)5s KeeperFailInterval:(time.Duration)20s MaxStandbysPerSender:(uint)3 SynchronousReplication:(bool)false InitWithMultipleKeepers:(bool)true UsePGRewind:(bool)false PGParameters:(map[string]string)map[]}
2016-10-06 21:42:58.119215 [pkg_logger.go:134] I | postgresql: Stopping database
2016-10-06 21:42:58.121831 [pkg_logger.go:147] D | keeper: got signal: interrupt
2016-10-06 21:43:58.188227 [pkg_logger.go:147] D | keeper: stopping stolon keeper
2016-10-06 21:43:58.188338 [pkg_logger.go:134] I | postgresql: Stopping database

Sentinel Log

2016-10-06 21:51:44.040097 [pkg_logger.go:147] D | sentinel: newcv: (*cluster.ClusterView){Version:(int)4 Master:(string)417cadde KeepersRole:(cluster.KeepersRole)map[31966b92:<*>{ID:(string)31966b92 Follow:(string)417cadde} 417cadde:<*>{ID:(string)417cadde Follow:(string)} 5f2b7590:<*>{ID:(string)5f2b7590 Follow:(string)417cadde}] ProxyConf:(*cluster.ProxyConf){Host:(string)172.18.70.21 Port:(string)5432} Config:(*cluster.NilConfig){RequestTimeout:(*cluster.Duration)<nil> SleepInterval:(*cluster.Duration)<nil> KeeperFailInterval:(*cluster.Duration)<nil> MaxStandbysPerSender:(*uint)<nil> SynchronousReplication:(*bool)<nil> InitWithMultipleKeepers:(*bool)true UsePGRewind:(*bool)<nil> PGParameters:(*map[string]string)<nil>} ChangeTime:(time.Time)2016-10-06 21:03:54.811603438 +0000 UTC}
2016-10-06 21:51:49.048418 [pkg_logger.go:147] D | sentinel: keepersState: (cluster.KeepersState)map[5f2b7590:<*>{ID:(string)5f2b7590 ErrorStartTime:(time.Time)0001-01-01 00:00:00 +0000 UTC Healthy:(bool)true ClusterViewVersion:(int)4 ListenAddress:(string)172.18.10.20 Port:(string)5431 PGListenAddress:(string)172.18.10.20 PGPort:(string)5432 PGState:(*cluster.PostgresState){Initialized:(bool)true Role:(common.Role)master SystemID:(string)6338449658295033882 TimelineID:(uint64)1 XLogPos:(uint64)23757144 TimelinesHistory:(cluster.PostgresTimeLinesHistory)[]}} 31966b92:<*>{ID:(string)31966b92 ErrorStartTime:(time.Time)2016-10-06 21:03:29.698090049 +0000 UTC Healthy:(bool)false ClusterViewVersion:(int)3 ListenAddress:(string)172.18.10.22 Port:(string)5431 PGListenAddress:(string)172.18.10.22 PGPort:(string)5432 PGState:(*cluster.PostgresState){Initialized:(bool)true Role:(common.Role)master SystemID:(string)6338449635358691354 TimelineID:(uint64)1 XLogPos:(uint64)50341680 TimelinesHistory:(cluster.PostgresTimeLinesHistory)[]}} 417cadde:<*>{ID:(string)417cadde ErrorStartTime:(time.Time)0001-01-01 00:00:00 +0000 UTC Healthy:(bool)true ClusterViewVersion:(int)4 ListenAddress:(string)172.18.70.21 Port:(string)5431 PGListenAddress:(string)172.18.70.21 PGPort:(string)5432 PGState:(*cluster.PostgresState){Initialized:(bool)true Role:(common.Role)master SystemID:(string)6338449635358691354 TimelineID:(uint64)1 XLogPos:(uint64)50343784 TimelinesHistory:(cluster.PostgresTimeLinesHistory)[]}}]
2016-10-06 21:51:49.048500 [pkg_logger.go:147] D | sentinel: clusterView: (*cluster.ClusterView){Version:(int)4 Master:(string)417cadde KeepersRole:(cluster.KeepersRole)map[31966b92:<*>{ID:(string)31966b92 Follow:(string)417cadde} 417cadde:<*>{ID:(string)417cadde Follow:(string)} 5f2b7590:<*>{ID:(string)5f2b7590 Follow:(string)417cadde}] ProxyConf:(*cluster.ProxyConf){Host:(string)172.18.70.21 Port:(string)5432} Config:(*cluster.NilConfig){RequestTimeout:(*cluster.Duration)<nil> SleepInterval:(*cluster.Duration)<nil> KeeperFailInterval:(*cluster.Duration)<nil> MaxStandbysPerSender:(*uint)<nil> SynchronousReplication:(*bool)<nil> InitWithMultipleKeepers:(*bool)true UsePGRewind:(*bool)<nil> PGParameters:(*map[string]string)<nil>} ChangeTime:(time.Time)2016-10-06 21:03:54.811603438 +0000 UTC}
2016-10-06 21:51:49.048551 [pkg_logger.go:147] D | sentinel: sentinelInfo: (*cluster.SentinelInfo){ID:(string)ce03d3cf ListenAddress:(string)172.18.10.25 Port:(string)6431}
2016-10-06 21:51:49.052108 [pkg_logger.go:147] D | sentinel: using store discovery
logger: write failed: Connection refused
2016-10-06 21:51:49.055872 [pkg_logger.go:147] D | sentinel: keepersDiscoveryInfo: (cluster.KeepersDiscoveryInfo)[<*>{ListenAddress:(string)172.18.70.21 Port:(string)5431} <*>{ListenAddress:(string)172.18.10.20 Port:(string)5431}]
2016-10-06 21:51:49.056867 [pkg_logger.go:147] D | sentinel: keepersInfo: (cluster.KeepersInfo)map[5f2b7590:<*>{ID:(string)5f2b7590 ClusterViewVersion:(int)4 ListenAddress:(string)172.18.10.20 Port:(string)5431 PGListenAddress:(string)172.18.10.20 PGPort:(string)5432} 417cadde:<*>{ID:(string)417cadde ClusterViewVersion:(int)4 ListenAddress:(string)172.18.70.21 Port:(string)5431 PGListenAddress:(string)172.18.70.21 PGPort:(string)5432}]
2016-10-06 21:51:49.057593 [pkg_logger.go:147] D | sentinel: keepersPGState: (map[string]*cluster.PostgresState)map[5f2b7590:<*>{Initialized:(bool)true Role:(common.Role)master SystemID:(string)6338449658295033882 TimelineID:(uint64)1 XLogPos:(uint64)23757144 TimelinesHistory:(cluster.PostgresTimeLinesHistory)[]} 417cadde:<*>{Initialized:(bool)true Role:(common.Role)master SystemID:(string)6338449635358691354 TimelineID:(uint64)1 XLogPos:(uint64)50343784 TimelinesHistory:(cluster.PostgresTimeLinesHistory)[]}]
2016-10-06 21:51:49.057693 [pkg_logger.go:147] D | sentinel: newKeepersState: (cluster.KeepersState)map[5f2b7590:<*>{ID:(string)5f2b7590 ErrorStartTime:(time.Time)0001-01-01 00:00:00 +0000 UTC Healthy:(bool)true ClusterViewVersion:(int)4 ListenAddress:(string)172.18.10.20 Port:(string)5431 PGListenAddress:(string)172.18.10.20 PGPort:(string)5432 PGState:(*cluster.PostgresState){Initialized:(bool)true Role:(common.Role)master SystemID:(string)6338449658295033882 TimelineID:(uint64)1 XLogPos:(uint64)23757144 TimelinesHistory:(cluster.PostgresTimeLinesHistory)[]}} 31966b92:<*>{ID:(string)31966b92 ErrorStartTime:(time.Time)2016-10-06 21:03:29.698090049 +0000 UTC Healthy:(bool)false ClusterViewVersion:(int)3 ListenAddress:(string)172.18.10.22 Port:(string)5431 PGListenAddress:(string)172.18.10.22 PGPort:(string)5432 PGState:(*cluster.PostgresState){Initialized:(bool)true Role:(common.Role)master SystemID:(string)6338449635358691354 TimelineID:(uint64)1 XLogPos:(uint64)50341680 TimelinesHistory:(cluster.PostgresTimeLinesHistory)[]}} 417cadde:<*>{ID:(string)417cadde ErrorStartTime:(time.Time)0001-01-01 00:00:00 +0000 UTC Healthy:(bool)true ClusterViewVersion:(int)4 ListenAddress:(string)172.18.70.21 Port:(string)5431 PGListenAddress:(string)172.18.70.21 PGPort:(string)5432 PGState:(*cluster.PostgresState){Initialized:(bool)true Role:(common.Role)master SystemID:(string)6338449635358691354 TimelineID:(uint64)1 XLogPos:(uint64)50343784 TimelinesHistory:(cluster.PostgresTimeLinesHistory)[]}}]
2016-10-06 21:51:49.057738 [pkg_logger.go:147] D | sentinel: masterState: (*cluster.KeeperState){ID:(string)417cadde ErrorStartTime:(time.Time)0001-01-01 00:00:00 +0000 UTC Healthy:(bool)true ClusterViewVersion:(int)4 ListenAddress:(string)172.18.70.21 Port:(string)5431 PGListenAddress:(string)172.18.70.21 PGPort:(string)5432 PGState:(*cluster.PostgresState){Initialized:(bool)true Role:(common.Role)master SystemID:(string)6338449635358691354 TimelineID:(uint64)1 XLogPos:(uint64)50343784 TimelinesHistory:(cluster.PostgresTimeLinesHistory)[]}}
2016-10-06 21:51:49.057837 [pkg_logger.go:147] D | sentinel: newcv: (*cluster.ClusterView){Version:(int)4 Master:(string)417cadde KeepersRole:(cluster.KeepersRole)map[417cadde:<*>{ID:(string)417cadde Follow:(string)} 5f2b7590:<*>{ID:(string)5f2b7590 Follow:(string)417cadde} 31966b92:<*>{ID:(string)31966b92 Follow:(string)417cadde}] ProxyConf:(*cluster.ProxyConf){Host:(string)172.18.70.21 Port:(string)5432} Config:(*cluster.NilConfig){RequestTimeout:(*cluster.Duration)<nil> SleepInterval:(*cluster.Duration)<nil> KeeperFailInterval:(*cluster.Duration)<nil> MaxStandbysPerSender:(*uint)<nil> SynchronousReplication:(*bool)<nil> InitWithMultipleKeepers:(*bool)true UsePGRewind:(*bool)<nil> PGParameters:(*map[string]string)<nil>} ChangeTime:(time.Time)2016-10-06 21:03:54.811603438 +0000 UTC}
2016-10-06 21:51:54.065882 [pkg_logger.go:147] D | sentinel: keepersState: (cluster.KeepersState)map[31966b92:<*>{ID:(string)31966b92 ErrorStartTime:(time.Time)2016-10-06 21:03:29.698090049 +0000 UTC Healthy:(bool)false ClusterViewVersion:(int)3 ListenAddress:(string)172.18.10.22 Port:(string)5431 PGListenAddress:(string)172.18.10.22 PGPort:(string)5432 PGState:(*cluster.PostgresState){Initialized:(bool)true Role:(common.Role)master SystemID:(string)6338449635358691354 TimelineID:(uint64)1 XLogPos:(uint64)50341680 TimelinesHistory:(cluster.PostgresTimeLinesHistory)[]}} 417cadde:<*>{ID:(string)417cadde ErrorStartTime:(time.Time)0001-01-01 00:00:00 +0000 UTC Healthy:(bool)true ClusterViewVersion:(int)4 ListenAddress:(string)172.18.70.21 Port:(string)5431 PGListenAddress:(string)172.18.70.21 PGPort:(string)5432 PGState:(*cluster.PostgresState){Initialized:(bool)true Role:(common.Role)master SystemID:(string)6338449635358691354 TimelineID:(uint64)1 XLogPos:(uint64)50343784 TimelinesHistory:(cluster.PostgresTimeLinesHistory)[]}} 5f2b7590:<*>{ID:(string)5f2b7590 ErrorStartTime:(time.Time)0001-01-01 00:00:00 +0000 UTC Healthy:(bool)true ClusterViewVersion:(int)4 ListenAddress:(string)172.18.10.20 Port:(string)5431 PGListenAddress:(string)172.18.10.20 PGPort:(string)5432 PGState:(*cluster.PostgresState){Initialized:(bool)true Role:(common.Role)master SystemID:(string)6338449658295033882 TimelineID:(uint64)1 XLogPos:(uint64)23757144 TimelinesHistory:(cluster.PostgresTimeLinesHistory)[]}}]
2016-10-06 21:51:54.065960 [pkg_logger.go:147] D | sentinel: clusterView: (*cluster.ClusterView){Version:(int)4 Master:(string)417cadde KeepersRole:(cluster.KeepersRole)map[31966b92:<*>{ID:(string)31966b92 Follow:(string)417cadde} 417cadde:<*>{ID:(string)417cadde Follow:(string)} 5f2b7590:<*>{ID:(string)5f2b7590 Follow:(string)417cadde}] ProxyConf:(*cluster.ProxyConf){Host:(string)172.18.70.21 Port:(string)5432} Config:(*cluster.NilConfig){RequestTimeout:(*cluster.Duration)<nil> SleepInterval:(*cluster.Duration)<nil> KeeperFailInterval:(*cluster.Duration)<nil> MaxStandbysPerSender:(*uint)<nil> SynchronousReplication:(*bool)<nil> InitWithMultipleKeepers:(*bool)true UsePGRewind:(*bool)<nil> PGParameters:(*map[string]string)<nil>} ChangeTime:(time.Time)2016-10-06 21:03:54.811603438 +0000 UTC}
2016-10-06 21:51:54.066000 [pkg_logger.go:147] D | sentinel: sentinelInfo: (*cluster.SentinelInfo){ID:(string)ce03d3cf ListenAddress:(string)172.18.10.25 Port:(string)6431}
2016-10-06 21:51:54.069850 [pkg_logger.go:147] D | sentinel: using store discovery
logger: write failed: Connection refused
2016-10-06 21:51:54.073252 [pkg_logger.go:147] D | sentinel: keepersDiscoveryInfo: (cluster.KeepersDiscoveryInfo)[<*>{ListenAddress:(string)172.18.70.21 Port:(string)5431} <*>{ListenAddress:(string)172.18.10.20 Port:(string)5431}]
2016-10-06 21:51:54.074331 [pkg_logger.go:147] D | sentinel: keepersInfo: (cluster.KeepersInfo)map[5f2b7590:<*>{ID:(string)5f2b7590 ClusterViewVersion:(int)4 ListenAddress:(string)172.18.10.20 Port:(string)5431 PGListenAddress:(string)172.18.10.20 PGPort:(string)5432} 417cadde:<*>{ID:(string)417cadde ClusterViewVersion:(int)4 ListenAddress:(string)172.18.70.21 Port:(string)5431 PGListenAddress:(string)172.18.70.21 PGPort:(string)5432}]
2016-10-06 21:51:54.075296 [pkg_logger.go:147] D | sentinel: keepersPGState: (map[string]*cluster.PostgresState)map[5f2b7590:<*>{Initialized:(bool)true Role:(common.Role)master SystemID:(string)6338449658295033882 TimelineID:(uint64)1 XLogPos:(uint64)23757144 TimelinesHistory:(cluster.PostgresTimeLinesHistory)[]} 417cadde:<*>{Initialized:(bool)true Role:(common.Role)master SystemID:(string)6338449635358691354 TimelineID:(uint64)1 XLogPos:(uint64)50343784 TimelinesHistory:(cluster.PostgresTimeLinesHistory)[]}]
2016-10-06 21:51:54.075460 [pkg_logger.go:147] D | sentinel: newKeepersState: (cluster.KeepersState)map[5f2b7590:<*>{ID:(string)5f2b7590 ErrorStartTime:(time.Time)0001-01-01 00:00:00 +0000 UTC Healthy:(bool)true ClusterViewVersion:(int)4 ListenAddress:(string)172.18.10.20 Port:(string)5431 PGListenAddress:(string)172.18.10.20 PGPort:(string)5432 PGState:(*cluster.PostgresState){Initialized:(bool)true Role:(common.Role)master SystemID:(string)6338449658295033882 TimelineID:(uint64)1 XLogPos:(uint64)23757144 TimelinesHistory:(cluster.PostgresTimeLinesHistory)[]}} 31966b92:<*>{ID:(string)31966b92 ErrorStartTime:(time.Time)2016-10-06 21:03:29.698090049 +0000 UTC Healthy:(bool)false ClusterViewVersion:(int)3 ListenAddress:(string)172.18.10.22 Port:(string)5431 PGListenAddress:(string)172.18.10.22 PGPort:(string)5432 PGState:(*cluster.PostgresState){Initialized:(bool)true Role:(common.Role)master SystemID:(string)6338449635358691354 TimelineID:(uint64)1 XLogPos:(uint64)50341680 TimelinesHistory:(cluster.PostgresTimeLinesHistory)[]}} 417cadde:<*>{ID:(string)417cadde ErrorStartTime:(time.Time)0001-01-01 00:00:00 +0000 UTC Healthy:(bool)true ClusterViewVersion:(int)4 ListenAddress...
wchrisjohnson commented 8 years ago

@sgotti I should add that my stolon cluster has 1 proxy node, 3 keeper nodes and 1 sentinel node.

wchrisjohnson commented 8 years ago

One other comment: this FREQUENTLY happens when we perform an upgrade and our existing cluster gets replaced with new nodes - the underlying volumes (which contain the database files) are reattached when the new nodes come up.

sgotti commented 8 years ago

2016-10-06 21:42:58.121831 [pkg_logger.go:147] D | keeper: got signal: interrupt

Well, this entry (which should be moved to info level) means that something sent a SIGTERM to the keeper.

Since you said this is running inside k8s, the pod/container is probably being restarted by the kubelet. I don't know how your k8s deployment is defined, but you should look at the k8s logs/events to understand why it's restarting the container (e.g. do you have probes on the keeper pods?).
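
A minimal sketch of how that could be checked from the Kubernetes side; the pod name is a placeholder:

# why was the container killed? (see the Events section and Last State)
kubectl describe pod <stolon-keeper-pod>
# recent events, e.g. failed liveness probes or OOM kills
kubectl get events --sort-by=.metadata.creationTimestamp
# logs from the previous (terminated) container instance
kubectl logs <stolon-keeper-pod> --previous

A liveness probe whose timeout is shorter than the keeper's startup/recovery time would produce exactly this kind of restart loop.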

sgotti commented 8 years ago

@wchrisjohnson I don't think this is a stolon problem, since the logs show that something is sending a SIGTERM to the keeper. Closing for the moment. Feel free to reopen if you have additional information.