4090 port can't be opened after pod restart

like-inspur commented 4 years ago

I installed influxdb-srelay and syncflux alongside influxdb on two hosts using a StatefulSet. When I change the config and the pod restarts, port 4090 of syncflux can't be opened. The syncflux log looks like this:

root@mgt01:~# kubectl log influxdb-0 -n monitoring syncflux

log is DEPRECATED and will be removed in a future version. Use logs instead.
time="2020-09-03 07:33:41" level=info msg="CFG :&{General:{InstanceID: LogDir: HomeDir: DataDir: LogLevel:info SyncMode:onlyslave CheckInterval:10s MinSyncInterval:20s MasterDB:influxdb01 SlaveDB:influxdb02 InitialReplication:both MonitorRetryInterval:30s DataChunkDuration:5m0s MaxRetentionInterval:8760h0m0s RWMaxRetries:5 RWRetryDelay:10s NumWorkers:4 MaxPointsOnSingleWrite:20000} HTTP:{BindAddr:0.0.0.0:4090 AdminUser: AdminPassword: CookieID:} InfluxArray:[0xc0001cd0e0 0xc0001cd200]}"
time="2020-09-03 07:33:41" level=info msg="Set Master DB influxdb01 from Command Line parameters"
time="2020-09-03 07:33:41" level=info msg="Set Slave DB influxdb02 from Command Line parameters"
time="2020-09-03 07:33:41" level=info msg="Set log level to  info from Config File"
time="2020-09-03 07:33:41" level=info msg="Set Default directories : \n   - Exec: \n   - Config: conf\n   -Logs: log\n"
time="2020-09-03 07:33:41" level=info msg="Initializing cluster"
time="2020-09-03 07:33:41" level=info msg="Found MasterDB[influxdb01] in config File &{Release:1x Name:influxdb01 Location:http://influxdb-0.influxdb-svc:8086/ AdminUser: AdminPasswd: Timeout:10s}"
time="2020-09-03 07:33:41" level=error msg="Fail to build newclient to database http://influxdb-0.influxdb-svc:8086/, error: Get http://influxdb-0.influxdb-svc:8086/ping?wait_for_leader=10s: dial tcp: lookup influxdb-0.influxdb-svc on 100.105.0.3:53: server misbehaving\n"
time="2020-09-03 07:33:41" level=error msg="MasterDB[influxdb01] has  problems :Get http://influxdb-0.influxdb-svc:8086/ping?wait_for_leader=10s: dial tcp: lookup influxdb-0.influxdb-svc on 100.105.0.3:53: server misbehaving"
time="2020-09-03 07:33:41" level=info msg="Found SlaveDB[influxdb02] in config File &{Release:1x Name:influxdb02 Location:http://influxdb-1.influxdb-svc:8086/ AdminUser: AdminPasswd: Timeout:10s}"
time="2020-09-03 07:33:41" level=error msg="Master DB is not runing I should wait until both up to begin to chek sync status"
time="2020-09-03 07:34:11" level=info msg="Found MasterDB[influxdb01] in config File &{Release:1x Name:influxdb01 Location:http://influxdb-0.influxdb-svc:8086/ AdminUser: AdminPasswd: Timeout:10s}"
time="2020-09-03 07:34:11" level=error msg="Fail to build newclient to database http://influxdb-0.influxdb-svc:8086/, error: Get http://influxdb-0.influxdb-svc:8086/ping?wait_for_leader=10s: dial tcp: lookup influxdb-0.influxdb-svc on 100.105.0.3:53: server misbehaving\n"
time="2020-09-03 07:34:11" level=error msg="MasterDB[influxdb01] has  problems :Get http://influxdb-0.influxdb-svc:8086/ping?wait_for_leader=10s: dial tcp: lookup influxdb-0.influxdb-svc on 100.105.0.3:53: server misbehaving"
time="2020-09-03 07:34:11" level=info msg="Found SlaveDB[influxdb02] in config File &{Release:1x Name:influxdb02 Location:http://influxdb-1.influxdb-svc:8086/ AdminUser: AdminPasswd: Timeout:10s}"
time="2020-09-03 07:34:11" level=error msg="Master DB is not runing I should wait until both up to begin to chek sync status"
time="2020-09-03 07:34:41" level=info msg="Found MasterDB[influxdb01] in config File &{Release:1x Name:influxdb01 Location:http://influxdb-0.influxdb-svc:8086/ AdminUser: AdminPasswd: Timeout:10s}"
time="2020-09-03 07:34:41" level=info msg="Found SlaveDB[influxdb02] in config File &{Release:1x Name:influxdb02 Location:http://influxdb-1.influxdb-svc:8086/ AdminUser: AdminPasswd: Timeout:10s}"
time="2020-09-03 07:36:27" level=info msg="Replicating DB Schema from Master to Slave"
time="2020-09-03 07:36:27" level=info msg="Replicating DATA Schema from Master to Slave"
time="2020-09-03 07:36:27" level=info msg="Replicating Data from DB prometheus RP autogen...."

like-inspur commented 4 years ago

It wasn't until today that I found port 4090 only starts listening after all the DB data has been processed:

time="2020-09-03 23:53:25" level=info msg="Processed DB data from influxdb01[prometheus|autogen] to influxdb02[prometheus|autogen] has done  #Points (1413407956)  Took [16h16m57.381408237s] ERRORS [0]!\n"
time="2020-09-03 23:53:25" level=info msg="Beginning Monitoring process  process for influxdb influxdb02 | http://influxdb-1.influxdb-svc:8086/"
time="2020-09-03 23:53:25" level=info msg="Beginning Monitoring process  process for influxdb influxdb01 | http://influxdb-0.influxdb-svc:8086/"
time="2020-09-03 23:53:25" level=info msg="InfluxMonitor: InfluxDB : influxdb02  OK (Version  1.8.0 : Duration 3.997799ms )"
time="2020-09-03 23:53:25" level=info msg="InfluxMonitor: InfluxDB : influxdb01  OK (Version  1.8.0 : Duration 5.704578ms )"
time="2020-09-03 23:53:35" level=info msg="Beginning Supervision process  process each 20s "
time="2020-09-03 23:53:35" level=info msg="HACluster check...."
time="2020-09-03 23:53:35" level=info msg="Server is running on :0.0.0.0:4090..."
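
The timestamps above suggest that the HTTP listener on 4090 only comes up once the initial replication has finished. A minimal sketch for checking from outside whether the port is accepting connections yet, assuming plain TCP reachability is a good enough signal. The function names, host and retry values are illustrative and not part of syncflux:

```python
import socket
import time


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and DNS lookup failures.
        return False


def wait_for_port(host: str, port: int, attempts: int = 30,
                  delay: float = 10.0, sleep=time.sleep) -> bool:
    """Probe host:port up to `attempts` times, sleeping `delay` seconds between tries."""
    for _ in range(attempts):
        if port_is_open(host, port):
            return True
        sleep(delay)
    return False


if __name__ == "__main__":
    # Hypothetical host name; replace with the pod or service that runs syncflux.
    host = "influxdb-0.influxdb-svc"
    print("4090 open:", wait_for_port(host, 4090, attempts=3, delay=5.0))
```

With initial-replication set to both, such a probe may keep failing for as long as the replication takes (over 16 hours in the log above).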

toni-moreno commented 4 years ago

Hello @like-inspur, I can't understand where the issue is. Can you please give us more context?

Can you tell us how you are working with srelay/syncflux/influxdb? (How many srelay/syncflux instances do you run with your influxdb cluster/s?)
Was the data in your influxdb cluster synced correctly before the configuration change?
Did you set the initial-replication parameter in hamonitor mode to something other than none? ( https://github.com/toni-moreno/syncflux/blob/master/conf/sample.syncflux.toml#L69-L81 ) If yes, why?

I hope to be able to help you once we have this extra info.

like-inspur commented 4 years ago

1. I built an influxdb cluster with two pods; each pod contains influxdb, srelay and syncflux containers.
2. Yes, I used api/health to check that the influxdb cluster data was synced. I just wanted to see how the cluster behaves when one pod restarts.
3. I set initial-replication to both, because I think that if one influxdb loses some data, this can help recover it.

toni-moreno commented 4 years ago

Hello @like-inspur

As described in https://github.com/toni-moreno/syncflux/issues/38, in hamonitor mode the best way to work is to assume an already synced initial state. If in doubt, you can sync once with an external syncflux process.

I recommend initial-replication=none when working in hamonitor mode

like-inspur commented 4 years ago

Suppose one influxdb was down for a while. When it starts again, how can the other influxdb sync the data for that period to it, and how can I ensure that external calls to the influxdb cluster are not routed to the influxdb that is still recovering data?

toni-moreno commented 4 years ago

I suggest checking the cluster state prior to any restart, and waiting if the cluster is not fully OK. You can do this by checking both nodes: http://node1:4090/api/health and http://node2:4090/api/health
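
A minimal sketch of that pre-restart check, assuming only that a reachable syncflux answers /api/health with HTTP 200. The response body format is not described in this thread, so the script just prints it for manual inspection; the function names are illustrative, and node1/node2 are the placeholder hosts from the comment above:

```python
import urllib.error
import urllib.request


def fetch_health(url: str, timeout: float = 5.0):
    """Return (status_code, body), or (None, error text) if the node is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status.
        return exc.code, str(exc)
    except OSError as exc:
        # URLError, timeouts and refused connections all derive from OSError.
        return None, str(exc)


def safe_to_restart(urls, fetch=fetch_health):
    """Return (ok, results); ok is True only if every endpoint answered with HTTP 200."""
    results = {url: fetch(url) for url in urls}
    ok = all(status == 200 for status, _ in results.values())
    return ok, results


if __name__ == "__main__":
    endpoints = [
        "http://node1:4090/api/health",
        "http://node2:4090/api/health",
    ]
    ok, results = safe_to_restart(endpoints)
    for url, (status, body) in results.items():
        print(url, status, body)
    print("all endpoints reachable:", ok)
```

An HTTP 200 here only shows that syncflux is answering; whether the cluster is fully synced still has to be read from the returned body.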

like-inspur commented 4 years ago

If I do that, I won't be able to ensure that when one influxdb goes down, the other can still provide service. And when the dead one comes back up, it can recover data from the healthy one. Once the dead one has entirely recovered, it can be added back to the influxdb cluster to provide service.

toni-moreno commented 4 years ago

@like-inspur syncflux/influxdb-srelay has been designed to keep running while influxdb may (accidentally or not) be restarted. So I think the mistake is running 3 different services together in the same pod. Do you agree?

like-inspur commented 4 years ago

But syncflux and influxdb-srelay can also die, so I put the 3 containers into the same pod for convenience of cluster deployment, and configured each one so that it is the master and the other is the slave.

toni-moreno commented 4 years ago

Hello @like-inspur, I suggest running influxdb-srelay and influxdb in the same pod and syncflux as an independent pod. In this case the proposed tool will do its work well, and you will be able to change the db and relay config when you need to.

toni-moreno / syncflux

4090 port can't be opened after pod restart #42