toni-moreno / syncflux

SyncFlux is an Open Source InfluxDB Data synchronization and replication tool for migration purposes or HA clusters
MIT License

Fail to get response from query select * (and long pauses) #28

Closed kkruzich closed 4 years ago

kkruzich commented 5 years ago

I'm using your syncflux tool (https://github.com/toni-moreno/syncflux) to get a full, up-to-date copy of a fairly large DB (300 GB). While I have some questions about the use case, I'm more immediately concerned about these errors. I'm running 'syncflux -action fullcopy'. I don't have any options set in my syncflux.toml configuration (only the servers are defined), and the Influx configuration on both master and slave is the default.

In running syncflux I'll see some data written to the receiving side, but then everything pauses for 30 seconds or more.

PLEASE SEE LOG IN RECENT COMMENT

I see the following in the logs:

time="2019-10-09 14:50:57" level=info msg="CFG :&{General:{InstanceID: LogDir:./log HomeDir: DataDir: LogLevel:debug SyncMode:onlyslave CheckInterval:10s MinSyncInterval:20s MasterDB:influxdb01 SlaveDB:influxdb02 InitialReplication:none MonitorRetryInterval:1m0s DataChunkDuration:5m0s MaxRetentionInterval:8760h0m0s RWMaxRetries:5 RWRetryDelay:10s NumWorkers:4 MaxPointsOnSingleWrite:20000} HTTP:{BindAddr:127.0.0.1:4090 AdminUser:admin AdminPassword:admin CookieID:mysupercokie} InfluxArray:[0xc00006fec0 0xc00006ff80]}"
time="2019-10-09 14:50:57" level=info msg="Set Master DB influxdb01 from Command Line parameters"
time="2019-10-09 14:50:57" level=info msg="Set Slave DB influxdb02 from Command Line parameters"
time="2019-10-09 14:51:09" level=warning msg="Fail to get response from query select from \"vsphere_host_sys\" where time > 1570657557s and time < 1570657857s group by on [telegraf|autogen] in attempt 1 / read database error: "
time="2019-10-09 14:51:09" level=warning msg="Trying again... in 10s sec"
time="2019-10-09 14:51:29" level=warning msg="Fail to get response from query select from \"vsphere_host_sys\" where time > 1570657557s and time < 1570657857s group by on [telegraf|autogen] in attempt 2 / read database error: "
time="2019-10-09 14:51:29" level=warning msg="Trying again... in 10s sec"
time="2019-10-09 14:51:49" level=warning msg="Fail to get response from query select from \"vsphere_host_sys\" where time > 1570657557s and time < 1570657857s group by on [telegraf|autogen] in attempt 3 / read database error: "
time="2019-10-09 14:51:49" level=warning msg="Trying again... in 10s sec"
time="2019-10-09 14:52:09" level=warning msg="Fail to get response from query select from \"vsphere_host_sys\" where time > 1570657557s and time < 1570657857s group by on [telegraf|autogen] in attempt 4 / read database error: "
time="2019-10-09 14:52:09" level=warning msg="Trying again... in 10s sec"
time="2019-10-09 14:52:29" level=warning msg="Fail to get response from query select from \"vsphere_host_sys\" where time > 1570657557s and time < 1570657857s group by on [telegraf|autogen] in attempt 5 / read database error: "
time="2019-10-09 14:52:29" level=warning msg="Trying again... in 10s sec"
time="2019-10-09 14:52:39" level=error msg="Max Retries (5) exceeded on read Data: Last error "
time="2019-10-09 14:52:39" level=error msg="error in read DB telegraf | Measurement vsphere_host_sys | ERR: "
time="2019-10-09 14:52:39" level=warning msg="Initializing Recovery for 1 chunks"
time="2019-10-09 14:52:39" level=warning msg="Recovery for Bad Chunk 1/1 from [1570657557][2019-10-09 14:45:57 -0700 PDT] to [1570657857][2019-10-09 14:50:57 -0700 PDT] (88763) Points Took [1m40.027365462s] ERRORS[R:1|W:0]"


10.41.86.23 - admin [09/Oct/2019:22:02:50 +0000] "POST /write?consistency=&db=telegraf&precision=ns&rp=autogen HTTP/1.1" 204 0 "-" "InfluxDBClient" 891a12ff-eae0-11e9-8205-02cdb5175738 107
10.41.86.23 - admin [09/Oct/2019:22:02:50 +0000] "POST /write?consistency=&db=telegraf&precision=ns&rp=autogen HTTP/1.1" 204 0 "-" "InfluxDBClient" 893e597a-eae0-11e9-8206-02cdb5175738 59
10.41.86.23 - admin [09/Oct/2019:22:02:50 +0000] "POST /write?consistency=&db=telegraf&precision=ns&rp=autogen HTTP/1.1" 204 0 "-" "InfluxDBClient" 8948c0a6-eae0-11e9-8207-02cdb5175738 50
10.41.86.23 - admin [09/Oct/2019:22:02:50 +0000] "POST /write?consistency=&db=telegraf&precision=ns&rp=autogen HTTP/1.1" 204 0 "-" "InfluxDBClient" 89534a09-eae0-11e9-8208-02cdb5175738 44
10.41.86.23 - admin [09/Oct/2019:22:02:50 +0000] "POST /write?consistency=&db=telegraf&precision=ns&rp=autogen HTTP/1.1" 204 0 "-" "InfluxDBClient" 8959e0c7-eae0-11e9-8209-02cdb5175738 35

kkruzich commented 5 years ago

I'm attaching a more recent run of syncflux here. This may be a better example of the issues I'm seeing. syncflux-error.txt

toni-moreno commented 5 years ago

Hi @kkruzich, on big databases a "select" can take a long time.

Make sure your HTTP timeout is not set too low:

https://github.com/toni-moreno/syncflux/blob/0cb20ae784fd5ec8b7c69225782c41a0ad0d0644/conf/sample.syncflux.toml#L152
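
The timestamps in the log above are consistent with a timeout problem: the failed read attempts are 20 seconds apart, and the recovery line reports about 1m40s for the chunk. A minimal sketch of that arithmetic follows. It assumes each read blocks for a 10-second HTTP timeout before failing (an assumption; the configured timeout does not appear in the log), then waits the logged retry delay (RWRetryDelay:10s) for each of the logged RWMaxRetries:5 attempts. The function name is illustrative only.

```python
def total_retry_seconds(max_retries: int, timeout_s: float, retry_delay_s: float) -> float:
    """Time spent on one chunk when every read attempt times out.

    Each attempt costs one timeout (waiting for the query to fail)
    plus one retry delay (the "Trying again... in 10s" pause).
    """
    return max_retries * (timeout_s + retry_delay_s)


# RWMaxRetries:5 and RWRetryDelay:10s come from the logged CFG line;
# timeout_s=10 is an ASSUMED value, chosen to match the 20 s spacing
# between the "Fail to get response" warnings.
elapsed = total_retry_seconds(max_retries=5, timeout_s=10, retry_delay_s=10)
print(f"{elapsed:.0f} s per failed chunk")
```

Under these assumptions the result is 100 seconds, in line with the "Took [1m40.027365462s]" reported for the bad chunk, which suggests the query is being cut off by the timeout rather than failing on the database side.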

kkruzich commented 5 years ago

Thank you. This advice was very helpful. I was able to keep syncflux running for an extended period of time. However, I'm still having some trouble: syncflux seems to stall after some time, and there is nothing in the logs. It isn't clear whether resources were exhausted on either syncflux or the receiving DB. My setting data-chunk-duration = "5m" seems very low. Would increasing it make for a faster transfer?

If duplicate data is sent, is the data on the receiving side overwritten or is there a failure?

And lastly, as these are usage questions, is there a better place to be asking?

Thank you!

toni-moreno commented 5 years ago

Hi @kkruzich, be careful with the configuration when doing big database copies. Please read the "Important Notes" in this document (https://github.com/toni-moreno/syncflux#run-as-a-database-replication-tool).

Increasing data-chunk-duration could increase the transfer rate, but it could also require more memory on both sides: the primary DB and the syncflux process.
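
To make the trade-off concrete, here is a rough sketch of how the chunk duration determines the number of read queries per measurement. It is not syncflux's actual implementation; the helper names are hypothetical, and only the values 5m (DataChunkDuration) and 8760h (MaxRetentionInterval) are taken from the logged configuration.

```python
from datetime import timedelta
from typing import Iterator, Tuple


def chunk_count(total: timedelta, chunk: timedelta) -> int:
    """Number of chunks needed to cover `total` (ceiling division)."""
    return -(-total // chunk)


def chunk_windows(start_s: int, end_s: int, chunk_s: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) epoch-second windows covering [start_s, end_s)."""
    t = start_s
    while t < end_s:
        yield t, min(t + chunk_s, end_s)
        t += chunk_s


retention = timedelta(hours=8760)  # MaxRetentionInterval from the logged config

# Fewer, larger chunks mean fewer queries, but each query returns more
# points that have to be held in memory at once.
print(chunk_count(retention, timedelta(minutes=5)))  # queries per measurement at 5m
print(chunk_count(retention, timedelta(hours=1)))    # queries per measurement at 1h

# The failing query in the log covers exactly one 5-minute window.
print(list(chunk_windows(1570657557, 1570657857, 300)))
```

With 5-minute chunks, a full year of retention works out to 105120 windows per measurement, versus 8760 with 1-hour chunks. The single failed 5-minute chunk in the log already held 88763 points, so a larger chunk would return proportionally more points per query.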

If duplicate data is sent, all is OK: datapoints are not duplicated (you can split and restart the copy process as many times as you need).
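
The reason re-sending is safe is that InfluxDB 1.x identifies a point by its measurement, tag set, and timestamp; writing a point with the same identity again updates the stored field values instead of creating a second point. The toy model below illustrates that behaviour with a plain dictionary. It is not InfluxDB code, and the tag and field names are made up for illustration.

```python
from typing import Dict, Tuple

PointKey = Tuple[str, Tuple[Tuple[str, str], ...], int]


def write_point(store: Dict[PointKey, dict], measurement: str,
                tags: Dict[str, str], timestamp_ns: int, fields: dict) -> None:
    """Toy model: a point's identity is (measurement, sorted tag set, timestamp)."""
    key = (measurement, tuple(sorted(tags.items())), timestamp_ns)
    # Same identity: the field sets are merged and newer values win.
    store.setdefault(key, {}).update(fields)


store: Dict[PointKey, dict] = {}
tags = {"host": "esx-01"}  # hypothetical tag

# First copy run, then the same chunk sent again after a restart.
write_point(store, "vsphere_host_sys", tags, 1570657557000000000, {"uptime": 1200})
write_point(store, "vsphere_host_sys", tags, 1570657557000000000, {"uptime": 1200})

print(len(store))  # still a single point
```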

As for questions about the tool, right now there is no better place; you can use the issues to ask us whenever you need.