Open t0k4rt opened 5 years ago
I saw the new option "DBWaitReadyTimeout" in the latest documentation. Do you know when it will be available?
@t0k4rt you should try to increase the cluster spec parameter called syncTimeout
(defaults to 30 minutes). Looks like it's not documented (and it'll probably require a better name).
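For a cluster that is already initialized, the spec can be patched in place rather than re-running init. A minimal sketch (the cluster name, store flags, and the 5h value are illustrative, not from this thread):

```shell
# Raise syncTimeout on a running cluster; stolon accepts Go-style
# duration strings such as "30m" or "5h" for this field.
stolonctl --cluster-name mycluster --store-backend etcdv3 \
  update --patch '{ "syncTimeout": "5h" }'
```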
Thanks a lot! I'll try that!
Hi @sgotti !
> @t0k4rt you should try to increase the cluster spec parameter called syncTimeout (defaults to 30 minutes). Looks like it's not documented (and it'll probably require a better name).
In the long term it makes sense to implement progress monitoring of such background jobs (using external tools such as lsof
for recovery worker) and apply timeouts only for stuck tasks.
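The stuck-vs-slow distinction suggested above could be approximated from userspace by watching which files the recovery process has open, and resetting the timeout whenever that set changes. A minimal sketch (Linux-only, reading /proc; the helper names are hypothetical, not part of stolon):

```python
import os

def open_files(pid):
    """Return the set of paths a process currently has open (via /proc/<pid>/fd)."""
    fd_dir = f"/proc/{pid}/fd"
    files = set()
    for fd in os.listdir(fd_dir):
        try:
            files.add(os.readlink(os.path.join(fd_dir, fd)))
        except OSError:
            continue  # fd closed between listdir and readlink
    return files

def is_stuck(prev, curr):
    """Consider the task stuck only when the open-file set did not change
    between two polls; a changing set means recovery is still progressing."""
    return prev == curr
```

Polling `open_files(keeper_pid)` every minute and applying the timeout only while `is_stuck` keeps returning True would avoid killing a restore that is merely slow.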
Would someone please enhance the documentation of Stolon to include the purpose and default value (30 minutes) of syncTimeout in relation to PITR?
Maybe here:
https://github.com/sorintlab/stolon/blob/master/doc/cluster_spec.md
https://github.com/sorintlab/stolon/blob/master/doc/pitr.md
https://github.com/sorintlab/stolon/blob/master/doc/pitr_wal-e.md
https://github.com/sorintlab/stolon/blob/master/doc/pitr_wal-g.md
We also had the unpleasant experience of discovering this timeout when a point-in-time recovery (a full backup restore plus replay of around 24 hours' worth of WAL) took longer than half an hour.
We used a 24-hour timeout like this:
stolonctl --cluster-name primary-postgres-cluster --store-endpoints ... --log-level info --store-backend etcdv3 init '{
  "syncTimeout": "24h",
  "initMode": "pitr",
  "failInterval": "2m0s",
  "synchronousReplication": true,
  "usePgrewind": true,
  "pitrConfig": {
    "dataRestoreCommand": "envdir /etc/wal-e.d/env wal-e backup-fetch %d LATEST",
    "archiveRecoverySettings": {
      "restoreCommand": "envdir /etc/wal-e.d/env wal-e wal-fetch \"%f\" \"%p\"",
      "recoveryTargetSettings": { "recoveryTargetTime": "2019-12-31 01:02:03" }
    }
  },
  "pgParameters": {
    "max_connections": "1000",
    "shared_buffers": "512MB",
    "local_preload_libraries": "...",
    "extwlist.extensions": "..."
  }
}'
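Since stolonctl init takes the whole cluster spec as a single quoted JSON string, validating the JSON locally first catches quoting and comma mistakes before anything reaches the cluster store. A minimal sketch over a trimmed copy of the spec above:

```python
import json

# Trimmed copy of the cluster spec passed to `stolonctl ... init` above;
# json.loads raises json.JSONDecodeError on malformed input.
spec = """
{
  "syncTimeout": "24h",
  "initMode": "pitr",
  "failInterval": "2m0s",
  "synchronousReplication": true,
  "usePgrewind": true
}
"""
parsed = json.loads(spec)
print(parsed["syncTimeout"])  # -> 24h
```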
@johannesboon Feel free to open an RFE issue to request this to be documented and also a PR to add this to the doc.
Environment
Linux
Stolon version 0.13
Expected behaviour you didn't see
Cluster is initialized with init mode PITR and wal-e credentials in order to fetch remote partitions and WAL files. Cluster starts as expected.
Unexpected behaviour you saw
PostgreSQL is still importing WAL files, but the keeper throws an error and the restoration fails.
postgres + wal-e logs:
Keeper error:
I tried to start a PostgreSQL cluster that is not managed by Stolon; it worked as expected and the cluster started.
Steps to reproduce the problem
The database takes more than 5 hours to restore and uses more than 700 GB on disk. I think this is the main issue.
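For a restore this large, sampling the data directory's size every few minutes can distinguish a slow-but-progressing restore (size keeps growing) from a genuinely stuck one. A minimal sketch (the data directory path and helper name are assumptions for illustration):

```python
import os

def dir_size(path):
    """Total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                continue  # file vanished between walk and stat
    return total
```

Comparing successive samples of `dir_size("/var/lib/stolon/postgres")` (hypothetical path) against each other, rather than against a wall-clock deadline, would show whether the keeper was killed mid-progress.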