sorintlab / stolon

PostgreSQL cloud native High Availability and more.
https://talk.stolon.io
Apache License 2.0

Failed to restore 700gb+ db with wal-e #685

Open t0k4rt opened 5 years ago

t0k4rt commented 5 years ago

Environment

Linux

Stolon version 0.13

Expected behaviour you didn't see

Cluster is initialized with init mode PITR and wal-e credentials in order to fetch the remote base backup and WAL files. The cluster starts as expected.

Unexpected behaviour you saw

PostgreSQL is still importing WAL files, but the keeper throws an error and the restoration fails.

postgres + wal-e logs:

wal_e.operator.backup INFO     MSG: promoted prefetched wal segment
        STRUCTURED: time=2019-07-24T15:35:36.981494-00 pid=31606 action=wal-fetch key=swift://wale-backup/wal_005/000000030000037A000000A3.lzo prefix= seg=000000030000037A000000A3
2019-07-24 15:35:37 UTC LOG:  restored log file "000000030000037A000000A3" from archive

Keeper error:

ERROR   cmd/keeper.go:1163  recovery not finished   {"error": "timeout waiting for db recovery"}
ERROR   cmd/keeper.go:1006  db failed to initialize or resync
ERROR   cmd/keeper.go:641   cannot get configured pg parameters {"error": "pq: the database system is shutting down"}

I tried to start a PostgreSQL cluster that is not managed by stolon; it worked as expected and the cluster started.

Steps to reproduce the problem

The database takes more than 5 hours to restore and uses more than 700 GB on disk. I think this is the main issue.

t0k4rt commented 5 years ago

I saw the new option "DBWaitReadyTimeout" in the latest documentation. Do you know when it will be available?

sgotti commented 5 years ago

@t0k4rt you should try to increase the cluster spec parameter called syncTimeout (defaults to 30 minutes). Looks like it's not documented (and it'll probably require a better name).
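For an already-initialized cluster, the spec can be patched in place with stolonctl rather than re-running init. A minimal sketch, assuming an etcdv3 store and a cluster name of `primary-postgres-cluster` (substitute your own store endpoints and cluster name):

```shell
# Patch the cluster spec to raise syncTimeout from the 30m default,
# so a multi-hour PITR restore is not killed by the keeper.
stolonctl --cluster-name primary-postgres-cluster \
    --store-backend etcdv3 \
    --store-endpoints http://127.0.0.1:2379 \
    update --patch '{ "syncTimeout": "6h" }'
```

The value uses Go duration syntax ("30m", "6h", "24h"), the same format as the other timeout fields in the cluster spec.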

t0k4rt commented 5 years ago

Thanks a lot ! I'll try that !

maksm90 commented 5 years ago

Hi @sgotti !

> @t0k4rt you should try to increase the cluster spec parameter called syncTimeout (defaults to 30 minutes). Looks like it's not documented (and it'll probably require a better name).

In the long term it makes sense to implement progress monitoring of such background jobs (using external tools such as lsof for the recovery worker) and to apply timeouts only to stuck tasks.
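Until something like that exists, recovery progress can be watched by hand: the PostgreSQL startup process puts the WAL segment it is currently replaying into its process title. A hedged sketch of extracting that segment name (on a live keeper host you would pipe `ps -eo args` into this instead of the sample line below):

```shell
# Pull the 24-hex-digit WAL segment name out of a startup-process title
# like "postgres: startup recovering 000000030000037A000000A3".
current_segment() {
  grep -o 'recovering [0-9A-F]\{24\}' | awk '{print $2}'
}

# Example with a captured ps line:
echo 'postgres: startup recovering 000000030000037A000000A3' | current_segment
# -> 000000030000037A000000A3
```

If the printed segment keeps advancing, recovery is making progress and only the timeout is the problem; if it stays the same, the restore is genuinely stuck.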

johannesboon commented 4 years ago

Would someone please enhance the documentation of Stolon to include the purpose and default value (30 minutes) of syncTimeout in relation to PITR?

Maybe here:

https://github.com/sorintlab/stolon/blob/master/doc/cluster_spec.md
https://github.com/sorintlab/stolon/blob/master/doc/pitr.md
https://github.com/sorintlab/stolon/blob/master/doc/pitr_wal-e.md
https://github.com/sorintlab/stolon/blob/master/doc/pitr_wal-g.md

We also had the unpleasant experience of discovering this timeout when a point-in-time recovery (full backup plus around 24 hours' worth of WAL) took longer than half an hour.

We used a 24 hour timeout like this:

stolonctl --cluster-name primary-postgres-cluster --store-endpoints ... --log-level info --store-backend etcdv3 init '{
     "syncTimeout": "24h",
     "initMode": "pitr",
     "failInterval": "2m0s",
     "synchronousReplication": true,
     "usePgrewind": true,
     "pitrConfig": {
         "dataRestoreCommand": "envdir /etc/wal-e.d/env wal-e backup-fetch %d LATEST",
         "archiveRecoverySettings": {
             "restoreCommand": "envdir /etc/wal-e.d/env wal-e wal-fetch \"%f\" \"%p\"",
             "recoveryTargetSettings": { "recoveryTargetTime": "2019-12-31 01:02:03" }
         }
     },
     "pgParameters": {
         "max_connections": "1000",
         "shared_buffers": "512MB",
         "local_preload_libraries": "...",
         "extwlist.extensions": "..."
     }
}'
sgotti commented 4 years ago

@johannesboon Feel free to open an RFE issue to request this to be documented and also a PR to add this to the doc.