[Open] 4nte opened this issue 2 years ago
Ran into the same issue just now. I had clean-reinstalled Timescale using Helm a day back; earlier, when I had it running for almost a month, this issue didn't arise. Pretty curious what could have caused this.
Can you share the logs, if you still have them, so we can get more info on this?
Looking at the pgbackrest logs of the master instance, here is the first failed backup, after which all subsequent backups failed too.
-------------------PROCESS START-------------------
2021-10-29 02:12:11.949 P00 INFO: backup command begin 2.32: --compress-level=3 --compress-type=lz4 --config=/etc/pgbackrest/pgbackrest.conf --exec-id=735-c4f0716d --log-level-console=off --log-level-stderr=warn --pg1-path=/var/lib/postgresql/data --pg1-port=5432 --pg1-socket-path=/var/run/postgresql --process-max=4 --repo1-cipher-type=none --repo1-path=/staging/timescale --repo1-retention-diff=2 --repo1-retention-full=2 --repo1-s3-bucket=<redacted> --repo1-s3-endpoint=fra1.digitaloceanspaces.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=fra1 --repo1-type=s3 --stanza=poddb --start-fast --type=incr
2021-10-29 02:12:12.112 P00 WARN: unable to check pg-1: [DbConnectError] unable to connect to 'dbname='postgres' port=5432 host='/var/run/postgresql'': could not connect to server: No such file or directory
Is the server running locally and accepting
connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5432"?
2021-10-29 02:12:12.114 P00 ERROR: [056]: unable to find primary cluster - cannot proceed
2021-10-29 02:12:12.114 P00 INFO: backup command end: aborted with exception [056]
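The DbConnectError above says pgBackRest could not find the Unix socket at /var/run/postgresql. As a first check, one could verify from inside the pod that the socket exists and that Postgres answers on it. A minimal sketch; the pod name, namespace, and container name are assumptions taken from a default timescaledb-single install, not from this issue:

```shell
POD=timescale-timescaledb-0   # assumed pod name
NS=default                    # assumed namespace

# Does the socket file exist where pgBackRest expects it?
kubectl exec -n "$NS" "$POD" -c timescaledb -- \
  ls -l /var/run/postgresql/.s.PGSQL.5432

# Does Postgres accept connections on that socket?
kubectl exec -n "$NS" "$POD" -c timescaledb -- \
  pg_isready -h /var/run/postgresql -p 5432
```

If pg_isready reports no response, the backup failure is a symptom of Postgres itself being down on that pod (or the pod not being the primary), rather than a pgBackRest misconfiguration.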
> Can you share logs, if you still have them, so we get more info on this?
Thanks for your quick reply. In my case I suspect it was something to do with my k8s setup, because I have backups disabled but I have two Prometheus instances feeding massive amounts of data into Timescale.
For now, I just clean-reinstalled and increased the sizes of the data (20 GB) and WAL (10 GB) persistent volumes.
If I run into it again I will update it here asap.
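Since a full WAL volume is what ultimately broke the deployment, a cheap guard is to watch volume usage before it hits 100%. A minimal sketch in plain shell; the 80% threshold and the mount checked are assumptions, not chart defaults:

```shell
# Minimal sketch: warn when a mount's usage crosses a threshold.
# The 80% threshold and the example mount are assumptions.
check_usage() {
  mount="$1"
  # df --output=pcent prints e.g. "Use%\n 45%"; keep only the digits
  pct=$(df --output=pcent "$mount" | tail -1 | tr -dc '0-9')
  if [ "${pct:-0}" -ge 80 ]; then
    echo "WARN: $mount at ${pct}% used - consider resizing the PVC before it fills"
  else
    echo "OK: $mount at ${pct}% used"
  fi
}

# Inside the pod this would be run against the data and WAL mounts, e.g.:
check_usage /
```

In a cluster this would typically run via kubectl exec against the pod's data and WAL mounts, or be replaced by a proper disk-usage alert in Prometheus, which is already in this setup.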
@4nte you mentioned that one of the replicas was promoted. That means the earlier master is no longer accessible, and maybe the backup was trying to connect to the old master and hence erroring out? Shouldn't the backup connect to the promoted replica, since it should be the new master?
> @4nte you mentioned that one of the replicas has been promoted. That means the earlier master is no more accessible and maybe the backup was trying to connect to the old master and hence is erroring out? Shouldn't the promoted replica be connected to for the backup since it should be the new master?
timescale-timescaledb-0 is labeled with role=promoted, but running patronictl topology gives:
+ Cluster: timescale (uninitialized) ------+---------+---------+-----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+---------------------------+--------------+---------+---------+-----+-----------+
| + timescale-timescaledb-0 | 10.244.1.170 | Replica | running | 113 | 0 |
| + timescale-timescaledb-1 | 10.244.2.31 | Replica | running | 96 | 0 |
| + timescale-timescaledb-2 | 10.244.0.104 | Replica | running | 96 | 0 |
+---------------------------+--------------+---------+---------+-----+-----------+
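With the cluster reported as "uninitialized" and every member as Replica, Patroni has no leader to back up from. Before forcing anything, one could inspect Patroni's own view and the timeline history. A hedged sketch, assuming the pod and cluster names from the output above and the default namespace:

```shell
POD=timescale-timescaledb-0   # assumed pod name, from the topology above
NS=default                    # assumed namespace

# Current member states as Patroni sees them
kubectl exec -n "$NS" "$POD" -- patronictl list

# Timeline history; diverged timelines (113 vs 96 above) should show up here
kubectl exec -n "$NS" "$POD" -- patronictl history

# As a last resort, a stuck member can be rebuilt from the leader once one
# exists; this wipes that member's data directory, so use with care:
# kubectl exec -n "$NS" "$POD" -- patronictl reinit timescale timescale-timescaledb-1
```

The timeline mismatch between member 0 (TL 113) and members 1 and 2 (TL 96) is consistent with a promotion that did not complete cleanly.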
So it's fair to suspect that a promotion failed mid-process; I don't know why that happened, though.
@4nte try this postgresql.fastware.com
After some time, the timescale deployment stopped working with a "No space left on device" error. pgbackrest had been creating backups for about two months; then one day it apparently couldn't connect to Timescale, and I guess that's why the pg_wal partition wasn't getting archived anymore, hence the full disk? I'm not entirely sure what the cause is.
Conclusion: the last successful backup was on 28.10.2021.
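If unarchived WAL is what filled the disk, the archiver statistics should show it. A minimal sketch of how one might check, assuming the same pod name as above and that psql can reach the local socket as the postgres user (all assumptions):

```shell
POD=timescale-timescaledb-0   # assumed pod name
NS=default                    # assumed namespace

# Archiver counters: a growing failed_count with a stale last_archived_wal
# points at WAL archiving being stuck.
kubectl exec -n "$NS" "$POD" -- psql -U postgres -c \
  "SELECT archived_count, failed_count, last_archived_wal, last_failed_wal
     FROM pg_stat_archiver;"

# How much WAL has piled up on disk:
kubectl exec -n "$NS" "$POD" -- du -sh /var/lib/postgresql/data/pg_wal
```

A last_archived_wal timestamp stuck around the date of the last successful backup would support the theory that the backup failure and the disk filling up share one cause.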
Using patronictl topology, I can see that all postgres instances have role Replica. The timescale-timescaledb-0 pod has the label role=promoted, while the others have role=replica.
I'm unable to psql into the postgres pod:
Highlighted error logs
(Full logs available below)
Deployment: Timescaledb-single v0.8.2
Chart values:
Pgrest logs
Timescaledb logs
Master timescaledb instance logs
pgbackrest logs
Additional context: Kubernetes v1.20.2