timescale / helm-charts

Configuration and Documentation to run TimescaleDB in your Kubernetes cluster
Apache License 2.0

`No space left on device` - All timescale instances are replicas #320

Open 4nte opened 2 years ago

4nte commented 2 years ago

After some time, the timescale deployment stopped working with a `No space left on device` error. pgrest had been creating backups for about two months; then one day it apparently could no longer connect to timescale, and I guess that's why the pg_wal partition wasn't being archived anymore and the disk filled up?

I'm not entirely sure what the cause of this is.

Conclusion: the last successful backup was on 28.10.2021.

Using `patronictl topology` I can see that all postgres instances have the Replica role:

+ Cluster: timescale (uninitialized) ------+---------+---------+-----+-----------+
| Member                    | Host         | Role    | State   |  TL | Lag in MB |
+---------------------------+--------------+---------+---------+-----+-----------+
| + timescale-timescaledb-0 | 10.244.1.170 | Replica | running | 113 |         0 |
| + timescale-timescaledb-1 | 10.244.2.31  | Replica | running |  96 |         0 |
| + timescale-timescaledb-2 | 10.244.0.104 | Replica | running |  96 |         0 |
+---------------------------+--------------+---------+---------+-----+-----------+
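The `uninitialized` marker in the cluster header is itself suspicious. Patroni's view of the cluster can be cross-checked from inside any member; a hedged sketch, assuming the chart's default pod names and that curl is available in the image:

kubectl exec -it timescale-timescaledb-0 -- patronictl list
# Patroni's REST API on port 8008 (behind the timescale-config service) reports
# member roles, timelines and lag as JSON, if curl is present in the image:
kubectl exec -it timescale-timescaledb-0 -- curl -s http://localhost:8008/cluster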

The timescale-timescaledb-0 pod has the label role=promoted, while the others have role=replica.

I'm unable to psql into the postgres pod:

psql: error: could not connect to server: No such file or directory
    Is the server running locally and accepting
    connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5432"?
command terminated with exit code 2
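
Since the Unix socket is missing, PostgreSQL is not running in that container at all. A quick way to confirm that it is the separate WAL volume that filled up (container name timescaledb assumed from the chart defaults):

kubectl exec -it timescale-timescaledb-0 -c timescaledb -- df -h /var/lib/postgresql/wal
kubectl exec -it timescale-timescaledb-0 -c timescaledb -- du -sh /var/lib/postgresql/wal/pg_wal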

Highlighted error logs

2021-12-07 18:41:03 UTC [31394]: [61afaabd.7aa2-9] @,app= [53100] FATAL:  could not write to file "pg_wal/xlogtemp.31394": No space left on device
ERROR: [064]: unable to write '/var/lib/postgresql/data/pg_wal/RECOVERYHISTORY': [28] No space left on device
ERROR: [064]: unable to write '/var/lib/postgresql/data/pg_wal/RECOVERYXLOG'
2021-11-30 02:13:02,601 - WARNING - backup - WARN: unable to check pg-1: [DbConnectError] unable to connect to 'dbname='postgres' port=5432 host='/var/run/postgresql'': could not connect to server: No such file or directory

(Full logs available below)

Deployment: timescaledb-single v0.8.2

Chart values:

    backup:
      enabled: true
    postInit:
      - configMap:
          name: timescale-post-init
          optional: false
    patroni:
      bootstrap:
        dcs:
          postgresql:
            parameters:
              password_encryption: scram-sha-256
              ssl: 'off'  # disable all ssl connections to not require Patroni certificate
      postgresql:
        pg_hba:
          - local     all             postgres                            peer           # allow local connections
          - local     all             all                                 scram-sha-256
          - hostnossl all,replication all               all               scram-sha-256  # allow TCP/IP connections (no ssl)
kubectl get all,secret,configmap,endpoints,pvc -L role -l release=$RELEASE

NAME                          READY   STATUS    RESTARTS   AGE   ROLE
pod/timescale-timescaledb-0   1/2     Running   0          64d   promoted
pod/timescale-timescaledb-1   2/2     Running   0          64d   replica
pod/timescale-timescaledb-2   2/2     Running   0          64d   replica

NAME                                   TYPE           CLUSTER-IP       EXTERNAL-IP      PORT(S)          AGE   ROLE
service/timescale                      LoadBalancer   10.245.213.171   144.126.246.94   5432:31638/TCP   64d   master
service/timescale-config               ClusterIP      None             <none>           8008/TCP         64d   
service/timescale-replica              ClusterIP      10.245.113.208   <none>           5432/TCP         64d   replica
service/timescale-timescaledb-backup   ClusterIP      10.245.46.103    <none>           8081/TCP         64d   

NAME                                     READY   AGE   ROLE
statefulset.apps/timescale-timescaledb   2/3     64d   

NAME                                                    SCHEDULE        SUSPEND   ACTIVE   LAST SCHEDULE   AGE   ROLE
cronjob.batch/timescale-timescaledb-full-weekly         12 02 * * 0     False     0        2d16h           64d   
cronjob.batch/timescale-timescaledb-incremental-daily   12 02 * * 1-6   False     0        16h             64d   

NAME                                         DATA   AGE   ROLE
configmap/timescale-timescaledb-patroni      1      64d   
configmap/timescale-timescaledb-pgbackrest   1      64d   
configmap/timescale-timescaledb-scripts      8      64d   

NAME                                     ENDPOINTS                            AGE   ROLE
endpoints/timescale                      10.244.1.170:5432                    64d   
endpoints/timescale-config               10.244.0.104:8008,10.244.2.31:8008   64d   
endpoints/timescale-replica              10.244.0.104:5432,10.244.2.31:5432   64d   replica
endpoints/timescale-timescaledb-backup   <none>                               64d   

NAME                                                           STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS       AGE   ROLE
persistentvolumeclaim/storage-volume-timescale-timescaledb-0   Bound    pvc-d1dc9e4b-bbd1-4622-8707-43e147476fc2   2Gi        RWO            do-block-storage   64d   
persistentvolumeclaim/storage-volume-timescale-timescaledb-1   Bound    pvc-099d2b96-0df8-4842-b0a0-61b6b110601f   2Gi        RWO            do-block-storage   64d   
persistentvolumeclaim/storage-volume-timescale-timescaledb-2   Bound    pvc-16184868-34a6-415c-a5cb-06478056d378   2Gi        RWO            do-block-storage   64d   
persistentvolumeclaim/wal-volume-timescale-timescaledb-0       Bound    pvc-c60b1bea-76a6-4520-8db0-c46ccc7d724e   1Gi        RWO            do-block-storage   64d   
persistentvolumeclaim/wal-volume-timescale-timescaledb-1       Bound    pvc-f8c85906-cc4a-4d41-b739-4dbf073446d9   1Gi        RWO            do-block-storage   64d   
persistentvolumeclaim/wal-volume-timescale-timescaledb-2       Bound    pvc-567cb8a0-18be-48f8-83bb-ad947df0e87b   1Gi        RWO            do-block-storage   64d 
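
The WAL PVCs are only 1Gi, which a stuck archiver can fill quickly. One possible mitigation, sketched here on the assumption that the do-block-storage class allows volume expansion, is to grow the full PVC in place:

kubectl patch pvc wal-volume-timescale-timescaledb-0 \
  -p '{"spec":{"resources":{"requests":{"storage":"5Gi"}}}}'
# the filesystem is usually resized on the next mount, so restart the pod:
kubectl delete pod timescale-timescaledb-0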

Pgrest logs

2021-11-30 02:12:35,920 - ERROR - backup - Backup 20211130021235 failed with returncode 56
2021-11-30 02:12:35,920 - INFO - history - Refreshing backup history using pgbackrest
2021-11-30 02:12:35,935 - INFO - http - 10.244.1.218 - - "POST /backups/ HTTP/1.1" 500 -
2021-11-30 02:12:36,924 - DEBUG - backup - Waiting until backup triggered
2021-11-30 02:13:02,437 - INFO - backup - Starting backup
2021-11-30 02:13:02,601 - WARNING - backup - WARN: unable to check pg-1: [DbConnectError] unable to connect to 'dbname='postgres' port=5432 host='/var/run/postgresql'': could not connect to server: No such file or directory
2021-11-30 02:13:02,601 - INFO - backup -           Is the server running locally and accepting
2021-11-30 02:13:02,602 - INFO - backup -           connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5432"?
2021-11-30 02:13:02,602 - ERROR - backup - ERROR: [056]: unable to find primary cluster - cannot proceed
2021-11-30 02:13:02,608 - DEBUG - backup - Backup details
{
    "age": 0.0,
    "duration": 0.0,
    "finished": "2021-11-30T02:13:02+00:00",
    "label": "20211130021302",
    "pgbackrest": {},
    "pid": 1564,
    "returncode": 56,
    "started": "2021-11-30T02:13:02+00:00",
    "status": "RUNNING"
}
2021-11-30 02:13:02,612 - ERROR - backup - Backup 20211130021302 failed with returncode 56
2021-11-30 02:13:02,612 - INFO - history - Refreshing backup history using pgbackrest
2021-11-30 02:13:02,645 - INFO - http - 10.244.1.218 - - "POST /backups/ HTTP/1.1" 500 -
2021-11-30 02:13:03,616 - DEBUG - backup - Waiting until backup triggered
2021-11-30 03:13:03,831 - INFO - history - Refreshing backup history using pgbackrest
2021-11-30 04:13:05,068 - INFO - history - Refreshing backup history using pgbackrest
2021-11-30 05:13:06,228 - INFO - history - Refreshing backup history using pgbackrest
2021-11-30 06:13:07,456 - INFO - history - Refreshing backup history using pgbackrest
2021-11-30 07:13:08,644 - INFO - history - Refreshing backup history using pgbackrest
2021-11-30 08:13:11,710 - INFO - history - Refreshing backup history using pgbackrest
2021-11-30 09:13:12,932 - INFO - history - Refreshing backup history using pgbackrest
2021-11-30 10:13:14,096 - INFO - history - Refreshing backup history using pgbackrest
2021-11-30 11:13:15,240 - INFO - history - Refreshing backup history using pgbackrest

Timescaledb logs

2021-12-07 18:41:24 UTC [31449]: [61afaad4.7ad9-1] @,app= [00000] LOG:  starting PostgreSQL 12.6 (Debian 12.6-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
2021-12-07 18:41:24 UTC [31449]: [61afaad4.7ad9-2] @,app= [00000] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2021-12-07 18:41:24 UTC [31449]: [61afaad4.7ad9-3] @,app= [00000] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2021-12-07 18:41:24 UTC [31452]: [61afaad4.7adc-1] [unknown]@[unknown],app=[unknown] [00000] LOG:  connection received: host=[local]
2021-12-07 18:41:24 UTC [31451]: [61afaad4.7adb-1] @,app= [00000] LOG:  database system was interrupted while in recovery at log time 2021-10-28 11:05:35 UTC
2021-12-07 18:41:24 UTC [31451]: [61afaad4.7adb-2] @,app= [00000] HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2021-12-07 18:41:24 UTC [31452]: [61afaad4.7adc-2] postgres@postgres,app=[unknown] [57P03] FATAL:  the database system is starting up
/var/run/postgresql:5432 - rejecting connections
2021-12-07 18:41:24 UTC [31454]: [61afaad4.7ade-1] [unknown]@[unknown],app=[unknown] [00000] LOG:  connection received: host=[local]
2021-12-07 18:41:24 UTC [31454]: [61afaad4.7ade-2] postgres@postgres,app=[unknown] [57P03] FATAL:  the database system is starting up
/var/run/postgresql:5432 - rejecting connections
2021-12-07 18:41:25 UTC [31451]: [61afaad4.7adb-3] @,app= [00000] LOG:  entering standby mode
ERROR: [064]: unable to write '/var/lib/postgresql/data/pg_wal/RECOVERYHISTORY': [28] No space left on device
ERROR: [064]: unable to write '/var/lib/postgresql/data/pg_wal/RECOVERYXLOG'
2021-12-07 18:41:25 UTC [31451]: [61afaad4.7adb-4] @,app= [00000] LOG:  redo starts at 2/2C01FF00
2021-12-07 18:41:25 UTC [31451]: [61afaad4.7adb-5] @,app= [00000] LOG:  consistent recovery state reached at 2/2D000000
2021-12-07 18:41:25 UTC [31449]: [61afaad4.7ad9-4] @,app= [00000] LOG:  database system is ready to accept read only connections
2021-12-07 18:41:25 UTC [31473]: [61afaad5.7af1-1] [unknown]@[unknown],app=[unknown] [00000] LOG:  connection received: host=[local]
2021-12-07 18:41:25 UTC [31473]: [61afaad5.7af1-2] postgres@postgres,app=[unknown] [00000] LOG:  connection authorized: user=postgres database=postgres application_name=pg_isready
/var/run/postgresql:5432 - accepting connections
2021-12-07 18:41:25 UTC [31473]: [61afaad5.7af1-3] postgres@postgres,app=pg_isready [00000] LOG:  disconnection: session time: 0:00:00.028 user=postgres database=postgres host=[local]
2021-12-07 18:41:25 UTC [31475]: [61afaad5.7af3-1] [unknown]@[unknown],app=[unknown] [00000] LOG:  connection received: host=[local]
2021-12-07 18:41:25 UTC [31475]: [61afaad5.7af3-2] postgres@postgres,app=[unknown] [00000] LOG:  connection authorized: user=postgres database=postgres application_name=Patroni
2021-12-07 18:41:25 UTC [31477]: [61afaad5.7af5-1] [unknown]@[unknown],app=[unknown] [00000] LOG:  connection received: host=127.0.0.1 port=52030
2021-12-07 18:41:25 UTC [31477]: [61afaad5.7af5-2] standby@[unknown],app=[unknown] [00000] LOG:  replication connection authorized: user=standby
2021-12-07 18:41:25 UTC [31477]: [61afaad5.7af5-3] standby@[unknown],app=[unknown] [00000] LOG:  disconnection: session time: 0:00:00.014 user=standby database= host=127.0.0.1 port=52030
server promoting
2021-12-07 18:41:25 UTC [31451]: [61afaad4.7adb-6] @,app= [00000] LOG:  received promote request
2021-12-07 18:41:25 UTC [31451]: [61afaad4.7adb-7] @,app= [00000] LOG:  redo done at 2/2C01FFE8
ERROR: [064]: unable to write '/var/lib/postgresql/data/pg_wal/RECOVERYXLOG'
2021-12-07 18:41:26 UTC [31451]: [61afaad4.7adb-8] @,app= [00000] LOG:  selected new timeline ID: 114
2021-12-07 18:41:26 UTC [31451]: [61afaad4.7adb-9] @,app= [53100] FATAL:  could not write to file "pg_wal/xlogtemp.31451": No space left on device
2021-12-07 18:41:26 UTC [31449]: [61afaad4.7ad9-5] @,app= [00000] LOG:  startup process (PID 31451) exited with exit code 1
2021-12-07 18:41:26 UTC [31449]: [61afaad4.7ad9-6] @,app= [00000] LOG:  terminating any other active server processes
2021-12-07 18:41:26 UTC [31475]: [61afaad5.7af3-3] postgres@postgres,app=Patroni [57P02] WARNING:  terminating connection because of crash of another server process
2021-12-07 18:41:26 UTC [31475]: [61afaad5.7af3-4] postgres@postgres,app=Patroni [57P02] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2021-12-07 18:41:26 UTC [31475]: [61afaad5.7af3-5] postgres@postgres,app=Patroni [57P02] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2021-12-07 18:41:26 UTC [31449]: [61afaad4.7ad9-7] @,app= [00000] LOG:  database system is shut down

Master timescaledb instance pg_wal directory listing

/var/lib/postgresql/wal/pg_wal$ ls -al
total 927948
drwx------ 3 postgres postgres    12288 Dec  8 11:43 .
drwxrwsr-x 4 root     postgres     4096 Oct  4 16:10 ..
-rw------- 1 postgres postgres      373 Oct  7 02:12 000000010000000000000068.00000028.backup
-rw-r----- 1 postgres postgres       42 Oct  8 18:34 00000002.history
-rw------- 1 postgres postgres      372 Oct  8 02:12 000000020000000000000099.00000028.backup
-rw-r----- 1 postgres postgres       85 Oct  8 22:18 00000003.history
-rw-r----- 1 postgres postgres      128 Oct  9 02:26 00000004.history
-rw------- 1 postgres postgres      373 Oct  9 02:12 0000000400000000000000C9.00000028.backup
-rw-r----- 1 postgres postgres      171 Oct  9 06:31 00000005.history
-rw-r----- 1 postgres postgres      214 Oct  9 14:55 00000006.history
-rw-r----- 1 postgres postgres      257 Oct  9 18:43 00000007.history
-rw-r----- 1 postgres postgres      300 Oct  9 22:44 00000008.history
-rw-r----- 1 postgres postgres      343 Oct 10 02:39 00000009.history
-rw------- 1 postgres postgres      373 Oct 10 02:12 0000000900000000000000FA.00000028.backup
-rw-r----- 1 postgres postgres      386 Oct 10 06:50 0000000A.history
-rw-r----- 1 postgres postgres      429 Oct 10 10:51 0000000B.history
-rw-r----- 1 postgres postgres      472 Oct 10 14:42 0000000C.history
-rw-r----- 1 postgres postgres      516 Oct 10 18:42 0000000D.history
-rw-r----- 1 postgres postgres      560 Oct 10 22:47 0000000E.history
-rw-r----- 1 postgres postgres      604 Oct 11 02:37 0000000F.history
-rw------- 1 postgres postgres      375 Oct 11 02:12 0000000F0000000100000028.00000028.backup
-rw-r----- 1 postgres postgres      648 Oct 11 06:52 00000010.history
-rw-r----- 1 postgres postgres      692 Oct 11 10:49 00000011.history
-rw-r----- 1 postgres postgres      736 Oct 11 15:01 00000012.history
-rw-r----- 1 postgres postgres      780 Oct 11 19:21 00000013.history
-rw-r----- 1 postgres postgres      824 Oct 11 23:01 00000014.history
-rw-r----- 1 postgres postgres      868 Oct 12 02:58 00000015.history
-rw------- 1 postgres postgres      375 Oct 12 02:12 000000150000000100000056.00000028.backup
-rw-r----- 1 postgres postgres      912 Oct 12 07:03 00000016.history
-rw-r----- 1 postgres postgres      956 Oct 12 11:07 00000017.history
-rw-r----- 1 postgres postgres     1000 Oct 12 15:02 00000018.history
-rw-r----- 1 postgres postgres     1044 Oct 12 18:58 00000019.history
-rw-r----- 1 postgres postgres     1088 Oct 12 22:53 0000001A.history
-rw-r----- 1 postgres postgres     1132 Oct 13 02:50 0000001B.history
-rw------- 1 postgres postgres      375 Oct 13 02:12 0000001B0000000100000067.00000060.backup
-rw-r----- 1 postgres postgres     1176 Oct 13 06:54 0000001C.history
-rw-r----- 1 postgres postgres     1220 Oct 13 10:56 0000001D.history
-rw-r----- 1 postgres postgres     1264 Oct 13 14:48 0000001E.history
-rw-r----- 1 postgres postgres     1308 Oct 13 18:45 0000001F.history
-rw-r----- 1 postgres postgres     1352 Oct 13 22:57 00000020.history
-rw-r----- 1 postgres postgres     1396 Oct 14 02:46 00000021.history
-rw------- 1 postgres postgres      375 Oct 14 02:12 000000210000000100000074.00000060.backup
-rw-r----- 1 postgres postgres     1440 Oct 14 10:49 00000022.history
-rw-r----- 1 postgres postgres     1484 Oct 14 14:45 00000023.history
-rw-r----- 1 postgres postgres     1528 Oct 14 18:48 00000024.history
-rw-r----- 1 postgres postgres     1572 Oct 14 22:50 00000025.history
-rw-r----- 1 postgres postgres     1616 Oct 15 02:44 00000026.history
-rw------- 1 postgres postgres      375 Oct 15 02:12 00000026000000010000007F.00000060.backup
-rw-r----- 1 postgres postgres     1660 Oct 15 06:39 00000027.history
-rw-r----- 1 postgres postgres     1704 Oct 15 10:39 00000028.history
-rw-r----- 1 postgres postgres     1748 Oct 15 14:27 00000029.history
-rw-r----- 1 postgres postgres     1792 Oct 15 18:44 0000002A.history
-rw-r----- 1 postgres postgres     1836 Oct 15 22:31 0000002B.history
-rw-r----- 1 postgres postgres     1880 Oct 16 02:39 0000002C.history
-rw------- 1 postgres postgres      375 Oct 16 02:12 0000002C0000000100000090.00000028.backup
-rw-r----- 1 postgres postgres     1924 Oct 16 06:27 0000002D.history
-rw-r----- 1 postgres postgres     1968 Oct 16 10:37 0000002E.history
-rw-r----- 1 postgres postgres     2012 Oct 16 14:44 0000002F.history
-rw-r----- 1 postgres postgres     2056 Oct 16 18:41 00000030.history
-rw-r----- 1 postgres postgres     2100 Oct 16 22:34 00000031.history
-rw-r----- 1 postgres postgres     2144 Oct 17 02:09 00000032.history
-rw-r----- 1 postgres postgres     2188 Oct 17 06:30 00000033.history
-rw------- 1 postgres postgres      375 Oct 17 02:12 00000033000000010000009C.00000060.backup
-rw-r----- 1 postgres postgres     2232 Oct 17 10:28 00000034.history
-rw-r----- 1 postgres postgres     2276 Oct 17 14:06 00000035.history
-rw-r----- 1 postgres postgres     2320 Oct 17 22:26 00000036.history
-rw-r----- 1 postgres postgres     2364 Oct 18 02:08 00000037.history
-rw-r----- 1 postgres postgres     2408 Oct 18 06:28 00000038.history
-rw------- 1 postgres postgres      375 Oct 18 02:12 0000003800000001000000A6.00000060.backup
-rw-r----- 1 postgres postgres     2452 Oct 18 09:58 00000039.history
-rw-r----- 1 postgres postgres     2496 Oct 18 13:24 0000003A.history
-rw-r----- 1 postgres postgres     2540 Oct 18 21:29 0000003B.history
-rw-r----- 1 postgres postgres     2584 Oct 19 01:52 0000003C.history
-rw-r----- 1 postgres postgres     2628 Oct 19 06:23 0000003D.history
-rw------- 1 postgres postgres      375 Oct 19 02:12 0000003D00000001000000B3.00000028.backup
-rw-r----- 1 postgres postgres     2672 Oct 19 09:44 0000003E.history
-rw-r----- 1 postgres postgres     2716 Oct 19 13:15 0000003F.history
-rw-r----- 1 postgres postgres     2760 Oct 19 14:12 00000040.history
-rw-r----- 1 postgres postgres     2804 Oct 19 18:14 00000041.history
-rw-r----- 1 postgres postgres     2848 Oct 20 02:07 00000042.history
-rw-r----- 1 postgres postgres     2892 Oct 20 06:09 00000043.history
-rw------- 1 postgres postgres      374 Oct 20 02:12 0000004300000001000000BF.00000028.backup
-rw-r----- 1 postgres postgres     2936 Oct 20 10:33 00000044.history
-rw-r----- 1 postgres postgres     2980 Oct 20 14:18 00000045.history
-rw-r----- 1 postgres postgres     3024 Oct 20 17:38 00000046.history
-rw-r----- 1 postgres postgres     3068 Oct 20 21:24 00000047.history
-rw-r----- 1 postgres postgres     3112 Oct 21 01:20 00000048.history
-rw-r----- 1 postgres postgres     3156 Oct 21 06:03 00000049.history
-rw------- 1 postgres postgres      375 Oct 21 02:12 0000004900000001000000CD.00000060.backup
-rw-r----- 1 postgres postgres     3200 Oct 21 09:52 0000004A.history
-rw-r----- 1 postgres postgres     3244 Oct 21 13:39 0000004B.history
-rw-r----- 1 postgres postgres     3288 Oct 21 16:40 0000004C.history
-rw-r----- 1 postgres postgres     3332 Oct 21 17:40 0000004D.history
-rw-r----- 1 postgres postgres     3376 Oct 21 21:33 0000004E.history
-rw-r----- 1 postgres postgres     3420 Oct 22 01:18 0000004F.history
-rw-r----- 1 postgres postgres     3464 Oct 22 08:13 00000050.history
-rw------- 1 postgres postgres      375 Oct 22 02:12 0000005000000001000000DA.00000060.backup
-rw-r----- 1 postgres postgres     3508 Oct 23 00:18 00000051.history
-rw-r----- 1 postgres postgres     3552 Oct 23 04:48 00000052.history
-rw------- 1 postgres postgres      375 Oct 23 02:12 0000005200000001000000E5.00000060.backup
-rw-r----- 1 postgres postgres     3596 Oct 23 09:03 00000053.history
-rw-r----- 1 postgres postgres     3640 Oct 23 13:35 00000054.history
-rw-r----- 1 postgres postgres     3684 Oct 23 17:18 00000055.history
-rw-r----- 1 postgres postgres     3728 Oct 23 17:19 00000056.history
-rw-r----- 1 postgres postgres     3772 Oct 23 21:28 00000057.history
-rw-r----- 1 postgres postgres     3816 Oct 24 01:21 00000058.history
-rw-r----- 1 postgres postgres     3860 Oct 24 04:55 00000059.history
-rw------- 1 postgres postgres      375 Oct 24 02:12 0000005900000001000000F2.00000028.backup
-rw-r----- 1 postgres postgres     3904 Oct 24 09:23 0000005A.history
-rw-r----- 1 postgres postgres     3948 Oct 24 16:53 0000005B.history
-rw-r----- 1 postgres postgres     3992 Oct 24 21:04 0000005C.history
-rw-r----- 1 postgres postgres     4036 Oct 25 01:02 0000005D.history
-rw-r----- 1 postgres postgres     4080 Oct 25 04:47 0000005E.history
-rw------- 1 postgres postgres      375 Oct 25 02:12 0000005E00000001000000FF.00000060.backup
-rw-r----- 1 postgres postgres     4123 Oct 25 09:01 0000005F.history
-rw------- 1 postgres postgres 16777216 Oct 25 09:01 0000005F0000000200000004.partial
-rw-r----- 1 postgres postgres     4166 Oct 25 12:51 00000060.history
-rw------- 1 postgres postgres 16777216 Oct 25 09:31 000000600000000200000004
-rw------- 1 postgres postgres 16777216 Oct 25 12:51 000000600000000200000005
-rw-r----- 1 postgres postgres 16777216 Oct 25 12:51 000000600000000200000006.partial
-rw-r----- 1 postgres postgres     4209 Oct 25 16:52 00000061.history
-rw-r----- 1 postgres postgres 16777216 Oct 25 16:52 000000610000000200000006
-rw-r----- 1 postgres postgres     4252 Oct 25 20:52 00000062.history
-rw------- 1 postgres postgres 16777216 Oct 25 17:22 000000620000000200000007
-rw------- 1 postgres postgres 16777216 Oct 25 20:52 000000620000000200000008.partial
-rw-r----- 1 postgres postgres     4295 Oct 26 02:52 00000063.history
-rw------- 1 postgres postgres 16777216 Oct 25 21:22 000000630000000200000008
-rw------- 1 postgres postgres 16777216 Oct 25 23:22 000000630000000200000009
-rw------- 1 postgres postgres 16777216 Oct 26 00:22 00000063000000020000000A
-rw------- 1 postgres postgres 16777216 Oct 26 02:12 00000063000000020000000B
-rw-r----- 1 postgres postgres 16777216 Oct 26 02:12 00000063000000020000000C
-rw------- 1 postgres postgres      372 Oct 26 02:12 00000063000000020000000C.00000060.backup
-rw------- 1 postgres postgres 16777216 Oct 26 02:42 00000063000000020000000D
-rw-r----- 1 postgres postgres 16777216 Oct 26 02:52 00000063000000020000000E.partial
-rw-r----- 1 postgres postgres     4338 Oct 26 06:40 00000064.history
-rw------- 1 postgres postgres 16777216 Oct 26 03:22 00000064000000020000000E
-rw------- 1 postgres postgres 16777216 Oct 26 04:22 00000064000000020000000F
-rw------- 1 postgres postgres 16777216 Oct 26 06:22 000000640000000200000010
-rw-r----- 1 postgres postgres 16777216 Oct 26 06:39 000000640000000200000011.partial
-rw-r----- 1 postgres postgres     4383 Oct 26 10:55 00000065.history
-rw------- 1 postgres postgres 16777216 Oct 26 07:10 000000650000000200000011
-rw------- 1 postgres postgres 16777216 Oct 26 10:55 000000650000000200000012.partial
-rw-r----- 1 postgres postgres     4428 Oct 26 14:50 00000066.history
-rw-r----- 1 postgres postgres 16777216 Oct 26 14:50 000000660000000200000012
-rw-r----- 1 postgres postgres     4473 Oct 26 22:59 00000067.history
-rw------- 1 postgres postgres 16777216 Oct 26 15:20 000000670000000200000013
-rw------- 1 postgres postgres 16777216 Oct 26 22:59 000000670000000200000014
-rw------- 1 postgres postgres 16777216 Oct 26 22:59 000000670000000200000015.partial
-rw-r----- 1 postgres postgres     4518 Oct 27 02:51 00000068.history
-rw------- 1 postgres postgres 16777216 Oct 26 23:29 000000680000000200000015
-rw------- 1 postgres postgres 16777216 Oct 27 00:29 000000680000000200000016
-rw------- 1 postgres postgres 16777216 Oct 27 02:12 000000680000000200000017
-rw------- 1 postgres postgres 16777216 Oct 27 02:12 000000680000000200000018
-rw------- 1 postgres postgres      377 Oct 27 02:12 000000680000000200000018.00000060.backup
-rw-r----- 1 postgres postgres 16777216 Oct 27 02:42 000000680000000200000019
-rw------- 1 postgres postgres 16777216 Oct 27 02:50 00000068000000020000001A.partial
-rw-r----- 1 postgres postgres     4563 Oct 27 06:54 00000069.history
-rw------- 1 postgres postgres 16777216 Oct 27 03:21 00000069000000020000001A
-rw------- 1 postgres postgres 16777216 Oct 27 04:21 00000069000000020000001B
-rw------- 1 postgres postgres 16777216 Oct 27 06:21 00000069000000020000001C
-rw------- 1 postgres postgres 16777216 Oct 27 06:54 00000069000000020000001D.partial
-rw-r----- 1 postgres postgres     4608 Oct 27 10:41 0000006A.history
-rw------- 1 postgres postgres 16777216 Oct 27 07:24 0000006A000000020000001D
-rw------- 1 postgres postgres 16777216 Oct 27 10:41 0000006A000000020000001E.partial
-rw-r----- 1 postgres postgres     4653 Oct 27 14:45 0000006B.history
-rw------- 1 postgres postgres 16777216 Oct 27 11:11 0000006B000000020000001E
-rw------- 1 postgres postgres 16777216 Oct 27 14:45 0000006B000000020000001F.partial
-rw-r----- 1 postgres postgres     4698 Oct 27 18:46 0000006C.history
-rw------- 1 postgres postgres 16777216 Oct 27 15:16 0000006C000000020000001F
-rw------- 1 postgres postgres 16777216 Oct 27 18:46 0000006C0000000200000020.partial
-rw-r----- 1 postgres postgres     4743 Oct 27 22:43 0000006D.history
-rw------- 1 postgres postgres 16777216 Oct 27 19:16 0000006D0000000200000020
-rw------- 1 postgres postgres 16777216 Oct 27 22:43 0000006D0000000200000021.partial
-rw-r----- 1 postgres postgres     4788 Oct 28 02:46 0000006E.history
-rw------- 1 postgres postgres 16777216 Oct 27 23:14 0000006E0000000200000021
-rw------- 1 postgres postgres 16777216 Oct 27 23:44 0000006E0000000200000022
-rw------- 1 postgres postgres 16777216 Oct 28 00:14 0000006E0000000200000023
-rw------- 1 postgres postgres 16777216 Oct 28 00:44 0000006E0000000200000024
-rw------- 1 postgres postgres 16777216 Oct 28 02:12 0000006E0000000200000025
-rw------- 1 postgres postgres 16777216 Oct 28 02:12 0000006E0000000200000026
-rw------- 1 postgres postgres      377 Oct 28 02:12 0000006E0000000200000026.00000028.backup
-rw-r----- 1 postgres postgres 16777216 Oct 28 02:46 0000006E0000000200000027
-rw-r----- 1 postgres postgres     4833 Oct 28 06:44 0000006F.history
-rw------- 1 postgres postgres 16777216 Oct 28 03:16 0000006F0000000200000028
-rw------- 1 postgres postgres 16777216 Oct 28 04:16 0000006F0000000200000029
-rw------- 1 postgres postgres 16777216 Oct 28 06:16 0000006F000000020000002A
-rw------- 1 postgres postgres 16777216 Oct 28 06:44 0000006F000000020000002B.partial
-rw-r----- 1 postgres postgres     4878 Oct 28 11:00 00000070.history
-rw------- 1 postgres postgres 16777216 Oct 28 07:14 00000070000000020000002B
-rw------- 1 postgres postgres 16777216 Oct 28 11:00 00000070000000020000002C.partial
-rw-r----- 1 postgres postgres     4923 Oct 28 14:44 00000071.history
-rw------- 1 postgres postgres 16777216 Oct 28 11:30 00000071000000020000002C
-rw-r----- 1 postgres postgres 10047488 Dec  8 11:43 RECOVERYXLOG
drwx------ 2 postgres postgres     4096 Oct 28 14:44 archive_status
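
Which of these segments are actually safe to treat as archived can be read from the archive_status markers next to them; a hedged check (same container-name assumption as above):

kubectl exec -it timescale-timescaledb-0 -c timescaledb -- \
  sh -c 'ls /var/lib/postgresql/wal/pg_wal/archive_status | tail -n 20'
# a .done marker means the segment was archived successfully,
# a .ready marker means the archiver never picked it up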

pgbackrest logs

2021-10-28 02:12:08.207 P01   INFO: backup file /var/lib/postgresql/data/base/13408/688131 (0B, 100%)
2021-10-28 02:12:08.207 P04   INFO: backup file /var/lib/postgresql/data/base/13408/679943 (0B, 100%)
2021-10-28 02:12:08.216 P03   INFO: backup file /var/lib/postgresql/data/base/13408/679945 (0B, 100%)
2021-10-28 02:12:08.221 P01   INFO: backup file /var/lib/postgresql/data/base/13408/679938 (0B, 100%)
2021-10-28 02:12:08.243 P00   INFO: incr backup size = 62.5MB
2021-10-28 02:12:08.243 P00   INFO: execute non-exclusive pg_stop_backup() and wait for all WAL segments to archive
2021-10-28 02:12:08.552 P00   INFO: backup stop archive = 0000006E0000000200000026, lsn = 2/26000138
2021-10-28 02:12:08.560 P00   INFO: check archive for segment(s) 0000006E0000000200000026:0000006E0000000200000026
2021-10-28 02:12:08.836 P00   INFO: new backup label = 20211024-021204F_20211028-021203I
2021-10-28 02:12:09.228 P00   INFO: backup command end: completed successfully (6374ms)

-------------------PROCESS START-------------------
2021-10-29 02:12:11.949 P00   INFO: backup command begin 2.32: --compress-level=3 --compress-type=lz4 --config=/etc/pgbackrest/pgbackrest.conf --exec-id=735-c4f0716d --log-level-console=off --log-level-stderr=warn --pg1-path=/var/lib/postgresql/data --pg1-port=5432 --pg1-socket-path=/var/run/postgresql --process-max=4 --repo1-cipher-type=none --repo1-path=/staging/timescale --repo1-retention-diff=2 --repo1-retention-full=2 --repo1-s3-bucket=<redacted>--repo1-s3-endpoint=fra1.digitaloceanspaces.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=fra1 --repo1-type=s3 --stanza=poddb --start-fast --type=incr
2021-10-29 02:12:12.112 P00   WARN: unable to check pg-1: [DbConnectError] unable to connect to 'dbname='postgres' port=5432 host='/var/run/postgresql'': could not connect to server: No such file or directory
                                        Is the server running locally and accepting
                                        connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5432"?
2021-10-29 02:12:12.114 P00  ERROR: [056]: unable to find primary cluster - cannot proceed
2021-10-29 02:12:12.114 P00   INFO: backup command end: aborted with exception [056]

-------------------PROCESS START-------------------
2021-10-29 02:12:36.946 P00   INFO: backup command begin 2.32: --compress-level=3 --compress-type=lz4 --config=/etc/pgbackrest/pgbackrest.conf --exec-id=737-f99b8091 --log-level-console=off --log-level-stderr=warn --pg1-path=/var/lib/postgresql/data --pg1-port=5432 --pg1-socket-path=/var/run/postgresql --process-max=4 --repo1-cipher-type=none --repo1-path=/staging/timescale --repo1-retention-diff=2 --repo1-retention-full=2 --repo1-s3-bucket=<redacted> --repo1-s3-endpoint=fra1.digitaloceanspaces.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=fra1 --repo1-type=s3 --stanza=poddb --start-fast --type=incr
2021-10-29 02:12:37.080 P00   WARN: unable to check pg-1: [DbConnectError] unable to connect to 'dbname='postgres' port=5432 host='/var/run/postgresql'': could not connect to server: No such file or directory
                                        Is the server running locally and accepting
                                        connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5432"?
2021-10-29 02:12:37.082 P00  ERROR: [056]: unable to find primary cluster - cannot proceed
2021-10-29 02:12:37.082 P00   INFO: backup command end: aborted with exception [056]

-------------------PROCESS START-------------------
2021-10-29 02:13:02.929 P00   INFO: backup command begin 2.32: --compress-level=3 --compress-type=lz4 --config=/etc/pgbackrest/pgbackrest.conf --exec-id=739-63c091be --log-level-console=off --log-level-stderr=warn --pg1-path=/var/lib/postgresql/data --pg1-port=5432 --pg1-socket-path=/var/run/postgresql --process-max=4 --repo1-cipher-type=none --repo1-path=/staging/timescale --repo1-retention-diff=2 --repo1-retention-full=2 --repo1-s3-bucket=<redacted>--repo1-s3-endpoint=fra1.digitaloceanspaces.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=fra1 --repo1-type=s3 --stanza=poddb --start-fast --type=incr
2021-10-29 02:13:03.060 P00   WARN: unable to check pg-1: [DbConnectError] unable to connect to 'dbname='postgres' port=5432 host='/var/run/postgresql'': could not connect to server: No such file or directory
                                        Is the server running locally and accepting
                                        connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5432"?
2021-10-29 02:13:03.062 P00  ERROR: [056]: unable to find primary cluster - cannot proceed
2021-10-29 02:13:03.064 P00   INFO: backup command end: aborted with exception [056]
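
The state of the backup repository can also be inspected independently of the failing backup jobs; a hedged sketch using the pgBackRest CLI, with the sidecar container name assumed to be pgbackrest and the stanza name poddb taken from the command lines above:

kubectl exec -it timescale-timescaledb-0 -c pgbackrest -- pgbackrest --stanza=poddb info

The last full/incremental backup and the maximum archived WAL reported there should line up with the 28.10.2021 conclusion above.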

Additional context: Kubernetes v1.20.2

ChrisTomAlx commented 2 years ago

Ran into the same issue just now. I had clean-reinstalled Timescale using Helm a day back. Earlier this issue didn't arise even when I had it running for almost a month. Pretty curious what could have caused this.

4nte commented 2 years ago

Ran into the same issue just now. I had clean-reinstalled Timescale using Helm a day back. Earlier this issue didn't arise even when I had it running for almost a month. Pretty curious what could have caused this.

Can you share logs, if you still have them, so we get more info on this?

4nte commented 2 years ago

Looking at the pgbackrest logs of the master instance, here is the first failed backup, after which all other backups failed too.

-------------------PROCESS START-------------------
2021-10-29 02:12:11.949 P00   INFO: backup command begin 2.32: --compress-level=3 --compress-type=lz4 --config=/etc/pgbackrest/pgbackrest.conf --exec-id=735-c4f0716d --log-level-console=off --log-level-stderr=warn --pg1-path=/var/lib/postgresql/data --pg1-port=5432 --pg1-socket-path=/var/run/postgresql --process-max=4 --repo1-cipher-type=none --repo1-path=/staging/timescale --repo1-retention-diff=2 --repo1-retention-full=2 --repo1-s3-bucket=<redacted> --repo1-s3-endpoint=fra1.digitaloceanspaces.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=fra1 --repo1-type=s3 --stanza=poddb --start-fast --type=incr
2021-10-29 02:12:12.112 P00   WARN: unable to check pg-1: [DbConnectError] unable to connect to 'dbname='postgres' port=5432 host='/var/run/postgresql'': could not connect to server: No such file or directory
                                        Is the server running locally and accepting
                                        connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5432"?
2021-10-29 02:12:12.114 P00  ERROR: [056]: unable to find primary cluster - cannot proceed
2021-10-29 02:12:12.114 P00   INFO: backup command end: aborted with exception [056]

ChrisTomAlx commented 2 years ago

Can you share logs, if you still have them, so we get more info on this?

Thanks for your quick reply. I suspect that in my case it was something to do with my k8s setup, because I have backups disabled but I have 2 Prometheus instances feeding massive amounts of data into Timescale.

For now, I just clean-reinstalled and increased the sizes of the data (20GB) and WAL (10GB) persistent volumes.

If I run into it again I will update it here asap.
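
For reference, the same mitigation expressed against the chart values; a hedged sketch that assumes the timescaledb-single chart's persistentVolumes keys. Note that a StatefulSet does not resize PVCs that already exist, so this mainly helps fresh installs:

helm upgrade timescale timescale/timescaledb-single \
  --reuse-values \
  --set persistentVolumes.data.size=20Gi \
  --set persistentVolumes.wal.size=10Gi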

nikkhils commented 2 years ago

@4nte you mentioned that one of the replicas has been promoted. That means the earlier master is no longer accessible, and maybe the backup was trying to connect to the old master and is therefore erroring out? Shouldn't the backup connect to the promoted replica, since it should be the new master?

4nte commented 2 years ago

@4nte you mentioned that one of the replicas has been promoted. That means the earlier master is no longer accessible, and maybe the backup was trying to connect to the old master and is therefore erroring out? Shouldn't the backup connect to the promoted replica, since it should be the new master?

timescale-timescaledb-0 is labeled with role=promoted, but running `patronictl topology` gives:

+ Cluster: timescale (uninitialized) ------+---------+---------+-----+-----------+
| Member                    | Host         | Role    | State   |  TL | Lag in MB |
+---------------------------+--------------+---------+---------+-----+-----------+
| + timescale-timescaledb-0 | 10.244.1.170 | Replica | running | 113 |         0 |
| + timescale-timescaledb-1 | 10.244.2.31  | Replica | running |  96 |         0 |
| + timescale-timescaledb-2 | 10.244.0.104 | Replica | running |  96 |         0 |
+---------------------------+--------------+---------+---------+-----+-----------+

So it's fair to suspect that a promotion failed mid-process; I don't know why it happened, though.
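
Once the WAL volume has free space again, Patroni should normally retry the promotion on its own. If a member stays stuck, a hedged recovery sketch (cluster name timescale taken from the patronictl output above):

kubectl exec -it timescale-timescaledb-1 -- patronictl failover timescale
# or, after a leader exists again, rebuild a broken member from it:
kubectl exec -it timescale-timescaledb-0 -- patronictl reinit timescale timescale-timescaledb-0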

abhinav-tm commented 2 years ago

@4nte try this postgresql.fastware.com