Manually running `/etc/timescaledb/scripts/restore_or_initdb.sh` or `/etc/timescaledb/scripts/pgbackrest_restore.sh` by hand seems to work :/ (i.e. it starts the restore, but this is definitely not the way of doing it, and bad things happen afterwards...)
It seems like the `exit 1` comes from here:

```bash
[ "${PGBACKREST_BACKUP_ENABLED}" = "true" ] || exit 1
```
Commenting out this line results in the backup being restored from the archive on startup with a new volume. `PGBACKREST_BACKUP_ENABLED` does not seem to be `true` on startup, but it is set in a shell after entering the running container?

Edit: this restore results in a non-working standby node. Idk what I am doing wrong.
Need help...
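For anyone trying to reproduce this: the guard only passes if the variable is literally `true` in the environment the script itself runs with, which can differ from what an interactive shell shows. A rough sketch (pod and container names are placeholders):

```bash
# What an interactive shell inside the container sees (placeholder names):
kubectl exec -it timescaledb-0 -c timescaledb -- \
  bash -c 'echo "PGBACKREST_BACKUP_ENABLED=${PGBACKREST_BACKUP_ENABLED:-<unset>}"'

# The guard in the restore script is equivalent to this: if the variable is
# unset or empty when Patroni invokes the script, it exits 1 and Patroni
# treats the pgbackrest replica creation method as failed.
[ "${PGBACKREST_BACKUP_ENABLED}" = "true" ] || exit 1
```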
@paulfantom I see a lot of backup/restore related commits from you in December. Maybe something changed that I do not understand? Or maybe there is a regression?
+1
+1
Funny, you're posting this just at the moment when I'm struggling with this again and have no clue what's going on. I don't know how much time I have to investigate, because I am not really getting paid for this stuff anymore. Do you mind elaborating on your +1?
Essentially I have the same issue: backups are not being triggered on the replicas even though `PGBACKREST_BACKUP_ENABLED: true` is set:
```
2023-06-02 08:24:52 - bootstrap - Waiting for PostgreSQL to become available
2023-06-02 08:25:05 - bootstrap - Starting pgBackrest api to listen for backup requests
2023-06-02 08:25:05,382 - INFO - backup - Starting loop waiting for backup events
2023-06-02 08:25:06,384 - INFO - history - Refreshing backup history using pgbackrest
2023-06-02 08:25:06,384 - DEBUG - backup - Waiting until backup triggered
```
The backup jobs themselves (to S3) are also running continuously (but I'm not sure whether that's an issue):
```
HTTP/1.0 202 Accepted
Server: BaseHTTP/0.6 Python/3.10.6
Date: Fri, 02 Jun 2023 00:18:02 GMT
Location: /backups/backup/20230602001801
Content-Type: application/json

{
  "age": 1.0,
  "duration": 1.0,
  "finished": null,
  "label": "20230602001801",
  "pgbackrest": {},
  "pid": 530,
  "returncode": null,
  "started": "2023-06-02T00:18:01+00:00",
  "status": "RUNNING"
}
```
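For context, those 202 responses come from the HTTP listener started by the bootstrap script above. You can poke it by hand, roughly like this (the pod name and port 8081 are assumptions from my setup, verify against yours):

```bash
# Trigger a backup through the sidecar's HTTP API (names/port are assumptions):
kubectl exec -i timescaledb-0 -c timescaledb -- \
  curl -s -i -X POST http://localhost:8081/backups/

# Poll the job that the 202's Location header points at:
kubectl exec -i timescaledb-0 -c timescaledb -- \
  curl -s http://localhost:8081/backups/backup/20230602001801
```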
I don't see how your issue is related. This ticket is about creating replicas, but you're talking about making backups. Furthermore, it looks like your backup job is running. Backups always run on the master, never on replicas.
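If in doubt which member is currently the leader, `patronictl` will tell you (pod/container names are placeholders):

```bash
kubectl exec timescaledb-0 -c timescaledb -- patronictl list
```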
According to the docs, every new replica will attempt to restore from an S3 backup (if available), but on creation of the pod I get this:
```
2023-06-02 06:54:43,219 ERROR: Error creating replica using method pgbackrest: /etc/timescaledb/scripts/pgbackrest_restore.sh exited with code=1
/var/run/postgresql:5432 - no response
```
So the replicas aren't being created using pgbackrest, because `/etc/timescaledb/scripts/pgbackrest_restore.sh` exited with code=1. The replica is running and accepting connections, but I now assume there is a sync issue between replica and master.
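If you want to confirm which path Patroni actually took for a replica, grepping its logs for the replica creation messages works (pod/container names are placeholders):

```bash
# Shows the failed pgbackrest attempt and whether basebackup was used instead:
kubectl logs timescaledb-1 -c timescaledb | \
  grep -E 'creating replica|replica using method|basebackup'
```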
Yes, that looks more like the problem. I can confirm this behaviour here as well. The replica creates itself from the master, or from another replica if available, as a fallback. This takes way longer (in my case), which can be a pain on large instances.
Turned out that I forgot to bump my chart; I am reverting my local changes to fix that issue. Testing with the most recent one now. So it is supposed to be fixed in 0.33.2. Which chart version are you using?
Oof, they never released 0.33.2... :( This project seems really chaotic. Hang on, I will provide a workaround...
To use the most recent changes with chart 0.33.1, add the following to your values.yml:
```yaml
patroni:
  postgresql:
    pgbackrest:
      command: /home/postgres/pgbackrest_restore.sh
    recovery_conf:
      restore_command: /home/postgres/pgbackrest_archive_get.sh %f "%p"

# [..]

debug:
  execStartPre: curl -o /home/postgres/pgbackrest_restore.sh https://raw.githubusercontent.com/timescale/helm-charts/main/charts/timescaledb-single/scripts/pgbackrest_restore.sh && chmod +x /home/postgres/pgbackrest_restore.sh && curl -o /home/postgres/pgbackrest_archive_get.sh https://raw.githubusercontent.com/timescale/helm-charts/main/charts/timescaledb-single/scripts/pgbackrest_archive_get.sh && chmod +x /home/postgres/pgbackrest_archive_get.sh
```
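After the pod restarts, you can sanity-check that the scripts were actually fetched and made executable (pod/container names are placeholders):

```bash
kubectl exec timescaledb-0 -c timescaledb -- \
  ls -l /home/postgres/pgbackrest_restore.sh /home/postgres/pgbackrest_archive_get.sh
```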
@mathisve why so sloppy? :(
Also see https://github.com/timescale/helm-charts/issues/596
Don't leave the community behind....
I'm using 0.33.1, will test your changes now. Shame if these charts aren't being supported anymore, tbh.
```
2023-06-02 10:49:30.732 P00 INFO: restore command begin 2.43: --config=/etc/pgbackrest/pgbackrest.conf --delta --exec-id=29-3b5be8d6 --force --link-all --log-level-console=detail --pg1-path=/var/lib/postgresql/data --process-max=4 --repo1-cipher-type=none --repo1-path=/xxxxxx/timescaledb/ --repo1-s3-bucket=timescaledb-xxxx-backups --repo1-s3-endpoint=s3.amazonaws.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=eu-west-1 --repo1-type=s3 --spool-path=/var/run/postgresql --stanza=poddb
WARN: --delta or --force specified but unable to find 'PG_VERSION' or 'backup.manifest' in '/var/lib/postgresql/data' to confirm that this is a valid $PGDATA directory. --delta and --force have been disabled and if any files exist in the destination directories the restore will be aborted.
2023-06-02 10:49:30.868 P00 INFO: repo1: restore backup set 20230602-065746F, recovery will start at 2023-06-02 06:57:46
2023-06-02 10:49:30.868 P00 DETAIL: check '/var/lib/postgresql/data' exists
2023-06-02 10:49:30.869 P00 DETAIL: check '/var/lib/postgresql/wal/pg_wal' exists
...
2023-06-02 10:49:41.954 P00 DETAIL: sync path '/var/lib/postgresql/data/pg_wal/archive_status'
2023-06-02 10:49:41.954 P00 DETAIL: sync path '/var/lib/postgresql/data/pg_xact'
2023-06-02 10:49:41.955 P00 INFO: restore global/pg_control (performed last to ensure aborted restores cannot be started)
2023-06-02 10:49:41.956 P00 DETAIL: sync path '/var/lib/postgresql/data/global'
2023-06-02 10:49:41.957 P00 INFO: restore size = 39.8MB, file total = 1745
2023-06-02 10:49:41.958 P00 DETAIL: statistics: {"http.client":{"total":1},"http.request":{"total":2},"http.session":{"total":1},"socket.client":{"total":1},"socket.session":{"total":1},"tls.client":{"total":1},"tls.session":{"total":1}}
2023-06-02 10:49:41.958 P00 INFO: restore command end: completed successfully (11229ms)
```
That seems to have done it for me @mindrunner 👍
It doesn't make sense to me why the new script works but the existing one (`/etc/timescaledb/scripts/pgbackrest_restore.sh`) doesn't, when the only change I can see is that the `[ "${PGBACKREST_BACKUP_ENABLED}" = "true" ] || exit 1` condition has been moved to the top?
See PR for explanation
Ah, I see: sourcing the env_file is required to access `PGBACKREST_BACKUP_ENABLED`, which must otherwise default to false.
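So conceptually the fixed script should do something like this (a sketch, not the actual chart code; the env-file path and stanza default are assumptions):

```bash
#!/bin/bash
# Load the pod environment first. Without this, PGBACKREST_BACKUP_ENABLED is
# empty when Patroni invokes the script, so the guard below always fails and
# Patroni reports "exited with code=1".
ENV_FILE=/etc/timescaledb/env            # assumed location
[ -f "${ENV_FILE}" ] && . "${ENV_FILE}"

# Defaulting to false means the restore is skipped unless explicitly enabled.
[ "${PGBACKREST_BACKUP_ENABLED:-false}" = "true" ] || exit 1

exec pgbackrest --stanza="${PGBACKREST_STANZA:-poddb}" restore
```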
**What did you do?**
When starting a pod with empty storage, it usually restores from the Azure backup. This works without issues for the first pod in the statefulset after installing the chart. However, on every subsequent pod, I only see the `exited with code=1` error, and Patroni restores the database with `basebackup` instead, which comes with several downsides (e.g. it is very slow compared to pgbackrest, error-prone, the WAL volume fills up, etc.).
This used to be different. Every new pod was restored by pgbackrest without any issues.
I am not sure if a config change on my side is the problem or if the chart might have a bug.
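For reference, the replica method order can be checked in the rendered Patroni config inside the pod (the config path is an assumption and may differ per chart version):

```bash
kubectl exec timescaledb-0 -c timescaledb -- \
  grep -A4 create_replica_methods /etc/timescaledb/patroni.yaml
```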
**Environment**

values.yaml