timescale / helm-charts

Configuration and Documentation to run TimescaleDB in your Kubernetes cluster
Apache License 2.0

ERROR: Error creating replica using method pgbackrest: /etc/timescaledb/scripts/pgbackrest_restore.sh exited with code=1 #317

Open imranrazakhan opened 2 years ago

imranrazakhan commented 2 years ago

We have the following environment:

# kubectl -n dev logs -f timescaledb-0 -c timescaledb
2021-12-04 21:55:53,775 ERROR: Error creating replica using method pgbackrest: /etc/timescaledb/scripts/pgbackrest_restore.sh exited with code=1
2021-12-04 21:55:53,775 ERROR: failed to bootstrap (without leader)
2021-12-04 21:56:04,206 ERROR: Error creating replica using method pgbackrest: /etc/timescaledb/scripts/pgbackrest_restore.sh exited with code=1
2021-12-04 21:56:04,207 ERROR: failed to bootstrap (without leader)
2021-12-04 21:56:14,207 ERROR: Error creating replica using method pgbackrest: /etc/timescaledb/scripts/pgbackrest_restore.sh exited with code=1
2021-12-04 21:56:14,207 ERROR: failed to bootstrap (without leader)
2021-12-04 21:56:24,207 ERROR: Error creating replica using method pgbackrest: /etc/timescaledb/scripts/pgbackrest_restore.sh exited with code=1
2021-12-04 21:56:24,207 ERROR: failed to bootstrap (without leader)
2021-12-04 21:56:34,207 ERROR: Error creating replica using method pgbackrest: /etc/timescaledb/scripts/pgbackrest_restore.sh exited with code=1

I logged into the timescaledb pod and checked the Patroni status. This is the first instance, so why is its role replica rather than master?

$ patronictl list
+ Cluster: yq (uninitialized) +---------+---------+----+-----------+
| Member        | Host        | Role    | State   | TL | Lag in MB |
+---------------+-------------+---------+---------+----+-----------+
| timescaledb-0 | 10.244.0.78 | Replica | stopped |    |   unknown |
+---------------+-------------+---------+---------+----+-----------+

davidandreoletti commented 2 years ago

@imranrazakhan Deleting the k8s services that this chart created in the previous helm deployment (the load balancer and node IP related ones, depending on your config) should resolve this issue, as it did for me.

imranrazakhan commented 2 years ago

@davidandreoletti Thanks for the update, I will check this. Can we have more insight into why we have to delete the services? Is it related to the endpoints? I checked the ep YAML file but couldn't find any hint about what is stopping us from doing a clean start.

jholm117 commented 2 years ago

Having the same issue. Deleting the resources from the previous helm deployment did not solve the issue for me.

bleggett commented 2 years ago

Same issue - and confirmed no resources left in cluster from previous install.

jholm117 commented 2 years ago

> Having the same issue. Deleting the resources from the previous helm deployment did not solve the issue for me.

I was able to get this working eventually; it's possible I had missed cleaning up an endpoint or something.

imranrazakhan commented 2 years ago

@jholm117 @davidandreoletti We can fix the issue by deleting just one ep (Endpoints object), named like clustername-config, where clustername is the name provided during helm installation.
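
A minimal sketch of that cleanup, assuming the namespace is `dev` (as in the log command at the top of this issue) and using `my-release` as a purely hypothetical release name. The helper names are illustrative, not part of the chart. By default the script only prints the `kubectl` command; it runs it only when `APPLY=1` is set.

```shell
#!/bin/sh
# Sketch only: NAMESPACE and RELEASE are assumed values, adjust them to your install.
NAMESPACE="${NAMESPACE:-dev}"
RELEASE="${RELEASE:-my-release}"   # hypothetical name given at `helm install`

# Build the name of the config Endpoints object: <clustername>-config.
config_endpoint_name() {
  printf '%s-config\n' "$1"
}

# Print the delete command by default; execute it only when APPLY=1.
delete_config_endpoint() {
  ep="$(config_endpoint_name "$RELEASE")"
  if [ "${APPLY:-0}" = "1" ]; then
    kubectl -n "$NAMESPACE" delete endpoints "$ep"
  else
    echo "kubectl -n $NAMESPACE delete endpoints $ep"
  fi
}

delete_config_endpoint
```

Running it once without `APPLY=1` makes it easy to check the endpoint name before anything is deleted.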

veereshhalagegowda commented 2 years ago

I am still seeing this issue after using a different release name and deleting the older endpoints. It just stops suddenly after some time. Any other solutions would be greatly appreciated. Thanks.

jleni commented 1 year ago

Same here. It is happening in the latest release, 0.27.4. It resolves automatically after a few minutes.

jprecuch commented 1 year ago

Same issue here. It happens on the latest release, 0.27.5, as well. It would be good to see this finally fixed.

jfaldanam commented 1 year ago

Same issue here. Moving the deployment to a new namespace solved it temporarily for me.

JohnTzoumas commented 1 year ago

Removing endpoints from a previous helm deployment solved it for me.

ayeks commented 1 year ago

@JohnTzoumas thanks a lot! I have the same issue with the latest version.

I am testing disaster recovery right now and killed all PVCs and pods. The startup of the new timescale pod stops at:

timescaledb 2023-05-25 11:53:56,422 ERROR: Error creating replica using method pgbackrest: /etc/timescaledb/scripts/pgbackrest_restore.sh exited with code=1

When I delete the 4 endpoints, the recovery runs through.
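
The endpoint names are not listed here, so the following is only a sketch of one way to find them, assuming they are all named after the release (the release name itself, or the release name followed by `-something`). `my-release`, `NAMESPACE`, and `filter_release_endpoints` are hypothetical names used for illustration. The example feeds canned input to the filter so it can be tried without a cluster.

```shell
#!/bin/sh
# Sketch only: RELEASE and NAMESPACE are assumptions; the four endpoint names
# are not given in this thread, so we match on the release-name prefix.
RELEASE="${RELEASE:-my-release}"
NAMESPACE="${NAMESPACE:-dev}"

# Keep only `endpoints/<name>` lines where <name> is the release name
# or starts with "<release>-".
filter_release_endpoints() {
  grep -E "^endpoints/$1(-|\$)"
}

# Canned input standing in for: kubectl -n "$NAMESPACE" get endpoints -o name
printf 'endpoints/my-release\nendpoints/my-release-config\nendpoints/other-app\n' \
  | filter_release_endpoints "$RELEASE"

# Against a real cluster, review the list first, then delete:
#   kubectl -n "$NAMESPACE" get endpoints -o name \
#     | filter_release_endpoints "$RELEASE" \
#     | xargs kubectl -n "$NAMESPACE" delete
```

Matching on `-` or end of name avoids catching unrelated endpoints that merely share a prefix with the release name.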

MSandro commented 8 months ago

I have the same issue, but in my case I have disabled persistent storage, because in our dev environment we would like to clean the db by just restarting the container. I have also tried setting patroni.postgresql.pgbackrest.keep_data = false, but it had no effect.