Improve the failover process for Replica

JuliuszJ commented 3 years ago

Hi Alex, many thanks for your amazing work on Kubegres. I tried to test a crash of the secondary Postgres by conducting 2 separate tests. In the 1st test I scaled down the STS of the secondary Postgres to zero. In the 2nd test I stopped the k3d node on which that STS was running. Unfortunately, in both tests nothing happened. It would be great if Kubegres started a new instance of the secondary Postgres in that case to restore the desired state. Regards, Juliusz.

alex-arica commented 3 years ago

Hi Juliusz,

Thank you for your kind feedback about Kubegres.

In regards to the 1st test, how did you scale down STS of secondary Postgres to zero? Did you delete the STS?

In the second test, which command did you use to stop the k3d node?

Once I have those details, I will try reproducing the issue.

alex-arica commented 3 years ago

Hi Juliusz, since I have not heard from you after 2 days, I am closing this issue. Please re-open it when you have more details to provide, as per my previous message.

JuliuszJ commented 3 years ago

Hi Alex, in the 1st test I only scaled the STS down to 0; I did not delete it. In the 2nd test I issued a docker stop command to stop the specific container running the k3d node.

JuliuszJ commented 3 years ago

BTW, as I am not a repo collaborator, I cannot reopen my issue.

alex-arica commented 3 years ago

Thank you for those details.

What command exactly did you use to scale the STS down to 0? Did you edit the STS and set its replicas to 0?

And once you had done the above, what were the logs of the Kubegres controller?

JuliuszJ commented 3 years ago

kubectl scale sts my-instance-of-kubegres-1 --replicas 0

alex-arica commented 3 years ago

Kubegres is an operator which connects to the Kubernetes API. Kubernetes notifies Kubegres when actions are performed on STS, Pods and Services.

The reason why I need the controller's logs is to make sure that, when you ran the command kubectl scale sts my-instance-of-kubegres-1 --replicas 0, Kubernetes notified Kubegres about that spec change.

The same applies to the docker stop command used to stop the specific container running the k3d node.

If Kubernetes does not notify Kubegres, there is nothing we can do.
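
One way to collect the controller logs requested above is sketched below. The namespace `kubegres-system`, the deployment name `kubegres-controller-manager` and the container name `manager` are assumptions based on a default Kubegres install and are not confirmed in this thread; `controller_logs` and the `KUBEGRES_*` variables are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Sketch only: the default names below are assumptions, override them
# through the KUBEGRES_* variables if your install differs.

# Print the Kubegres controller logs from the last N minutes/hours.
# Usage: controller_logs [since]    e.g. controller_logs 5m
controller_logs() {
  since="${1:-10m}"
  kubectl logs \
    -n "${KUBEGRES_NS:-kubegres-system}" \
    "deployment/${KUBEGRES_DEPLOY:-kubegres-controller-manager}" \
    -c "${KUBEGRES_CONTAINER:-manager}" \
    --since="$since"
}
```

Running `controller_logs 5m` right after the `kubectl scale` command should show whether the controller reacted to the spec change at all.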

JuliuszJ commented 3 years ago

Actually, the command kubectl scale sts my-instance-of-kubegres-1 --replicas 0 stops the STS which runs the primary DB, and Kubegres works great in that case: 1. the secondary DB is promoted, 2. a new secondary is created, 3. the STS with the old primary is deleted. The problem occurs when I issue kubectl scale sts my-instance-of-kubegres-2 --replicas 0 to stop the STS which runs the secondary DB. I expected that a new secondary would be created, but that did not happen. How can I gather the requested logs?
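
The reproduction described above can be sketched as two small helpers: one that scales a StatefulSet to zero and one that reports its desired and ready replica counts, so the state can be compared before and after. The helper names `stop_sts` and `sts_replicas` are hypothetical; the StatefulSet name in the usage note is the one quoted in this thread.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative, not part of Kubegres.

# Scale one StatefulSet to zero, as in the first test.
stop_sts() {
  kubectl scale sts "$1" --replicas 0
}

# Print "<desired>/<ready>" replicas for a StatefulSet.
sts_replicas() {
  sts_name="$1"
  desired=$(kubectl get sts "$sts_name" -o jsonpath='{.spec.replicas}')
  ready=$(kubectl get sts "$sts_name" -o jsonpath='{.status.readyReplicas}')
  # readyReplicas is omitted from the status when it is zero,
  # so an empty value is reported as 0.
  echo "${desired:-0}/${ready:-0}"
}
```

For example, `stop_sts my-instance-of-kubegres-2` followed by `sts_replicas my-instance-of-kubegres-2` shows whether the secondary's StatefulSet stays at zero, and `kubectl get sts` shows whether a replacement StatefulSet has appeared.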

alex-arica commented 3 years ago

Thanks to you, I have sufficient information to investigate this issue. I will have time to investigate it tomorrow.

We have a set of automated tests to simulate failover. Those tests trigger it by deleting either a Pod or a StatefulSet. Perhaps we missed one use case. Let's see.

alex-arica commented 3 years ago

@JuliuszJ I released a "beta" version in the main branch which fixes the issue that you reported. To install and test it, please run:

kubectl apply -f https://raw.githubusercontent.com/reactive-tech/kubegres/main/kubegres.yaml

Please let me know if it works for you. Once you have confirmed it, I will release a new version of Kubegres.

alex-arica commented 3 years ago

@JuliuszJ Do you think you could help by testing this change today? I am planning to release it on Wednesday.

All you have to do is to conduct the 2 tests that you mentioned in your initial message when you created this issue.

I released a "beta" version in the main branch which fixes the issue that you reported. To install and test it, please run:

kubectl apply -f https://raw.githubusercontent.com/reactive-tech/kubegres/main/kubegres.yaml

Please let me know if it works for you. Once you have confirmed it, I will release a new version of Kubegres.

JuliuszJ commented 3 years ago

@alex-arica @ylck I ran both of my tests and they finished successfully. Thank you very much! However, after the test that stops the k3d node I noticed something strange. After executing the command kubectl exec -it my-kubegres-instance-6-0 -- /bin/bash I got the message: Defaulted container "my-kubegres-instance-6" out of: my-kubegres-instance-6, setup-replica-data-directory (init). It seems that the sidecar (init) container setup-replica-data-directory is still running. This did not happen when I scaled the STS to 0.
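
To the best of my knowledge, the "Defaulted container" notice lists the containers declared for the pod, so on its own it may not mean that the init container is still running. A minimal sketch for checking the recorded state of each init container follows; the helper names `init_container_reasons` and `init_containers_done` are hypothetical, and the pod name in the usage note is the one from the message above.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative.

# Print one "name=reason" line per init container. The reason is
# "Completed" once the init container has terminated successfully
# and empty while it is still waiting or running.
init_container_reasons() {
  kubectl get pod "$1" \
    -o jsonpath='{range .status.initContainerStatuses[*]}{.name}={.state.terminated.reason}{"\n"}{end}'
}

# Succeed only if no init container is in a state other than Completed.
init_containers_done() {
  ! init_container_reasons "$1" | grep -v '=Completed$' | grep -q .
}
```

For example, `init_containers_done my-kubegres-instance-6-0 && echo "init containers finished"` prints the message only when every init container has completed.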

alex-arica commented 3 years ago

Thank you for checking. What do you see in the logs of that pod?

JuliuszJ commented 3 years ago

on new primary:
02/11/2021 10:03:11 - Attempting to promote a Replica PostgreSql to Primary...
02/11/2021 10:03:11 - Promoting by creating the promotion trigger file: '/data/pgdata/promote_replica_to_primary.log'

on new secondary:
02/11/2021 10:04:24 - Attempting to copy Primary DB to Replica DB...
ls: cannot access '/data/pgdata': No such file or directory
02/11/2021 10:04:24 - Copying Primary DB to Replica DB folder: /data/pgdata
02/11/2021 10:04:24 - Running: pg_basebackup -R -h rbd-citus-coord -D /data/pgdata -P -U replication;
waiting for checkpoint
0/27542 kB (0%), 0/1 tablespace
11683/27542 kB (42%), 0/1 tablespace
27552/27552 kB (100%), 0/1 tablespace
27552/27552 kB (100%), 1/1 tablespace
02/11/2021 10:04:24 - Copy completed

It seems the init containers finished their job.

alex-arica commented 3 years ago

Thank you. Yes, it managed to copy its data from the primary.

Is there anything else in the logs saying that that replica pod is streaming data from primary pod?

alex-arica commented 3 years ago

Are you able to connect to that replica and run SQL queries?

And is there anything else in the logs saying that that replica pod is streaming data from primary pod?

JuliuszJ commented 3 years ago

> Are you able to connect to that replica and run SQL queries?

yes

> And is there anything else in the logs saying that that replica pod is streaming data from primary pod?

I am away from my test environment right now, so I cannot check the logs. However, after my tests I checked replication by creating a table on the primary DB, and it was successfully replicated to the secondary DB.
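
Besides creating a test table, replication can also be checked from PostgreSQL itself. The sketch below assumes that psql inside the pod can authenticate as user `postgres` without a prompt (for instance through a password supplied in the container environment), which this thread does not confirm. The helper names `pod_psql`, `is_replica` and `streaming_replicas` and the `PG_USER_NAME` variable are hypothetical.

```shell
#!/bin/sh
# Sketch only: authentication inside the pod is an assumption.

# Run one SQL statement inside a pod and print the bare result.
pod_psql() {
  pod="$1"; shift
  kubectl exec "$pod" -- psql -U "${PG_USER_NAME:-postgres}" -At -c "$1"
}

# Succeed if the instance in the pod is running as a standby.
is_replica() {
  [ "$(pod_psql "$1" 'SELECT pg_is_in_recovery();')" = "t" ]
}

# Print how many standbys are currently streaming from the primary pod.
streaming_replicas() {
  pod_psql "$1" "SELECT count(*) FROM pg_stat_replication WHERE state = 'streaming';"
}
```

Call `is_replica` with the name of the secondary pod and `streaming_replicas` with the name of the primary pod; a count of at least 1 on the primary suggests that a standby is streaming from it.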

alex-arica commented 3 years ago

Thank you for your kind help with testing. If the init container keeps hanging, we can open a new issue about it and I will investigate it.

I will release a new version of Kubegres this evening London time.

JuliuszJ commented 3 years ago

Thank you very much for quick fix.

alex-arica commented 3 years ago

Kubegres version 1.13 is available with the changes that we discussed in this issue.

Please see the release page: https://github.com/reactive-tech/kubegres/releases/tag/v1.13

Thank you @JuliuszJ for your help!

To install Kubegres 1.13, please run:

kubectl apply -f https://raw.githubusercontent.com/reactive-tech/kubegres/v1.13/kubegres.yaml

I am closing this issue.

reactive-tech / kubegres

Improve the failover process for Replica #60