Fix CNPG failover - Githubissues

samcday commented 3 months ago

Originally this issue was "Remove CNPG". It's now "Fix CNPG because there's literally no other option except to completely hand roll a Postgres deployment from scratch and I'd rather punch myself repeatedly in the crotch than do this".

Postgres + Kubernetes is cursed.

First, there was the dumpster fire that Zalando inflicted on the OSS world with their comically bad operator. Then there was the CrunchyData one, which was so tragically documented and maintained I think it might actually be some kind of military psy-op posing as a software project.

CNPG was a breath of fresh air, because it actually works reasonably well on the happy path, didn't make a dog's breakfast of backup/restore, and has good documentation.

Unfortunately, in practice, CNPG is also terrible. A light breeze knocks clusters over, and replicas constantly end up in a broken state that requires manual remediation.

The final straw was today when I went around the cluster, replacing a bunch of ethernet cables with some nice short-length ones. Just yanking and reconnecting some cables was enough to bring down several of the DB clusters.

I could dig into this, figure out a reproducible test case, and contribute that (and maybe even a fix) upstream. That would be the right thing to do. I don't want to fucking do the right thing here. I just want a Postgres database running in Kubernetes that is reliable and backed up.

I think the best approach will be to just build a handful of DB clusters with the bitnami Helm chart and accept that they'll need some occasional petting.

samcday commented 3 months ago

So the proverbial straw is a known issue, at least. Though in some ways that's worse, because this is a critical flaw in the operator that was first raised more than 4 months ago. It still has not had any acknowledgement from the maintainers.

As far as I'm concerned, this is a smoking gun that demonstrates CNPG is, sadly, dead. Or at least, it's dead to me. 6 months of bashing my head against a brick wall is long enough, thank you very much.

Still, I'm somewhat optimistic. I was having similar problems several years ago with my first forays into the Zalando/CrunchyData shitshow. CNPG was a major improvement over the status quo. I will hold out hope that the next iteration in this space yields an operator that is actually worth relying on :crossed_fingers:

samcday commented 3 months ago

Ugh. Turns out the bitnami postgres-ha chart is also utter garbage.

The ergonomics are pretty terrible, but I'm used to that kind of abuse with Bitnami Helm charts. The way it does user management is particularly awful though - you have to repeat the users/passwords in both the postgres instances and the pgpool deployment. The users and passwords have to be comma/semicolon delimited. My brain literally cannot.

The kicker is that this thing isn't remotely "HA". If you do a rolling restart of the postgres nodes, the frontend pgpools shit the bed indefinitely until you manually rollout restart them. This was first reported in July 2022, ignored the entire time, and raised several more times, where it was also ignored.

Holy moly. I'm really struggling to accept just how sad the state of the k8s ecosystem is for Postgres. Postgres deserves so much more than this :(

samcday commented 3 months ago

I guess the outcome of this day is there is no alternative but to be a good citizen, roll up my sleeves, and figure out wtf is up with CNPG. God damnit.

samcday / home-cluster

Fix CNPG failover #680