timescale / timescaledb-docker-ha

Create Docker images containing TimescaleDB, Patroni to be used by developers and Kubernetes.
Apache License 2.0
155 stars 44 forks source link

IMPORTANT: Missing ssl_ca_file in Spilo Config #435

Closed MarkCupitt closed 8 months ago

MarkCupitt commented 8 months ago

Patroni supports the inclusion of ssl_ca_file so that PostgreSQL can use custom tls certs with root cert validation (we require it for teleport db access, our database is airgapped, so we proxy to it via teleport)

patroni/features/environment.py supports ssl_ca_file patroni/patroni/postgresql/validator.py supports ssl_ca_file helm-charts/charts/timescaledb-single/values.yaml supports ssl_ca_file

However

timescaledb-docker-ha/scripts/configure_spilo.py omits ssl_ca_file

which means that postgres never gets the ca and tls fails for any session that requires tls validation

This makes TimescaleDb unusable in many scenarios

MarkCupitt commented 8 months ago

I'm thinking that adding one line may solve this ...

timescaledb-docker-ha/scripts/configure_spilo.py

    ssl: 'on'
    ssl_cert_file: {{SSL_CERTIFICATE_FILE}}
    ssl_key_file: {{SSL_PRIVATE_KEY_FILE}}
    ssl_ca_file: {{{postgresql.parameters.ssl_ca_file}}}
    shared_preload_libraries: 'bg_mon,pg_stat_statements,pgextwlist,pg_auth_mon'

There may need to be some logic somewhere to exclude it if its not supplied, not sure what postgreSql will do if its set to empty

MarkCupitt commented 8 months ago

So, I built a custom image after making the above changes using:

PG_MAJOR="15" PG_VERSIONS="15.5" TIMESCALEDB_VERSIONS="2.13.0" make build

Tagged it and uploaded it to my Dockerhub, docker pull markcupitt/timescaledb:v1.0.0, loaded it to our Kubernetes cluster and can confirm I am now able to connect via SSL as originally expected.

I will next try it by removing the ca file from the pod, to test what PostgreSQL does when no ca is specified, but this ends up in the postgres parameters

ssl_ca_file: 
MarkCupitt commented 8 months ago

I did this test, and can confirm that PostgreSQL threw a fit :) with the following log entry

2024-01-10 04:11:52 UTC [32]: [659e1908.20-1] @,app= [F0000] FATAL:  could not load root certificate file "/etc/certificate/ca.crt": SSL error code 2147483650

So some logic in there somewhere is obviously required to ensure that parameter is populated with the correct ca or not included

Possibly in patroni/features/environment.py around line 218, remove the entry for ssl_ca_cert if empty or the file is non existent

MarkCupitt commented 8 months ago

Some additional issues with the Helm chart can be found at:

https://github.com/timescale/helm-charts/issues/641 https://github.com/timescale/helm-charts/issues/642 https://github.com/timescale/helm-charts/issues/639

alexeyklyukin commented 8 months ago

The configure_spilo.py is deprecated long ago. Patroni is configured from the helm chart:

https://github.com/timescale/helm-charts/blob/7ded6b654c956a3f6dc119d90b47a0262eba600e/charts/timescaledb-single/values.yaml#L193

That configuration is stored in its own configmap https://github.com/timescale/helm-charts/blob/main/charts/timescaledb-single/templates/configmap-patroni.yaml

and is mounted into the Patroni container https://github.com/timescale/helm-charts/blob/7ded6b654c956a3f6dc119d90b47a0262eba600e/charts/timescaledb-single/templates/statefulset-timescaledb.yaml#L291

using the volume defined from that configmap: https://github.com/timescale/helm-charts/blob/7ded6b654c956a3f6dc119d90b47a0262eba600e/charts/timescaledb-single/templates/statefulset-timescaledb.yaml#L480

alexeyklyukin commented 8 months ago

@MarkCupitt one may need to add support for the ssl_ca_file into the helm chart, but configure_spilo.py is a wrong place to add it. It is still there for compatibility reasons with the zalando postgres-operator, although that compatibility is going to be removed at some point in the future, since the original spilo image will probably drift away from this one, and Patroni supports new parameters that the configure_spilo.py here is not aware of.

I think if one wants to retain this compatibility, the right actions would be moving conifgure_spilo.py and augment_patroni_configuraiton.py into the scripts configmap in the helm chart, decoupling it from this docker image.

MarkCupitt commented 8 months ago

@alexeyklyukin If we include the ssl_ca_file in the helm chart, which we can in the postgres section, it IS included in the configmap, however it gets stripped out and does not end up in the postgres config in the container, I have verified that.

If I add the ssl_ca_file into a running container postgres parameters and reload the config, it works as expected.

SO whatever script runs at container init is the point of failure, I presumed it was the configure_spilo.py, in the container as part of the patroni deployment process but based on what you say, there must be some other process that runs????

Do you know what that is, so I can look at a fix, I have only been able to find the configure_spilo script in the container that seems to access that parameter... I did a global repo search

I don't think this is a helm issue as the chart does what it is supposed to and ssl_ca_file does end up in the config map, and is mounted, it breaks down after that point

alexeyklyukin commented 8 months ago

@MarkCupitt the container init is at https://github.com/timescale/timescaledb-docker-ha/blob/26d75a1e50e543123c6a692e6551ddf95bed870f/timescaledb_entrypoint.sh#L1-L50

unless you specify the SPILO_CONFIGURATION environment variable, which the helm chart doesn't, you'll get straight to Patroni configuration defined in the values.yaml of the helm chart.

In the helm chart values, you have the the certificateSecretName in the helm chart, allowing your to override the secret for mounting certificates into the container, and patroni.bootstrap.dcs.postgresql.parameters where you can reference ssl_ca_file, pointing it to the content you've put into the certificateSecretName secret.

The default behavior, if you don't specify the custom secret in the certificateSecretName, is to write a self-signed one at https://github.com/timescale/helm-charts/blob/7ded6b654c956a3f6dc119d90b47a0262eba600e/charts/timescaledb-single/templates/secret-certificate.yaml

MarkCupitt commented 8 months ago

@alexeyklyukin in a Kubernetes environment adding a Ca in certificateSecretName breaks Patroni. This certificate is used to talk to the Kubernetes API and of course. on a custom cert, kubernetes does not know about it. Patroni DCS uses teh Kubernetes API for its object store. Its the first thing we tried.

The self signed cert does happen, but that cert cannot be used with Teleport per our use case we originally stated

ssl_ca_file in postgres parameters is REMOVEd by patroni, I gave you the code where that happened

This is going in circles .. we will have to find alternatives I guess .. an adjust manage our own builds ..

Bottom line is that the container needs to be fixed, we proved it and have it running in our own build, I gave the changes we made to get it working .. the only consideration is if a ca file is not specified, then some logic is required to remove the ssl_ca_file form the config or postgres will break, I suspect that people believe that certificateSecretName is the way to do it, its not, it breaks Patroni in a Kubernetes environment as I stated above

MarkCupitt commented 8 months ago

@alexeyklyukin our decision at this stage is that until timescaledb can fix the containers so we can specify a ca, we have no choice but to maintain our own builds (Which we are using now) ... We will drop the helm deployment and use a manifest based deployment using KLUCTL as the deployment tool. We will make our code and configs available in our GIthub Repos for anyone else who needs it, once we have it looking respectable

We will happily go back to official builds once the container can be given a Ca in the postgres parms section of the patroni config and it's not removed, unfortunately, we need to progress our project, our timelines are slipping significantly now

MarkCupitt commented 8 months ago

@alexeyklyukin if you feel that there is nothing wrong with the container or the patroni script , then please close this ticket, The container is unusable for us the way it is right now ..

However I really hope you guys will look at this issue, replicate it, and fix it so we can go back to the official builds OR tell us what we need to do to supply our own ca if its is indeed NOT the patroni script in the container that is the root cause of this issue