opensearch-project / opensearch-k8s-operator

OpenSearch Kubernetes Operator
Apache License 2.0

Support renewal of generated certificates #399

Open swoehrl-mw opened 1 year ago

swoehrl-mw commented 1 year ago

The operator can generate its own self-signed certificates to use for the opensearch pods. However the operator does not have functionality to renew the certificates once they expire after a year.

The operator should check during reconcile runs if any certificates are about to expire and renew them if needed. After renewal the operator needs to do a rolling restart of the opensearch pods so they pick up the new certificates.

rursprung commented 1 year ago

i'm not sure if OpenSearch Security already has this feature (the documentation for OpenSearch is still incomplete), but at least Search Guard supports TLS certificate hot reloading. if OpenSearch Security supports this (or support is added for it) then the operator could use the hot reload API to trigger the re-load.

do you have plans on how to handle the rollover of the CA (which will also expire at some point)?

IMHO this correlates a bit with #141 as cert-manager could take care of some things (though the triggering of the hot reload would still have to be done as cert-manager isn't aware of OpenSearch)

swoehrl-mw commented 1 year ago

@rursprung

then the operator could use the hot reload API to trigger the re-load.

Sounds like a good idea. Although we might still need to implement a restart variant for older opensearch versions.

do you have plans on how to handle the rollover of the CA (which will also expire at some point)?

No idea yet. Maybe something where a new CA is generated ahead of time and the certificate is signed by both CAs for a while, to give services time to switch out their CA. It depends a bit on whether clients actually use the CA cert to verify connections. Suggestions are always welcome.

rursprung commented 1 year ago

Depends a bit on if clients actually use the CA cert to verify connections. Suggestions are always welcome.

AFAIK the nodes do for node-to-node communication. not sure about clients (i guess "it depends" is the proper answer, though i'd expect that they do by default nowadays)

neeraj-n-singh commented 1 year ago

Proposed Solution: We will create a secret based on the existing flag that controls whether we need a per-node certificate or a single certificate for all nodes. In the single-certificate case, we will create a certificate object and then map the secret created from that object into the OpenSearch custom resource. In the per-node case, we will generate multiple certificates and merge them into a single secret, say node-cert-merged (using custom code with a watcher on it); whenever any certificate changes, we will sync node-cert-merged.

Solutions:

  1. We can have a node-cert-merged reconciler that manages (watches and updates) the merged secret.
  2. We can have a webhook that watches the secret events and updates the master secret accordingly.

rursprung commented 1 year ago

and merge them into a single secret

i'd suggest to ask some security experts for their opinion on this. i doubt that they'll be happy with the private key for one node being visible to another node. i don't think that it's super critical in this case, but it definitely goes against the best practices for private/public key usage (where you never, ever give anyone else access to your private key).

Alwinius commented 1 year ago

Hi, Thanks for thinking about this issue. This is a very important topic for us since an expired certificate will break the whole OS cluster. Our current workaround is to create certificates with long expiry dates manually, but since transport encryption is only within the cluster, I would like to do no manual steps at all. When using the certificates generated by the operator, is there a way to trigger a renew manually? Like for example removing config from OpenSearchCluster manifest (so that demo certificates are used) and then adding it again?

PS: Hi @swoehrl-mw we worked together a long time ago at MW, nice to see you again :)

swoehrl-mw commented 1 year ago

Hi @Alwinius

is there a way to trigger a renew manually?

Without having tested it: If you delete the <cluster-name>-transport-cert and <cluster-name>-http-cert secrets the operator should generate new ones during the next reconcile run (so after 30 seconds). Afterwards you would need to get the operator to do a rolling restart (for example by adding a dummy change to the config). Theoretically this should work without downtime.

we worked together a long time ago at MW, nice to see you again :)

The world feels small ;-)

Gokul-Radhakrishnan commented 1 year ago

Hi @swoehrl-mw

Afterwards you would need to get the operator to do a rolling restart (for example by adding a dummy change to the config). Theoretically this should work without downtime.

Kubernetes tracks the change in secret and updates the volume automatically

From k8s documentation: When a volume contains data from a Secret, and that Secret is updated, Kubernetes tracks this and updates the data in the volume, using an eventually-consistent approach

The operator should check during reconcile runs if any certificates are about to expire and renew them if needed.

Doing this alone should be fine I feel

swoehrl-mw commented 1 year ago

@Gokul-Radhakrishnan

If we enable hot-reload of certs (which AFAIK is disabled by default) then yes, just updating the secrets should be enough.
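
For reference, a minimal sketch of how hot reload might be switched on through the operator. It assumes the security plugin setting `plugins.security.ssl_cert_reload_enabled` (which applies to OpenSearch 2.x, to my knowledge) and the operator's `spec.general.additionalConfig` map for passing extra opensearch.yml settings. Neither is confirmed in this thread, so check both against the versions you run.

```yaml
apiVersion: opensearch.opster.io/v1
kind: OpenSearchCluster
metadata:
  name: mycluster            # hypothetical cluster name
spec:
  general:
    additionalConfig:
      # Assumed setting name; hot reload is off by default
      plugins.security.ssl_cert_reload_enabled: "true"
```

As far as I know, this setting only enables the security plugin's reload API. Something (the operator or a job) would likely still have to call that API with the admin certificate after the secret changes.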

jonathon2nd commented 11 months ago

Any movement on this?

Set up a new cluster using our PKI, following this.

I set it up a couple of days ago and went to make a change this morning, but the node I restarted would not come up, with errors like:

javax.net.ssl.SSLHandshakeException: PKIX path validation failed: java.security.cert.CertPathValidatorException: validity check failed
[ERROR][o.o.s.s.t.SecuritySSLNettyTransport] [mycluster-masters-2] Exception during establishing a SSL connection: javax.net.ssl.SSLHandshakeException: PKIX path validation failed: java.security.cert.CertPathValidatorException: validity check failed

This happens even though it has mounted the new cert. To get everything up and running, all nodes need to be restarted.

Using these certs

---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: opensearch-certs-pki
  namespace: opensearch
spec:
  secretName: opensearch-certs-pki
  privateKey:
    size: 2048
    algorithm: RSA
    encoding: PKCS8
  dnsNames:
    - mycluster
    - mycluster-masters-0
    - mycluster-masters-1
    - mycluster-masters-2
    - mycluster-bootstrap-0
    - mycluster-discovery
    - mycluster.opensearch
    - mycluster.opensearch.svc
    - mycluster.opensearch.svc.cluster.local
  usages:
    - key encipherment
    - server auth
    - client auth
  commonName: Opensearch_Node
  issuerRef:
    group: certmanager.step.sm
    kind: StepClusterIssuer
    name: step-issuer
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: opensearch-admin-certs-pki
  namespace: opensearch
spec:
  secretName: opensearch-admin-certs-pki
  privateKey:
    size: 2048
    algorithm: RSA
    encoding: PKCS8
  commonName: OpenSearch_Admin
  usages:
    - key encipherment
    - server auth
    - client auth
  issuerRef:
    group: certmanager.step.sm
    kind: StepClusterIssuer
    name: step-issuer
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: opensearch-dashboards-certs-pki
  namespace: opensearch
spec:
  secretName: opensearch-dashboards-certs-pki
  privateKey:
    size: 2048
    algorithm: RSA
    encoding: PKCS8
  dnsNames:
    - mycluster-dashboards
  usages:
    - key encipherment
    - server auth
    - client auth
  issuerRef:
    group: certmanager.step.sm
    kind: StepClusterIssuer
    name: step-issuer

And this config

---
apiVersion: opensearch.opster.io/v1
kind: OpenSearchCluster
metadata:
  name: mycluster
  namespace: opensearch
spec:
  security:
    tls:  # Everything related to TLS configuration
      transport:
        generate: false
        perNode: false
        secret:
          name: opensearch-certs-pki
        nodesDn: ["CN=Opensearch_Node"]
        adminDn: ["CN=OpenSearch_Admin"]
      http:
        generate: false
        secret:
          name: opensearch-certs-pki
    config:
      adminSecret:
        name: opensearch-admin-certs-pki
      securityConfigSecret:
        name: securityconfig-secret
      adminCredentialsSecret:
        name: mycluster-admin-password
  general:
    serviceName: mycluster
    version: 2.10.0
    setVMMaxMapCount: true
  dashboards:
    enable: true
    opensearchCredentialsSecret:
      name: mycluster-admin-password
    tls:
      enable: true
      generate: false
      secret:
        name: opensearch-dashboards-certs-pki
    version: 2.10.0
    replicas: 2

jonathon2nd commented 11 months ago

For now, I am going to test with this: https://github.com/stakater/Reloader
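
A rough sketch of what that would look like: Reloader watches workloads that carry its annotations and triggers a rolling restart when a referenced Secret changes. The StatefulSet name below is a guess at what the operator creates, and whether the operator keeps annotations added to StatefulSets it manages is not verified here.

```yaml
# Fragment only: the StatefulSet spec is omitted
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mycluster-masters    # assumed name of the operator-managed StatefulSet
  namespace: opensearch
  annotations:
    # Restart pods when any Secret or ConfigMap used by this workload changes
    reloader.stakater.com/auto: "true"
    # Alternative: restart only when the named Secret changes
    # secret.reloader.stakater.com/reload: "opensearch-certs-pki"
```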

KannappanSomu commented 11 months ago

my certificates expired today :(

asturm-fe commented 10 months ago

Without having tested it: If you delete the <cluster-name>-transport-cert and <cluster-name>-http-cert secrets the operator should generate new ones during the next reconcile run (so after 30 seconds). Afterwards you would need to get the operator to do a rolling restart (for example by adding a dummy change to the config). Theoretically this should work without downtime.

@KannappanSomu could you perhaps confirm this approach as a feasible workaround until automatic cert-renewal is implemented?

ibotty commented 10 months ago

@KannappanSomu could you perhaps confirm this approach as a feasible workaround until automatic cert-renewal is implemented?

This worked in my cluster. I still had to scale down the operator to update (recreate) the statefulset with spec.podManagementPolicy: Parallel though. That's bug #685.

KannappanSomu commented 10 months ago

@asturm-fe Works for my cluster too. thanks !

albgus commented 9 months ago

This really seems like something that should have a clear warning in the docs: Like, your cluster will stop working after exactly one year.

Or at least put a warning that this project is far from mature if something critical like this can go unsolved for more than a year after reporting..

Siradjedd commented 6 months ago

@jonathon2nd any updates ?

AniketKariya commented 5 months ago

@swoehrl-mw

If you delete the <cluster-name>-transport-cert and <cluster-name>-http-cert secrets the operator should generate new ones during the next reconcile run (so after 30 seconds).

The admin certs expire as well right? Don't we need to regenerate them as well? With them expired, Security APIs that require admin cert auth might not work, right?

swoehrl-mw commented 5 months ago

The admin certs expire as well right? Don't we need to regenerate them as well? With them expired, Security APIs that require admin cert auth might not work, right?

@AniketKariya You are correct, the admin cert needs to be recreated as well, otherwise the securityconfig-update job would not work.

flavienbwk commented 2 months ago

I have created this repo as a temporary (but stable) help while we wait for an official implementation in the operator: https://github.com/flavienbwk/opensearch-k8s-certmanager

It explains how to set up cert-manager + Reloader with ready-to-deploy examples. It might also help with #141.
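
As a rough illustration of the cert-manager side (not taken from that repository): a Certificate can carry `duration` and `renewBefore`, so cert-manager reissues it into the same Secret ahead of expiry. The names, the values, and the issuer below are placeholders, and the issuer may cap the lifetime it actually grants.

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: opensearch-certs       # placeholder name
  namespace: opensearch
spec:
  secretName: opensearch-certs
  duration: 2160h              # requested lifetime (90 days)
  renewBefore: 360h            # reissue 15 days before expiry
  privateKey:
    algorithm: RSA
    size: 2048
    encoding: PKCS8
  commonName: Opensearch_Node
  dnsNames:
    - mycluster
    - mycluster.opensearch.svc
  issuerRef:
    name: my-issuer            # placeholder issuer
    kind: ClusterIssuer
```

Renewal only replaces the Secret. As discussed earlier in this thread, the OpenSearch pods still need a restart (for example through Reloader) or a certificate reload to pick it up.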