opensearch-project / opensearch-k8s-operator

OpenSearch Kubernetes Operator
Apache License 2.0

invalid memory address or nil pointer in operator-controller-manager when providing http certs #666

Open bmaguireibm opened 11 months ago

bmaguireibm commented 11 months ago

Hi, thanks for the great operator. I believe I've hit a bug when trying to provide my own certificates for the external http api. Below are the details of the error, any help is greatly appreciated.

Kubernetes version: v1.26.6
opensearch-operator version: 2.4.0
platform: AKS

Expected behaviour: I was trying to provide a TLS certificate for the HTTP API. The secret is generated by vault secret operator, but ultimately this produces a Kubernetes tls secret in PEM format with tls.key and tls.crt. I also provide a separate secret for ca.crt. Both secrets are generated in the same namespace and appear to be valid PEM-formatted certs with the correct keys. I expect the cluster to be created using this cert for its HTTP API on port 9200.

Actual behaviour: The operator-controller-manager goes into crash loop back off with the following error in the logs.

{"level":"info","ts":"2023-11-23T15:35:37.377Z","msg":"Generating certificates","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"test-cluster","namespace":"middleware"},"namespace":"middleware","name":"test-cluster","reconcileID":"124ef20c-40a0-4fa3-8695-5a667cda86ab","interface":"transport"}
{"level":"info","ts":"2023-11-23T15:35:37.388Z","msg":"Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"test-cluster","namespace":"middleware"},"namespace":"middleware","name":"test-cluster","reconcileID":"124ef20c-40a0-4fa3-8695-5a667cda86ab"}
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1321275]

goroutine 328 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:115 +0x1fa
panic({0x18c5d60, 0x2aee8c0})
        /usr/local/go/src/runtime/panic.go:884 +0x212
opensearch.opster.io/pkg/tls.(*implCertValidater).IsSignedByCA(0xc000fd69b0, {0x1d9a8a0?, 0xc002450690?})
        /workspace/pkg/tls/pki.go:265 +0x35
opensearch.opster.io/pkg/reconcilers.(*TLSReconciler).shouldCreateAdminCert(0xc0001cf080, {0x1d9a8a0, 0xc002450690})
        /workspace/pkg/reconcilers/tls.go:211 +0x23d
opensearch.opster.io/pkg/reconcilers.(*TLSReconciler).createAdminSecret(0xc0001cf080, {0x1d9a8a0, 0xc002450690})
        /workspace/pkg/reconcilers/tls.go:224 +0x45
opensearch.opster.io/pkg/reconcilers.(*TLSReconciler).handleAdminCertificate(0xc0001cf080)
        /workspace/pkg/reconcilers/tls.go:122 +0x6a
opensearch.opster.io/pkg/reconcilers.(*TLSReconciler).Reconcile(0xc0001cf080)
        /workspace/pkg/reconcilers/tls.go:83 +0x89
opensearch.opster.io/controllers.(*OpenSearchClusterReconciler).reconcilePhaseRunning(0xc00003e690, {0x1d99898, 0xc0008b03c0})
        /workspace/controllers/opensearchController.go:321 +0x74b
opensearch.opster.io/controllers.(*OpenSearchClusterReconciler).Reconcile(0xc00003e690, {0x1d99898, 0xc0008b03c0}, {{{0xc0007b4066, 0xa}, {0xc0001bb188, 0x17}}})
        /workspace/controllers/opensearchController.go:142 +0x768
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x1d99898?, {0x1d99898?, 0xc0008b03c0?}, {{{0xc0007b4066?, 0x1829e20?}, {0xc0001bb188?, 0x10?}}})
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:118 +0xc8
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000538640, {0x1d997f0, 0xc00052e740}, {0x1942280?, 0xc000152420?})
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:314 +0x3a5
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000538640, {0x1d997f0, 0xc00052e740})
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:265 +0x1d9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:226 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:222

My cluster config is as follows:


apiVersion: opensearch.opster.io/v1
kind: OpenSearchCluster
metadata:
  name: test-cluster
spec:
  security:
    config:
    tls:
       http:
          generate: false
          secret:
            name: opensearch-certs
          caSecret:
            name: ca-secret
       transport:
          generate: true
          perNode: true
  general:
    httpPort: 9200
    serviceName: test-cluster
    version: 2.3.0
    pluginsList: ["repository-s3"]
    drainDataNodes: true
    setVMMaxMapCount: true
  dashboards:
    tls:
      enable: true
      generate: true
    version: 2.3.0
    enable: true
    replicas: 1
    diskSize: "10Gi"
    resources:
      requests:
         memory: "512Mi"
         cpu: "200m"
      limits:
         memory: "512Mi"
         cpu: "200m"
  nodePools:
    - component: masters
      replicas: 3
      resources:
         requests:
            memory: 4Gi
            cpu: 1000m
         limits:
            memory: 4Gi
            cpu: 1000m
      roles:
        - "data"
        - "cluster_manager"

swoehrl-mw commented 11 months ago

Hi @bmaguireibm. Following the stack trace in your error log, it looks like the operator cannot decode the PEM data of the CA certificate. Can you please verify that the provided cert is actually valid (for example by checking with openssl)?

You can also use the following small Go program to read the cert the same way the operator does:

package main

import (
    "crypto/x509"
    "encoding/pem"
    "fmt"
    "os"
)

func main() {
    data, err := os.ReadFile("ca.crt")
    if err != nil {
        fmt.Printf("Could not open file: %s\n", err)
        return
    }
    block, _ := pem.Decode(data)
    if block == nil {
        fmt.Printf("Could not decode as PEM data\n")
        return
    }
    caCert, err := x509.ParseCertificate(block.Bytes)
    if err != nil {
        fmt.Printf("Could not parse certificate: %s\n", err)
        return
    }
    fmt.Printf("Certificate has subject '%s'\n", caCert.Subject)
}

Just run it with go run main.go (assuming you placed the code in main.go and the cert from the secret in ca.crt).

Regardless of this, even in the case of invalid data the operator should not crash but should provide a proper error message and continue, so this is a bug either way.
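
To illustrate the point about not crashing on invalid data, here is a minimal sketch of a PEM parsing helper that returns an error instead of dereferencing a nil block. The name parseCACert is hypothetical and this is not the operator's actual code in pkg/tls/pki.go; it only shows the kind of guard that would turn the panic into a normal error.

```go
package main

import (
	"crypto/x509"
	"encoding/pem"
	"errors"
	"fmt"
)

// parseCACert is a hypothetical helper (not taken from the operator).
// It decodes the first PEM block and parses it as an X.509 certificate,
// returning an error for every failure case instead of panicking.
func parseCACert(data []byte) (*x509.Certificate, error) {
	block, _ := pem.Decode(data)
	if block == nil {
		// pem.Decode returns a nil block when no PEM data is found.
		// Using block.Bytes without this check is what causes a nil
		// pointer dereference.
		return nil, errors.New("no PEM block found in CA data")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		return nil, fmt.Errorf("parsing CA certificate: %w", err)
	}
	return cert, nil
}

func main() {
	// Deliberately invalid input: the helper reports an error and the
	// program keeps running.
	if _, err := parseCACert([]byte("not a certificate")); err != nil {
		fmt.Println("CA rejected:", err)
	}
}
```

With a guard like this, a reconciler could log the error and requeue rather than going into a crash loop.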

mchiappini commented 11 months ago

I've got the same error as @bmaguireibm. The output from the go program shows the following:

Certificate has subject 'CN=os,OU=OS,O=OS,L=OS,ST=OS,C=NL,1.2.840.113549.1.9.1=#13026f73'

The certificate looks to be valid.

gfdsa commented 2 months ago

It does provide a proper message just before it crashes:

kubectl -n opensearch logs -f opensearch-operator-controller-manager-6bd4fcb57f-9znlk operator-controller-manager | grep '^{'|jq 'select(.level=="error")'
{
  "level": "error",
  "ts": "2024-09-13T09:32:00.027Z",
  "msg": "Failed to create admin certificate",
  "controller": "opensearchcluster",
  "controllerGroup": "opensearch.opster.io",
  "controllerKind": "OpenSearchCluster",
  "OpenSearchCluster": {
    "name": "os-noprd",
    "namespace": "opensearch"
  },
  "namespace": "opensearch",
  "name": "os-noprd",
  "reconcileID": "ec29fbb0-e2be-410e-a1cd-4ee45cc457cb",
  "interface": "transport",
  "error": "tls: failed to find any PEM data in key input",
  "stacktrace": "github.com/Opster/opensearch-k8s-operator/opensearch-operator/pkg/reconcilers.(*TLSReconciler).createAdminSecret\n\t/workspace/pkg/reconcilers/tls.go:224\ngithub.com/Opster/opensearch-k8s-operator/opensearch-operator/pkg/reconcilers.(*TLSReconciler).handleAdminCertificate\n\t/workspace/pkg/reconcilers/tls.go:116\ngithub.com/Opster/opensearch-k8s-operator/opensearch-operator/pkg/reconcilers.(*TLSReconciler).Reconcile\n\t/workspace/pkg/reconcilers/tls.go:77\ngithub.com/Opster/opensearch-k8s-operator/opensearch-operator/controllers.(*OpenSearchClusterReconciler).reconcilePhaseRunning\n\t/workspace/controllers/opensearchController.go:328\ngithub.com/Opster/opensearch-k8s-operator/opensearch-operator/controllers.(*OpenSearchClusterReconciler).Reconcile\n\t/workspace/controllers/opensearchController.go:143\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:226"
}
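
The error text "tls: failed to find any PEM data in key input" is what Go's crypto/tls package returns from tls.X509KeyPair when the key bytes contain no PEM block. As a rough way to reproduce that check outside the operator, the sketch below loads a certificate and key from local files. The file names tls.crt and tls.key and the helper name checkKeyPair are assumptions for illustration, and whether the operator calls tls.X509KeyPair at this exact point is not confirmed here.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"os"
)

// checkKeyPair is a hypothetical helper. It reports whether the PEM
// encoded certificate and private key can be loaded as a matching pair.
func checkKeyPair(certPEM, keyPEM []byte) error {
	if _, err := tls.X509KeyPair(certPEM, keyPEM); err != nil {
		return fmt.Errorf("certificate/key pair is not usable: %w", err)
	}
	return nil
}

func main() {
	// Assumed file names: the tls.crt and tls.key entries copied out of
	// the Kubernetes secret.
	certPEM, err := os.ReadFile("tls.crt")
	if err != nil {
		fmt.Println("Could not read tls.crt:", err)
		return
	}
	keyPEM, err := os.ReadFile("tls.key")
	if err != nil {
		fmt.Println("Could not read tls.key:", err)
		return
	}
	if err := checkKeyPair(certPEM, keyPEM); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("Certificate and key load as a valid pair")
}
```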

I didn't look at the code, but it seems the admin cert depends on the http CA somehow? Edit: from the transport TLS section of the docs: "If you provide your own node certificates you must also provide an admin cert that the operator can use for managing the cluster:". Should that be in the http TLS section? Is the admin cert used to communicate over http or over transport?

edit2: from the same log: "admin cert is not signed by CA, recreating"
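
For anyone who wants to check the "not signed by CA" condition by hand, the following is a minimal sketch using only the Go standard library. The helper name verifySignedByCA and the file names admin.crt and ca.crt are assumptions, and this is not the operator's IsSignedByCA implementation. Note that Certificate.Verify also checks validity dates, so an expired cert fails as well.

```go
package main

import (
	"crypto/x509"
	"encoding/pem"
	"errors"
	"fmt"
	"os"
)

// verifySignedByCA is a hypothetical helper. It checks that the first
// certificate in certPEM chains to a CA certificate found in caPEM.
func verifySignedByCA(certPEM, caPEM []byte) error {
	roots := x509.NewCertPool()
	if !roots.AppendCertsFromPEM(caPEM) {
		return errors.New("no CA certificate found in CA PEM data")
	}
	block, _ := pem.Decode(certPEM)
	if block == nil {
		return errors.New("no PEM block found in certificate data")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		return fmt.Errorf("parsing certificate: %w", err)
	}
	opts := x509.VerifyOptions{
		Roots: roots,
		// Accept any extended key usage so that client or admin
		// certificates are not rejected for lacking server auth.
		KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageAny},
	}
	if _, err := cert.Verify(opts); err != nil {
		return fmt.Errorf("certificate does not chain to the CA: %w", err)
	}
	return nil
}

func main() {
	// Assumed file names for the admin cert and the CA cert.
	certPEM, err := os.ReadFile("admin.crt")
	if err != nil {
		fmt.Println("Could not read admin.crt:", err)
		return
	}
	caPEM, err := os.ReadFile("ca.crt")
	if err != nil {
		fmt.Println("Could not read ca.crt:", err)
		return
	}
	if err := verifySignedByCA(certPEM, caPEM); err != nil {
		fmt.Println("Check failed:", err)
		return
	}
	fmt.Println("Certificate is signed by the CA")
}
```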

swoehrl-mw commented 1 month ago

Hi @gfdsa. Which CA the admin cert must be signed by (or is created from) depends on the opensearch version: For 2.x the http CA is used, for 1.x the transport CA. This relates to a change in opensearch where admin interaction (e.g. to update the securityconfig) is handled via the https port in 2.x and no longer via the transport port.
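
The version rule described above can be written down as a small sketch. The function name adminCertCA and the returned strings are hypothetical and not taken from the operator's source. Only the 1.x and 2.x cases come from this thread, so other major versions are reported as errors rather than guessed.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// adminCertCA is a hypothetical helper. It returns which CA the admin
// cert is expected to be signed by for a given OpenSearch version:
// "transport" for 1.x and "http" for 2.x.
func adminCertCA(version string) (string, error) {
	majorStr, _, _ := strings.Cut(version, ".")
	major, err := strconv.Atoi(majorStr)
	if err != nil {
		return "", fmt.Errorf("invalid version %q: %w", version, err)
	}
	switch major {
	case 1:
		return "transport", nil
	case 2:
		return "http", nil
	default:
		// Not covered by the rule stated in this thread.
		return "", fmt.Errorf("no rule known for major version %d", major)
	}
}

func main() {
	for _, v := range []string{"1.3.0", "2.3.0"} {
		ca, err := adminCertCA(v)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("OpenSearch %s: admin cert uses the %s CA\n", v, ca)
	}
}
```

For the cluster in this issue (version 2.3.0), that means the admin cert has to be signed by the CA supplied in the http caSecret.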

gfdsa commented 1 month ago

OK, so the documentation is lagging behind the changes. I got my cluster running last week with the admin cert created from our CA, but I had to remove all the existing secrets containing certs to make it go smoothly.

swoehrl-mw commented 1 month ago

I've never tried a setup with old existing certs and a new custom CA, so it is quite possible that the operator cannot fully handle that. It is not really one of the core use cases.