percona / percona-server-mongodb-operator

Percona Operator for MongoDB
https://www.percona.com/doc/kubernetes-operator-for-psmongodb/
Apache License 2.0

K8SPSMDB-755: Fix tlsMode for mongos #1540

Closed: egegunes closed this PR 5 months ago.

egegunes commented 5 months ago

K8SPSMDB-755

CHANGE DESCRIPTION

Problem: Short explanation of the problem.

Cause: Short explanation of the root cause of the issue if applicable.

Solution: Short explanation of the solution we are providing with this PR.

CHECKLIST

Jira

Tests

Config/Logging/Testability

JNKPercona commented 5 months ago
Test name | Status
--- | ---
arbiter | passed
balancer | passed
custom-replset-name | passed
cross-site-sharded | passed
data-at-rest-encryption | passed
data-sharded | passed
demand-backup | passed
demand-backup-eks-credentials | passed
demand-backup-physical | passed
demand-backup-physical-sharded | passed
demand-backup-sharded | passed
expose-sharded | passed
ignore-labels-annotations | passed
init-deploy | passed
finalizer | passed
ldap | passed
ldap-tls | passed
limits | passed
liveness | passed
mongod-major-upgrade | passed
mongod-major-upgrade-sharded | passed
monitoring-2-0 | passed
multi-cluster-service | passed
non-voting | passed
one-pod | passed
operator-self-healing-chaos | passed
pitr | passed
pitr-sharded | passed
pitr-physical | passed
pvc-resize | passed
recover-no-primary | passed
rs-shard-migration | passed
scaling | passed
scheduled-backup | passed
security-context | passed
self-healing-chaos | passed
service-per-pod | passed
serviceless-external-nodes | passed
smart-update | passed
split-horizon | passed
storage | passed
tls-issue-cert-manager | passed
upgrade | passed
upgrade-consistency | passed
upgrade-consistency-sharded-tls | passed
upgrade-sharded | passed
users | passed
version-service | passed
We ran 48 out of 48 tests.

commit: https://github.com/percona/percona-server-mongodb-operator/pull/1540/commits/1d9c93792f7264d46c30b78bcbb1b947d0951de9
image: perconalab/percona-server-mongodb-operator:PR-1540-1d9c9379

kantorcodes commented 5 months ago

I believe this PR has broken running instances on CR 1.15.0 once the branch is pulled down - has this been tested?

hors commented 5 months ago

I believe this PR has broken running instances once the branch is pulled down - has this been tested?

It was tested by our e2e tests. After the merge, our QA team will perform additional tests. Please do not use the main branch for production needs; it can be unstable.

hors commented 5 months ago

@kantorcodes could you please provide your CR so we can test your case as well?

kantorcodes commented 5 months ago

@kantorcodes could you please provide your CR so we can test your case as well?

On CR 1.16.0, cfg0-3 start; however, mongos-0 reports "Host failed in replica set" and "Error connecting to XX.XX.XX".

On CR 1.15.0, cfg-0 reports: "/opt/percona/ps-entry.sh: line 522: exec: numactl --interleave=all: not found" and mongos-0 does not start at all.

kantorcodes commented 5 months ago

Do you have a recommended setup for running without TLS for the following variables?

- spec.image in cr.yaml
- upgradeOptions.apply in cr.yaml
- CR in cr.yaml
- spec.containers.image in bundle.yaml
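A minimal sketch of one TLS-disabled combination for the released 1.15.0 operator (the image tag, apply value, and operator image below are assumptions drawn from elsewhere in this thread and the 1.15.0 release notes, not an official recommendation):

spec:
  crVersion: 1.15.0                                 # CR version in cr.yaml
  image: percona/percona-server-mongodb:6.0.9-7     # spec.image in cr.yaml
  upgradeOptions:
    apply: disabled                                 # pin the version; no automatic upgrades
  allowUnsafeConfigurations: true                   # required to run without TLS on 1.15.0
# and in bundle.yaml, the operator container would use:
#   image: percona/percona-server-mongodb-operator:1.15.0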

hors commented 5 months ago

@kantorcodes could you please provide your CR so we can test your case as well?

On CR 1.15.0, cfg-0 reports: "/opt/percona/ps-entry.sh: line 522: exec: numactl --interleave=all: not found" and mongos-0 does not start at all.

As you can see from the release notes, the PSMDB operator 1.15 was tested with MongoDB 4.4.24, 5.0.20, and 6.0.9, and numactl was added to those Docker files: https://docs.percona.com/percona-operator-for-mongodb/RN/Kubernetes-Operator-for-PSMONGODB-RN1.15.0.html#supported-platforms:~:text=MongoDB%204.4.24%2C%205.0.20%2C%20and%206.0.9

kantorcodes commented 5 months ago

https://docs.percona.com/percona-operator-for-mongodb/RN/Kubernetes-Operator-for-PSMONGODB-RN1.15.0.html#supported-platforms:~:text=MongoDB%204.4.24%2C%205.0.20%2C%20and%206.0.9

How would we force version 6.0.9 when specifying spec.image in cr.yaml, and how do we ensure the code for the operator in bundle.yaml is using 1.15.0?

hors commented 5 months ago

https://docs.percona.com/percona-operator-for-mongodb/RN/Kubernetes-Operator-for-PSMONGODB-RN1.15.0.html#supported-platforms:~:text=MongoDB%204.4.24%2C%205.0.20%2C%20and%206.0.9

How would we force version 6.0.9 when specifying spec.image in cr.yaml?

You can set it via the https://github.com/percona/percona-server-mongodb-operator/blob/v1.15.0/deploy/cr.yaml#L15 option.

kantorcodes commented 5 months ago

https://docs.percona.com/percona-operator-for-mongodb/RN/Kubernetes-Operator-for-PSMONGODB-RN1.15.0.html#supported-platforms:~:text=MongoDB%204.4.24%2C%205.0.20%2C%20and%206.0.9

How would we force version 6.0.9 when specifying spec.image in cr.yaml?

You can set it via the https://github.com/percona/percona-server-mongodb-operator/blob/v1.15.0/deploy/cr.yaml#L15 option.

I mean, what would be the correct value?

hors commented 5 months ago

https://docs.percona.com/percona-operator-for-mongodb/RN/Kubernetes-Operator-for-PSMONGODB-RN1.15.0.html#supported-platforms:~:text=MongoDB%204.4.24%2C%205.0.20%2C%20and%206.0.9

How would we force version 6.0.9 when specifying spec.image in cr.yaml?

You can set it via the https://github.com/percona/percona-server-mongodb-operator/blob/v1.15.0/deploy/cr.yaml#L15 option.

I mean, what would be the correct value?

Using this link, you can get the correct value as well :)
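For reference, a minimal sketch of what that line resolves to (assuming the defaults shipped in the v1.15.0 tag of deploy/cr.yaml; verify against the linked file):

spec:
  crVersion: 1.15.0
  image: percona/percona-server-mongodb:6.0.9-7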

hors commented 5 months ago

@kantorcodes could you please provide your CR so we can test your case as well?

On CR 1.16.0, cfg0-3 start; however, mongos-0 reports "Host failed in replica set" and "Error connecting to XX.XX.XX".

Did you use the default CR? I can't reproduce it :(

kantorcodes commented 5 months ago

https://docs.percona.com/percona-operator-for-mongodb/RN/Kubernetes-Operator-for-PSMONGODB-RN1.15.0.html#supported-platforms:~:text=MongoDB%204.4.24%2C%205.0.20%2C%20and%206.0.9

How would we force version 6.0.9 when specifying spec.image in cr.yaml?

You can set it via the https://github.com/percona/percona-server-mongodb-operator/blob/v1.15.0/deploy/cr.yaml#L15 option.

I mean, what would be the correct value?

Using this link, you can get the correct value as well :)

Utilizing these combinations with TLS disabled, I am getting the following error.

{"t":{"$date":"2024-05-05T15:51:53.046Z"},"s":"F", "c":"CONTROL", "id":20574, "ctx":"-","msg":"Error during global initialization","attr":{"error":{"code":2,"codeName":"BadValue","errmsg":"need to enable TLS via the sslMode/tlsMode flag when using TLS configuration parameters"}}}

Happy to hop on a video call if you're willing to dissect this together further. Would that be helpful? Note: unsafeFlags and tls were added after this PR went up. Simply specifying allowUnsafeConfigurations worked previously; I suspect a smart update cascaded issues here.

spec:
#  platform: openshift
#  clusterServiceDNSSuffix: svc.cluster.local
  clusterServiceDNSMode: "External"
#  pause: true
#  unmanaged: false
  crVersion: 1.15.0
  image: percona/percona-server-mongodb:6.0.9-7
  imagePullPolicy: Always
  unsafeFlags:
    tls: true
  tls:
    allowInvalidCertificates: true
    mode: disabled
  #   enabled: false
#    # 90 days in hours
#    certValidityDuration: 2160h
#  imagePullSecrets:
#    - name: private-registry-credentials
#  initImage: perconalab/percona-server-mongodb-operator:main
#  initContainerSecurityContext: {}
  allowUnsafeConfigurations: true
  updateStrategy: SmartUpdate
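For context on the BadValue error above: mongod and mongos refuse to start when any TLS configuration parameter is supplied while tlsMode/sslMode is disabled. A hypothetical mongos config fragment that would trigger exactly this error (the certificate path is made up for illustration):

net:
  tls:
    mode: disabled
    certificateKeyFile: /etc/mongodb-ssl/tls.pem   # hypothetical path; a TLS parameter
    allowInvalidCertificates: true                 # another TLS parameter; both conflict with mode: disabled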
kantorcodes commented 5 months ago

Would specifying initImage be helpful with this new edge case?

hors commented 5 months ago

Utilizing these combinations with TLS disabled, I am getting the following error.

{"t":{"$date":"2024-05-05T15:51:53.046Z"},"s":"F", "c":"CONTROL", "id":20574, "ctx":"-","msg":"Error during global initialization","attr":{"error":{"code":2,"codeName":"BadValue","errmsg":"need to enable TLS via the sslMode/tlsMode flag when using TLS configuration parameters"}}}

Happy to hop on a video call if you're willing to dissect this together further. Would that be helpful? Note: unsafeFlags and tls were added after this PR went up. Simply specifying allowUnsafeConfigurations worked previously; I suspect a smart update cascaded issues here.

spec:
#  platform: openshift
#  clusterServiceDNSSuffix: svc.cluster.local
  clusterServiceDNSMode: "External"
#  pause: true
#  unmanaged: false
  crVersion: 1.15.0
  image: percona/percona-server-mongodb:6.0.9-7
  imagePullPolicy: Always
  unsafeFlags:
    tls: true
  tls:
    allowInvalidCertificates: true
    mode: disabled
  #   enabled: false
#    # 90 days in hours
#    certValidityDuration: 2160h
#  imagePullSecrets:
#    - name: private-registry-credentials
#  initImage: perconalab/percona-server-mongodb-operator:main
#  initContainerSecurityContext: {}
  allowUnsafeConfigurations: true
  updateStrategy: SmartUpdate

OK, thanks for the CR. We will check it tomorrow morning.

kantorcodes commented 5 months ago

Utilizing these combinations with TLS disabled, I am getting the following error.

{"t":{"$date":"2024-05-05T15:51:53.046Z"},"s":"F", "c":"CONTROL", "id":20574, "ctx":"-","msg":"Error during global initialization","attr":{"error":{"code":2,"codeName":"BadValue","errmsg":"need to enable TLS via the sslMode/tlsMode flag when using TLS configuration parameters"}}}

Happy to hop on a video call if you're willing to dissect this together further. Would that be helpful? Note: unsafeFlags and tls were added after this PR went up. Simply specifying allowUnsafeConfigurations worked previously; I suspect a smart update cascaded issues here.

spec:
#  platform: openshift
#  clusterServiceDNSSuffix: svc.cluster.local
  clusterServiceDNSMode: "External"
#  pause: true
#  unmanaged: false
  crVersion: 1.15.0
  image: percona/percona-server-mongodb:6.0.9-7
  imagePullPolicy: Always
  unsafeFlags:
    tls: true
  tls:
    allowInvalidCertificates: true
    mode: disabled
  #   enabled: false
#    # 90 days in hours
#    certValidityDuration: 2160h
#  imagePullSecrets:
#    - name: private-registry-credentials
#  initImage: perconalab/percona-server-mongodb-operator:main
#  initContainerSecurityContext: {}
  allowUnsafeConfigurations: true
  updateStrategy: SmartUpdate

OK, thanks for the CR. We will check it tomorrow morning.

Would you have any ideas for a stopgap solution in production? I believe a smart update could affect other servers running with a similar setup and bring them down.

hors commented 5 months ago

Happy to hop on a video call if you're willing to dissect this together further. Would that be helpful? Note: unsafeFlags and tls were added after this PR went up. Simply specifying allowUnsafeConfigurations worked previously; I suspect a smart update cascaded issues here.

spec:
#  platform: openshift
#  clusterServiceDNSSuffix: svc.cluster.local
  clusterServiceDNSMode: "External"
#  pause: true
#  unmanaged: false
  crVersion: 1.15.0
  image: percona/percona-server-mongodb:6.0.9-7
  imagePullPolicy: Always
  unsafeFlags:
    tls: true
  tls:
    allowInvalidCertificates: true
    mode: disabled
  #   enabled: false
#    # 90 days in hours
#    certValidityDuration: 2160h
#  imagePullSecrets:
#    - name: private-registry-credentials
#  initImage: perconalab/percona-server-mongodb-operator:main
#  initContainerSecurityContext: {}
  allowUnsafeConfigurations: true
  updateStrategy: SmartUpdate

OK, thanks for the CR. We will check it tomorrow morning.

Would you have any ideas for a stopgap solution in production? I believe a smart update could affect other servers running with a similar setup and bring them down.

Please do not use the main branch for production; it has not been tested by the QA team. We run all needed tests before a release. You only need to use officially released versions of our operators.

kantorcodes commented 5 months ago

Happy to hop on a video call if you're willing to dissect this together further. Would that be helpful? Note: unsafeFlags and tls were added after this PR went up. Simply specifying allowUnsafeConfigurations worked previously; I suspect a smart update cascaded issues here.

spec:
#  platform: openshift
#  clusterServiceDNSSuffix: svc.cluster.local
  clusterServiceDNSMode: "External"
#  pause: true
#  unmanaged: false
  crVersion: 1.15.0
  image: percona/percona-server-mongodb:6.0.9-7
  imagePullPolicy: Always
  unsafeFlags:
    tls: true
  tls:
    allowInvalidCertificates: true
    mode: disabled
  #   enabled: false
#    # 90 days in hours
#    certValidityDuration: 2160h
#  imagePullSecrets:
#    - name: private-registry-credentials
#  initImage: perconalab/percona-server-mongodb-operator:main
#  initContainerSecurityContext: {}
  allowUnsafeConfigurations: true
  updateStrategy: SmartUpdate

OK, thanks for the CR. We will check it tomorrow morning.

Would you have any ideas for a stopgap solution in production? I believe a smart update could affect other servers running with a similar setup and bring them down.

Please do not use the main branch for production; it has not been tested by the QA team. We run all needed tests before a release. You only need to use officially released versions of our operators.

Understood. However, despite switching off the main branch, the issue still persists, and it looks like it can be replicated on a fresh setup as well.

hors commented 5 months ago

Happy to hop on a video call if you're willing to dissect this together further. Would that be helpful? Note: unsafeFlags and tls were added after this PR went up. Simply specifying allowUnsafeConfigurations worked previously; I suspect a smart update cascaded issues here.


spec:
#  platform: openshift
#  clusterServiceDNSSuffix: svc.cluster.local
  clusterServiceDNSMode: "External"
#  pause: true
#  unmanaged: false
  crVersion: 1.15.0
  image: percona/percona-server-mongodb:6.0.9-7
  imagePullPolicy: Always
  unsafeFlags:
    tls: true
  tls:
    allowInvalidCertificates: true
    mode: disabled
  #   enabled: false
#    # 90 days in hours
#    certValidityDuration: 2160h
#  imagePullSecrets:
#    - name: private-registry-credentials
#  initImage: perconalab/percona-server-mongodb-operator:main
#  initContainerSecurityContext: {}
  allowUnsafeConfigurations: true
  updateStrategy: SmartUpdate

OK, thanks for the CR. We will check it tomorrow morning.

Would you have any ideas for a stopgap solution in production? I believe a smart update could affect other servers running with a similar setup and bring them down.

Please do not use the main branch for production; it has not been tested by the QA team. We run all needed tests before a release. You only need to use officially released versions of our operators.

Understood. However, despite switching off the main branch, the issue still persists, and it looks like it can be replicated on a fresh setup as well.

It was merged into the main branch only; it can't affect any versions that were released before. The v1.15.0 CRDs do not have the new options, and the old operator does not have the code that was added in main. Before the official release we will test these new options very carefully to make sure the new operator can work with the old CR version.
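To summarize the version split described above, a sketch of the two configuration styles (the unsafeFlags/tls fields are taken from the CR posted earlier in this thread and exist only in main, not in the v1.15.0 CRDs):

# v1.15.0 (released): disabling TLS is gated behind a single flag
spec:
  allowUnsafeConfigurations: true

# main (unreleased, after this PR): the equivalent intent is spelled out explicitly
spec:
  unsafeFlags:
    tls: true
  tls:
    mode: disabled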