percona / percona-server-mongodb-operator

Percona Operator for MongoDB
https://www.percona.com/doc/kubernetes-operator-for-psmongodb/
Apache License 2.0

K8SPSMDB-755: Fix tlsMode for mongos #1540

Closed: egegunes closed this PR 5 months ago.

egegunes commented 5 months ago

K8SPSMDB-755

CHANGE DESCRIPTION

Problem: Short explanation of the problem.

Cause: Short explanation of the root cause of the issue if applicable.

Solution: Short explanation of the solution we are providing with this PR.

CHECKLIST

Jira

Tests

Config/Logging/Testability

JNKPercona commented 5 months ago
Test name | Status
--- | ---
arbiter | passed
balancer | passed
custom-replset-name | passed
cross-site-sharded | passed
data-at-rest-encryption | passed
data-sharded | passed
demand-backup | passed
demand-backup-eks-credentials | passed
demand-backup-physical | passed
demand-backup-physical-sharded | passed
demand-backup-sharded | passed
expose-sharded | passed
ignore-labels-annotations | passed
init-deploy | passed
finalizer | passed
ldap | passed
ldap-tls | passed
limits | passed
liveness | passed
mongod-major-upgrade | passed
mongod-major-upgrade-sharded | passed
monitoring-2-0 | passed
multi-cluster-service | passed
non-voting | passed
one-pod | passed
operator-self-healing-chaos | passed
pitr | passed
pitr-sharded | passed
pitr-physical | passed
pvc-resize | passed
recover-no-primary | passed
rs-shard-migration | passed
scaling | passed
scheduled-backup | passed
security-context | passed
self-healing-chaos | passed
service-per-pod | passed
serviceless-external-nodes | passed
smart-update | passed
split-horizon | passed
storage | passed
tls-issue-cert-manager | passed
upgrade | passed
upgrade-consistency | passed
upgrade-consistency-sharded-tls | passed
upgrade-sharded | passed
users | passed
version-service | passed
We ran 48 out of 48 tests.

commit: https://github.com/percona/percona-server-mongodb-operator/pull/1540/commits/1d9c93792f7264d46c30b78bcbb1b947d0951de9
image: perconalab/percona-server-mongodb-operator:PR-1540-1d9c9379

kantorcodes commented 5 months ago

I believe this PR has broken running instances on CR 1.15.0 once the branch is pulled down - has this been tested?

hors commented 5 months ago

I believe this PR has broken running instances once the branch is pulled down - has this been tested?

It was tested by our e2e tests. After the merge, our QA team will perform additional tests. Please do not use the main branch for production needs; it can be unstable.

hors commented 5 months ago

@kantorcodes could you please provide your CR so we can test your case as well?

kantorcodes commented 5 months ago

@kantorcodes could you please provide your CR so we can test your case as well?

On CR 1.16.0, cfg0-3 start; however, mongos-0 reports "Host failed in replica set" and "Error connecting to XX.XX.XX".

On CR 1.15.0, cfg-0 reports: "/opt/percona/ps-entry.sh: line 522: exec: numactl --interleave=all: not found" and mongos-0 does not start at all.

kantorcodes commented 5 months ago

Do you have a recommended setup for running without TLS for the following variables?

- spec.image in cr.yaml
- upgradeOptions.apply in cr.yaml
- CR in cr.yaml
- spec.containers.image in bundle.yaml
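A minimal sketch of one TLS-disabled combination for the released 1.15.0 operator (the image tag, apply value, and operator image below are assumptions drawn from elsewhere in this thread and the 1.15.0 release notes, not an official recommendation):

spec:
  crVersion: 1.15.0                                 # CR version in cr.yaml
  image: percona/percona-server-mongodb:6.0.9-7     # spec.image in cr.yaml
  upgradeOptions:
    apply: disabled                                 # pin the version; no automatic upgrades
  allowUnsafeConfigurations: true                   # required to run without TLS on 1.15.0
# and in bundle.yaml, the operator container would use:
#   image: percona/percona-server-mongodb-operator:1.15.0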

hors commented 5 months ago

@kantorcodes could you please provide your CR so we can test your case as well?

On CR 1.15.0, cfg-0 reports: "/opt/percona/ps-entry.sh: line 522: exec: numactl --interleave=all: not found" and mongos-0 does not start at all.

As you can see from the release notes, the PSMDB operator 1.15 was tested with MongoDB 4.4.24, 5.0.20, and 6.0.9, and numactl was added to those Docker files: https://docs.percona.com/percona-operator-for-mongodb/RN/Kubernetes-Operator-for-PSMONGODB-RN1.15.0.html#supported-platforms:~:text=MongoDB%204.4.24%2C%205.0.20%2C%20and%206.0.9

kantorcodes commented 5 months ago

https://docs.percona.com/percona-operator-for-mongodb/RN/Kubernetes-Operator-for-PSMONGODB-RN1.15.0.html#supported-platforms:~:text=MongoDB%204.4.24%2C%205.0.20%2C%20and%206.0.9

How would we force version 6.0.9 when specifying spec.image in cr.yaml, and how do we ensure the code for the operator in bundle.yaml is using 1.15.0?

hors commented 5 months ago

https://docs.percona.com/percona-operator-for-mongodb/RN/Kubernetes-Operator-for-PSMONGODB-RN1.15.0.html#supported-platforms:~:text=MongoDB%204.4.24%2C%205.0.20%2C%20and%206.0.9

How would we force version 6.0.9 when specifying spec.image in cr.yaml?

You can set it via the https://github.com/percona/percona-server-mongodb-operator/blob/v1.15.0/deploy/cr.yaml#L15 option.

kantorcodes commented 5 months ago

https://docs.percona.com/percona-operator-for-mongodb/RN/Kubernetes-Operator-for-PSMONGODB-RN1.15.0.html#supported-platforms:~:text=MongoDB%204.4.24%2C%205.0.20%2C%20and%206.0.9

How would we force version 6.0.9 when specifying spec.image in cr.yaml?

You can set it via the https://github.com/percona/percona-server-mongodb-operator/blob/v1.15.0/deploy/cr.yaml#L15 option.

I mean, what would be the correct value?

hors commented 5 months ago

https://docs.percona.com/percona-operator-for-mongodb/RN/Kubernetes-Operator-for-PSMONGODB-RN1.15.0.html#supported-platforms:~:text=MongoDB%204.4.24%2C%205.0.20%2C%20and%206.0.9

How would we force version 6.0.9 when specifying spec.image in cr.yaml?

You can set it via the https://github.com/percona/percona-server-mongodb-operator/blob/v1.15.0/deploy/cr.yaml#L15 option.

I mean, what would be the correct value?

Using this link, you can get the correct value as well :)
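For reference, a minimal sketch of what that line resolves to (assuming the defaults shipped in the v1.15.0 tag of deploy/cr.yaml; verify against the linked file):

spec:
  crVersion: 1.15.0
  image: percona/percona-server-mongodb:6.0.9-7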

hors commented 5 months ago

@kantorcodes could you please provide your CR so we can test your case as well?

On CR 1.16.0, cfg0-3 start; however, mongos-0 reports "Host failed in replica set" and "Error connecting to XX.XX.XX".

Did you use the default CR? I can't reproduce it :(

kantorcodes commented 5 months ago

https://docs.percona.com/percona-operator-for-mongodb/RN/Kubernetes-Operator-for-PSMONGODB-RN1.15.0.html#supported-platforms:~:text=MongoDB%204.4.24%2C%205.0.20%2C%20and%206.0.9

How would we force version 6.0.9 when specifying spec.image in cr.yaml?

You can set it via the https://github.com/percona/percona-server-mongodb-operator/blob/v1.15.0/deploy/cr.yaml#L15 option.

I mean, what would be the correct value?

Using this link, you can get the correct value as well :)

Utilizing these combinations with TLS disabled, I am getting the following error.

{"t":{"$date":"2024-05-05T15:51:53.046Z"},"s":"F", "c":"CONTROL", "id":20574, "ctx":"-","msg":"Error during global initialization","attr":{"error":{"code":2,"codeName":"BadValue","errmsg":"need to enable TLS via the sslMode/tlsMode flag when using TLS configuration parameters"}}}

Happy to hop on a video call if you're willing to dissect this together further. Would that be helpful? Note: unsafeFlags and tls were added after this PR went up. Simply specifying allowUnsafeConfigurations worked previously; I suspect a smart update cascaded issues here.

spec:
#  platform: openshift
#  clusterServiceDNSSuffix: svc.cluster.local
  clusterServiceDNSMode: "External"
#  pause: true
#  unmanaged: false
  crVersion: 1.15.0
  image: percona/percona-server-mongodb:6.0.9-7
  imagePullPolicy: Always
  unsafeFlags:
    tls: true
  tls:
    allowInvalidCertificates: true
    mode: disabled
  #   enabled: false
#    # 90 days in hours
#    certValidityDuration: 2160h
#  imagePullSecrets:
#    - name: private-registry-credentials
#  initImage: perconalab/percona-server-mongodb-operator:main
#  initContainerSecurityContext: {}
  allowUnsafeConfigurations: true
  updateStrategy: SmartUpdate
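For context on the BadValue error above: mongod and mongos refuse to start when any TLS configuration parameter is supplied while tlsMode/sslMode is disabled. A hypothetical mongos config fragment that would trigger exactly this error (the certificate path is made up for illustration):

net:
  tls:
    mode: disabled
    certificateKeyFile: /etc/mongodb-ssl/tls.pem   # hypothetical path; a TLS parameter
    allowInvalidCertificates: true                 # another TLS parameter; both conflict with mode: disabled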
kantorcodes commented 5 months ago

Would specifying initImage be helpful with this new edge case?

hors commented 5 months ago

Utilizing these combinations with TLS disabled, I am getting the following error.

{"t":{"$date":"2024-05-05T15:51:53.046Z"},"s":"F", "c":"CONTROL", "id":20574, "ctx":"-","msg":"Error during global initialization","attr":{"error":{"code":2,"codeName":"BadValue","errmsg":"need to enable TLS via the sslMode/tlsMode flag when using TLS configuration parameters"}}}

Happy to hop on a video call if you're willing to dissect this together further. Would that be helpful? Note: unsafeFlags and tls were added after this PR went up. Simply specifying allowUnsafeConfigurations worked previously; I suspect a smart update cascaded issues here.

spec:
#  platform: openshift
#  clusterServiceDNSSuffix: svc.cluster.local
  clusterServiceDNSMode: "External"
#  pause: true
#  unmanaged: false
  crVersion: 1.15.0
  image: percona/percona-server-mongodb:6.0.9-7
  imagePullPolicy: Always
  unsafeFlags:
    tls: true
  tls:
    allowInvalidCertificates: true
    mode: disabled
  #   enabled: false
#    # 90 days in hours
#    certValidityDuration: 2160h
#  imagePullSecrets:
#    - name: private-registry-credentials
#  initImage: perconalab/percona-server-mongodb-operator:main
#  initContainerSecurityContext: {}
  allowUnsafeConfigurations: true
  updateStrategy: SmartUpdate

OK, thanks for the CR. We will check it tomorrow morning.

kantorcodes commented 5 months ago

Utilizing these combinations with TLS disabled, I am getting the following error.

{"t":{"$date":"2024-05-05T15:51:53.046Z"},"s":"F", "c":"CONTROL", "id":20574, "ctx":"-","msg":"Error during global initialization","attr":{"error":{"code":2,"codeName":"BadValue","errmsg":"need to enable TLS via the sslMode/tlsMode flag when using TLS configuration parameters"}}}

Happy to hop on a video call if you're willing to dissect this together further. Would that be helpful? Note: unsafeFlags and tls were added after this PR went up. Simply specifying allowUnsafeConfigurations worked previously; I suspect a smart update cascaded issues here.

spec:
#  platform: openshift
#  clusterServiceDNSSuffix: svc.cluster.local
  clusterServiceDNSMode: "External"
#  pause: true
#  unmanaged: false
  crVersion: 1.15.0
  image: percona/percona-server-mongodb:6.0.9-7
  imagePullPolicy: Always
  unsafeFlags:
    tls: true
  tls:
    allowInvalidCertificates: true
    mode: disabled
  #   enabled: false
#    # 90 days in hours
#    certValidityDuration: 2160h
#  imagePullSecrets:
#    - name: private-registry-credentials
#  initImage: perconalab/percona-server-mongodb-operator:main
#  initContainerSecurityContext: {}
  allowUnsafeConfigurations: true
  updateStrategy: SmartUpdate

OK, thanks for the CR. We will check it tomorrow morning.

Would you have any ideas for a stopgap solution in production? I believe a smart update could affect other servers running with a similar setup and bring them down.

hors commented 5 months ago

Happy to hop on a video call if you're willing to dissect this together further. Would that be helpful? Note: unsafeFlags and tls were added after this PR went up. Simply specifying allowUnsafeConfigurations worked previously; I suspect a smart update cascaded issues here.

spec:
#  platform: openshift
#  clusterServiceDNSSuffix: svc.cluster.local
  clusterServiceDNSMode: "External"
#  pause: true
#  unmanaged: false
  crVersion: 1.15.0
  image: percona/percona-server-mongodb:6.0.9-7
  imagePullPolicy: Always
  unsafeFlags:
    tls: true
  tls:
    allowInvalidCertificates: true
    mode: disabled
  #   enabled: false
#    # 90 days in hours
#    certValidityDuration: 2160h
#  imagePullSecrets:
#    - name: private-registry-credentials
#  initImage: perconalab/percona-server-mongodb-operator:main
#  initContainerSecurityContext: {}
  allowUnsafeConfigurations: true
  updateStrategy: SmartUpdate

OK, thanks for the CR. We will check it tomorrow morning.

Would you have any ideas for a stopgap solution in production? I believe a smart update could affect other servers running with a similar setup and bring them down.

Please do not use the main branch for production; it has not been tested by the QA team. We run all needed tests before a release. You only need to use officially released versions of our operators.

kantorcodes commented 5 months ago

Happy to hop on a video call if you're willing to dissect this together further. Would that be helpful? Note: unsafeFlags and tls were added after this PR went up. Simply specifying allowUnsafeConfigurations worked previously; I suspect a smart update cascaded issues here.

spec:
#  platform: openshift
#  clusterServiceDNSSuffix: svc.cluster.local
  clusterServiceDNSMode: "External"
#  pause: true
#  unmanaged: false
  crVersion: 1.15.0
  image: percona/percona-server-mongodb:6.0.9-7
  imagePullPolicy: Always
  unsafeFlags:
    tls: true
  tls:
    allowInvalidCertificates: true
    mode: disabled
  #   enabled: false
#    # 90 days in hours
#    certValidityDuration: 2160h
#  imagePullSecrets:
#    - name: private-registry-credentials
#  initImage: perconalab/percona-server-mongodb-operator:main
#  initContainerSecurityContext: {}
  allowUnsafeConfigurations: true
  updateStrategy: SmartUpdate

OK, thanks for the CR. We will check it tomorrow morning.

Would you have any ideas for a stopgap solution in production? I believe a smart update could affect other servers running with a similar setup and bring them down.

Please do not use the main branch for production; it has not been tested by the QA team. We run all needed tests before a release. You only need to use officially released versions of our operators.

Understood. However, despite switching off the main branch, the issue still persists, and it looks like it can be replicated on a fresh setup as well.

hors commented 5 months ago

Happy to hop on a video call if you're willing to dissect this together further. Would that be helpful? Note: unsafeFlags and tls were added after this PR went up. Simply specifying allowUnsafeConfigurations worked previously; I suspect a smart update cascaded issues here.


spec:
#  platform: openshift
#  clusterServiceDNSSuffix: svc.cluster.local
  clusterServiceDNSMode: "External"
#  pause: true
#  unmanaged: false
  crVersion: 1.15.0
  image: percona/percona-server-mongodb:6.0.9-7
  imagePullPolicy: Always
  unsafeFlags:
    tls: true
  tls:
    allowInvalidCertificates: true
    mode: disabled
  #   enabled: false
#    # 90 days in hours
#    certValidityDuration: 2160h
#  imagePullSecrets:
#    - name: private-registry-credentials
#  initImage: perconalab/percona-server-mongodb-operator:main
#  initContainerSecurityContext: {}
  allowUnsafeConfigurations: true
  updateStrategy: SmartUpdate

OK, thanks for the CR. We will check it tomorrow morning.

Would you have any ideas for a stopgap solution in production? I believe a smart update could affect other servers running with a similar setup and bring them down.

Please do not use the main branch for production; it has not been tested by the QA team. We run all needed tests before a release. You only need to use officially released versions of our operators.

Understood. However, despite switching off the main branch, the issue still persists, and it looks like it can be replicated on a fresh setup as well.

It was merged into the main branch only; it can't affect any versions that were released before. The v1.15.0 CRDs do not have the new options, and the old operator does not have the code that was added in main. Before the official release we will test these new options very carefully to make sure the new operator can work with the old CR version.
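To summarize the version split described above, a sketch of the two configuration styles (the unsafeFlags/tls fields are taken from the CR posted earlier in this thread and exist only in main, not in the v1.15.0 CRDs):

# v1.15.0 (released): disabling TLS is gated behind a single flag
spec:
  allowUnsafeConfigurations: true

# main (unreleased, after this PR): the equivalent intent is spelled out explicitly
spec:
  unsafeFlags:
    tls: true
  tls:
    mode: disabled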