red-hat-storage / ocs-operator

Operator for RHOCS
Apache License 2.0

Use msgrv2 always for all ODF modes (internal, external, provider) #2793

Closed · malayparida2000 closed this 1 month ago

malayparida2000 commented 2 months ago

Initially, in 4.13, when encryption in-transit was implemented, we started to use msgrv2 for internal mode only. External mode and provider mode were excluded.

For external mode, the concern was whether the msgrv2 port would always be available on the RHCS cluster. But we have now reached Ceph v18, and msgrv2 has been on by default since Ceph Nautilus (14.2), so we can now use msgrv2 for external mode safely.

For provider mode, the concern was the availability of the required kernel versions on old client clusters. But that concern also no longer seems valid, so we can now use msgrv2 for provider mode as well.

Also, as we would always use msgrv2, we can now simplify the GetCephFSKernelMountOptions function to remove the conditional checks: if encryption is enabled, use "secure"; otherwise, use "prefer-crc".
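
A minimal sketch of the simplified helper, assuming it takes only the encryption-in-transit setting and returns the CephFS kernel ms_mode mount option; the exact signature and return format in ocs-operator may differ:

```go
// Sketch only: with msgrv2 always required, the per-mode conditional
// checks go away and the result depends solely on encryption.
func GetCephFSKernelMountOptions(encryptionInTransitEnabled bool) string {
	if encryptionInTransitEnabled {
		return "ms_mode=secure"
	}
	return "ms_mode=prefer-crc"
}
```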

malayparida2000 commented 2 months ago

/hold

malayparida2000 commented 2 months ago

/cc @travisn

travisn commented 2 months ago

@parth-gr Can you also confirm whether the external cluster script collects the v2 endpoint (port 3300) when detecting the mons? Commonly I would expect both v1 and v2 to be available; we just want to start relying exclusively on the v2 port.

travisn commented 2 months ago

For external mode we wouldn't be requiring msgr2 from the Ceph components; it will only change the clients to start using the msgr2 port. By default we will just be changing the port, and existing volumes can continue using the v1 port until the next time they are mounted. So I would expect no issues with this transition, but we should certainly pay extra attention during upgrade testing.

malayparida2000 commented 2 months ago

> Do the existing PVs/VSCs continue to work if ms_mode is changed to prefer_crc from legacy? Before 4.17 we (provider) weren't using this option, so it was the default.

> LGTM, and I'd like a quick opinion from @Madhu-1, thanks.

Yes, existing volumes would continue to use the legacy v1 port without any issues. The mount option is used only during the initial volume mount. So any new volumes that are mounted after this change will use the prefer-crc mount option.

openshift-ci[bot] commented 2 months ago

@parth-gr: changing LGTM is restricted to collaborators

In response to [this](https://github.com/red-hat-storage/ocs-operator/pull/2793#pullrequestreview-2299266965):

> LGTM, and I'd like a quick opinion from @Madhu-1, thanks.

parth-gr commented 2 months ago

> Yes, existing volumes would continue to use the legacy v1 port without any issues. The mount option is used only during the initial volume mount.

Does that mean the v2 port can still mount/remount a PVC that was created via the v1 port?

Madhu-1 commented 2 months ago

@malayparida2000 Are we sure that RHCS doesn't support running on v1 ports only? If that's supported, we also need to support the same; if not, we are good. Can we get this confirmation?

malayparida2000 commented 2 months ago

> @malayparida2000 Are we sure that RHCS doesn't support running on v1 ports only? If that's supported, we also need to support the same; if not, we are good. Can we get this confirmation?

As for support, I may have to check with the support team. But yes, on a RHCS cluster the msgrv2 port can be disabled if the customer wants, so technically a RHCS cluster can be v1-port only.

malayparida2000 commented 2 months ago

> Yes, existing volumes would continue to use the legacy v1 port without any issues. The mount option is used only during the initial volume mount.
>
> Does that mean the v2 port can still mount/remount a PVC that was created via the v1 port?

Yes, if remounting happens, the volume will then mount using the v2 port.

Madhu-1 commented 2 months ago

> @malayparida2000 Are we sure that RHCS doesn't support running on v1 ports only? If that's supported, we also need to support the same; if not, we are good. Can we get this confirmation?
>
> As for support, I may have to check with the support team. But yes, on a RHCS cluster the msgrv2 port can be disabled if the customer wants, so technically a RHCS cluster can be v1-port only.

We need to check on this one before making this the default.

malayparida2000 commented 1 month ago

Just an update on the discussion here: Travis and I got in touch with Federico (Ceph PM) and Orit. Their opinion is that on RHCS clusters the msgrv2 port has been available and in use by default since Nautilus (14.2). There is no reasonable general reason for the v2 port to be turned off or unavailable on a RHCS cluster, so we should be safe making the v2 port the default for ODF. We are also in touch with Illya about this and will make a decision soon.

travisn commented 1 month ago

> Just an update on the discussion here: Travis and I got in touch with Federico (Ceph PM) and Orit. Their opinion is that on RHCS clusters the msgrv2 port has been available and in use by default since Nautilus (14.2). There is no reasonable general reason for the v2 port to be turned off or unavailable on a RHCS cluster, so we should be safe making the v2 port the default for ODF. We are also in touch with Illya about this and will make a decision soon.

Agreed, let's go ahead with this requirement to always use the v2 port for downstream, including for external clusters.

We just need to ensure the mon endpoints that are imported are also set to v2 ports. I believe this is already something that Rook enforces during the cluster reconcile. @parth-gr Could you test to confirm that if RequireMsgr2 is set in the cephcluster CR for the external cluster scenario, we replace any v1 ports with v2 ports as part of the Rook cluster reconcile?

parth-gr commented 1 month ago

> Agreed, let's go ahead with this requirement to always use the v2 port for downstream, including for external clusters. We just need to ensure the mon endpoints that are imported are also set to v2 ports. I believe this is already something that Rook enforces during the cluster reconcile. @parth-gr Could you test to confirm that if RequireMsgr2 is set in the cephcluster CR for the external cluster scenario, we replace any v1 ports with v2 ports as part of the Rook cluster reconcile?

For the external cluster, we check whether the leader endpoint provided by the Python script has v2; if it contains a v2 port, Rook internally sets RequireMsgr2: true: https://github.com/rook/rook/blob/58a0c834bf5bce6782de5c387f3dac732393ef33/pkg/operator/ceph/cluster/cluster_external.go#L119

So we should first make the Python script collect the v2 port by default, and we may also set RequireMsgr2: true in the CR spec.
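
A rough Go paraphrase of the Rook check linked above (not the verbatim upstream code; a hedged sketch, assuming the mon endpoint is a host:port string and that the flag corresponds to Rook's network.connections.requireMsgr2 field):

```go
import "net"

const msgr2Port = "3300"

// Sketch: if the imported mon endpoint already carries the msgr2 port,
// flip the RequireMsgr2 flag so the rest of the reconcile treats the
// external cluster as msgr2-capable.
func detectMsgr2FromEndpoint(endpoint string, requireMsgr2 *bool) {
	_, port, err := net.SplitHostPort(endpoint)
	if err != nil {
		return // endpoint is not host:port shaped; leave the flag unchanged
	}
	if port == msgr2Port {
		*requireMsgr2 = true
	}
}
```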

travisn commented 1 month ago

> Agreed, let's go ahead with this requirement to always use the v2 port for downstream, including for external clusters. We just need to ensure the mon endpoints that are imported are also set to v2 ports. I believe this is already something that Rook enforces during the cluster reconcile. @parth-gr Could you test to confirm that if RequireMsgr2 is set in the cephcluster CR for the external cluster scenario, we replace any v1 ports with v2 ports as part of the Rook cluster reconcile?
>
> For the external cluster, we check whether the leader endpoint provided by the Python script has v2; if it contains a v2 port, Rook internally sets RequireMsgr2: true: https://github.com/rook/rook/blob/58a0c834bf5bce6782de5c387f3dac732393ef33/pkg/operator/ceph/cluster/cluster_external.go#L119
>
> So we should first make the Python script collect the v2 port by default, and we may also set RequireMsgr2: true in the CR spec.

@parth-gr My question is a bit different; here are a few points to consider:

  1. Upstream we don't want to default to v2, only for downstream.
  2. OCS operator (with this PR) is planning to set RequireMsgr2: true for external mode (it was already set for internal mode), so there is no need for Rook to automatically set requireMsgr2.
  3. The code you linked is perfect for upstream, but I think we need to add another check such that if RequireMsgr2: true, we force the v2 port for the external mon endpoints (see the sketch after this list).
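
A minimal sketch of the extra check in point 3 (hypothetical helper name; the real change would live in Rook's external-cluster reconcile and run only when RequireMsgr2 is true):

```go
import "net"

// Sketch: rewrite a v1 mon endpoint (port 6789) to the v2 port (3300),
// leaving anything else untouched.
func forceMsgr2Port(endpoint string) string {
	host, port, err := net.SplitHostPort(endpoint)
	if err != nil || port != "6789" {
		return endpoint // not host:port shaped, or already not a v1 port
	}
	return net.JoinHostPort(host, "3300")
}
```

Applied to each imported external mon endpoint, this would force the v2 port downstream while leaving the upstream default behavior (point 1) untouched.
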
malayparida2000 commented 1 month ago

> Agreed, let's go ahead with this requirement to always use the v2 port for downstream, including for external clusters. We just need to ensure the mon endpoints that are imported are also set to v2 ports. I believe this is already something that Rook enforces during the cluster reconcile. @parth-gr Could you test to confirm that if RequireMsgr2 is set in the cephcluster CR for the external cluster scenario, we replace any v1 ports with v2 ports as part of the Rook cluster reconcile?
>
> For the external cluster, we check whether the leader endpoint provided by the Python script has v2; if it contains a v2 port, Rook internally sets RequireMsgr2: true: https://github.com/rook/rook/blob/58a0c834bf5bce6782de5c387f3dac732393ef33/pkg/operator/ceph/cluster/cluster_external.go#L119 So we should first make the Python script collect the v2 port by default, and we may also set RequireMsgr2: true in the CR spec.
>
> @parth-gr My question is a bit different; here are a few points to consider:
>
> 1. Upstream we don't want to default to v2, only for downstream.
> 2. OCS operator (with this PR) is planning to set RequireMsgr2: true for external mode (it was already set for internal mode), so there is no need for Rook to automatically set requireMsgr2.
> 3. The code you linked is perfect for upstream, but I think we need to add another check such that if RequireMsgr2: true, we force the v2 port for the external mon endpoints.

To add, we also might need to check:

  1. For upgraded clusters, when requireMsgr2 is true, we have to make sure the rook-ceph-csi-config CM changes the mon ports to v2 ports, like it does for internal mode (see the sketch below).
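
A sketch of what that upgrade concern amounts to, reusing the hypothetical forceMsgr2Port helper sketched earlier (illustrative data only; the real monitor list lives as JSON inside the rook-ceph-csi-config ConfigMap managed by Rook):

```go
// Sketch: on upgrade with requireMsgr2 enabled, every monitor address
// handed to CSI must move from the v1 port to the v2 port.
func msgr2Monitors(monitors []string) []string {
	out := make([]string, len(monitors))
	for i, m := range monitors {
		out[i] = forceMsgr2Port(m)
	}
	return out
}

// e.g. msgr2Monitors([]string{"10.0.0.1:6789"}) -> ["10.0.0.1:3300"]
```
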
parth-gr commented 1 month ago

> @parth-gr My question is a bit different; here are a few points to consider:
>
> 1. Upstream we don't want to default to v2, only for downstream.
> 2. OCS operator (with this PR) is planning to set RequireMsgr2: true for external mode (it was already set for internal mode), so there is no need for Rook to automatically set requireMsgr2.
> 3. The code you linked is perfect for upstream, but I think we need to add another check such that if RequireMsgr2: true, we force the v2 port for the external mon endpoints.

For downstream we can use the RequireMsgr2: true info, but the problem is with extracting the data: how do we make sure the downstream Python script always scrapes the v2 port?

travisn commented 1 month ago

> For downstream we can use the RequireMsgr2: true info, but the problem is with extracting the data: how do we make sure the downstream Python script always scrapes the v2 port?

To capture our discussion:

  • We will make the change in Rook to enforce the v2 port for external clusters if requireMsgr2 is true.
  • The external Python script will be changed in the downstream-only branch to only use the v2 port.

malayparida2000 commented 1 month ago

> For downstream we can use the RequireMsgr2: true info, but the problem is with extracting the data: how do we make sure the downstream Python script always scrapes the v2 port?
>
> To capture our discussion:
>
> • We will make the change in Rook to enforce the v2 port for external clusters if requireMsgr2 is true.
> • The external Python script will be changed in the downstream-only branch to only use the v2 port.

From Slack, by Parth:

> From the two changes we discussed, I think we don't need the Rook upstream change, as we always check cluster.Spec.RequireMsgr2(): https://github.com/rook/rook/blob/master/pkg/operator/ceph/cluster/cluster_external.go#L126 If it is set from the CR, it will still work with the same check.
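
In other words, a minimal sketch of the gate Parth points at (cluster.Spec.RequireMsgr2() is the Rook spec helper from the linked line; the body is an assumption about what the enforcement would do):

```go
// Sketch: the same gate fires whether RequireMsgr2 was set directly in
// the CephCluster CR or derived from a v2 endpoint, so no separate
// upstream change is needed.
if cluster.Spec.RequireMsgr2() {
	// force v2 mon ports for the imported external endpoints
}
```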

@parth-gr this should mean this PR is now no longer blocked from the external mode perspective, right?

parth-gr commented 1 month ago

@malayparida2000 Can we do some quick testing? (End-to-end.) And after this change, it for sure won't block https://github.com/red-hat-storage/rook/pull/748.

malayparida2000 commented 1 month ago

> @malayparida2000 Can we do some quick testing? (End-to-end.) And after this change, it for sure won't block red-hat-storage/rook#748.

Sounds good to do a test; will do it.

malayparida2000 commented 1 month ago

> @malayparida2000 Can we do some quick testing? (End-to-end.) And after this change, it for sure won't block red-hat-storage/rook#748.
>
> Sounds good to do a test; will do it.

Currently there is an infra issue with the availability of external mode setups from QE, and anyway, the scenario we are concerned about is the behavior of the cluster upon upgrade from 4.17 to 4.18, which will be easier to test once this change is officially part of the build. I also have to document the whole process, so I will do the e2e testing of the flow after the PR is merged and part of the build.

malayparida2000 commented 1 month ago

/unhold

parth-gr commented 1 month ago

/lgtm

openshift-ci[bot] commented 1 month ago

@parth-gr: changing LGTM is restricted to collaborators

In response to [this](https://github.com/red-hat-storage/ocs-operator/pull/2793#issuecomment-2410463497):

> /lgtm

malayparida2000 commented 1 month ago

The PR now has LGTM from both the internal mode and external mode perspectives. @nb-ohad / @leelavg Can you please ack LGTM from the provider-client mode perspective so we can finally merge this?

leelavg commented 1 month ago

LGTM. Since Ohad has an unresolved comment, we need to wait for that to be resolved.

malayparida2000 commented 1 month ago

Ohad will be on PTO till the 21st of Oct. This is an important change, and with further delay to its merge we are losing precious time we might need for follow-up changes based on its results. This PR also blocks other stories, thereby blocking the Epic's progress. As we have acks from all PoVs, including provider mode from Leela, we should go ahead and merge this. /cc @travisn

travisn commented 1 month ago

/lgtm /approve

openshift-ci[bot] commented 1 month ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: malayparida2000, parth-gr, travisn

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

  • ~~[OWNERS](https://github.com/red-hat-storage/ocs-operator/blob/main/OWNERS)~~ [travisn]

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.

malayparida2000 commented 1 month ago

/retest

malayparida2000 commented 1 month ago

/retest

malayparida2000 commented 1 month ago

/retest

malayparida2000 commented 1 month ago

/test ocs-operator-bundle-e2e-aws

malayparida2000 commented 1 month ago

/retest

iamniting commented 1 month ago

/override ci/prow/ocs-operator-bundle-e2e-aws

openshift-ci[bot] commented 1 month ago

@iamniting: Overrode contexts on behalf of iamniting: ci/prow/ocs-operator-bundle-e2e-aws

In response to [this](https://github.com/red-hat-storage/ocs-operator/pull/2793#issuecomment-2419088172):

> /override ci/prow/ocs-operator-bundle-e2e-aws

rewantsoni commented 1 month ago

@malayparida2000 We missed rebasing it against main and running make deps-update; there are changes in metrics/vendor/ from this PR in my local checkout.

malayparida2000 commented 1 month ago

> @malayparida2000 We missed rebasing it against main and running make deps-update; there are changes in metrics/vendor/ from this PR in my local checkout.

I didn't understand. How does this PR affect the vendor folder?