Closed malayparida2000 closed 1 month ago
/hold
/cc @travisn
@parth-gr Can you also confirm if the external cluster script collects the v2 endpoint (port 3300) when detecting the mons? Commonly I would expect both v1 and v2 to be available, so we just want to start relying exclusively on the v2 port.
For external mode we wouldn't require msgr2 from the Ceph components; this only changes the clients to start using the msgr2 port. By default we will just be changing the port, and existing volumes can continue using the v1 port until the next time they are mounted. So I would expect no issues with this transition, but we should certainly pay extra attention during upgrade testing.
Do existing PVs/VSCs continue to work if ms_mode is changed from legacy to prefer_crc? Before 4.17 we (provider) weren't using this option, so it was at the default.
LGTM and I'd like a quick opinion from @Madhu-1, thanks.
Yes, existing volumes would continue to use the legacy v1 port without any issues. The mount option is used only during the initial volume mount, so any new volumes mounted after this change will use the prefer-crc mount option.
@parth-gr: changing LGTM is restricted to collaborators
> Yes, existing volumes would continue to use the legacy v1 port without any issues. The mount option is utilised only during volume mounting process in the beginning.

Does that mean the v2 port can still mount/remount a PVC that was created with the v1 port?
@malayparida2000 are we sure that RHCS doesn't support running only on v1 ports? If it's supported, we also need to support the same; if not, we are good. Can we get this confirmation?
Regarding support, I may have to check with the support team. But yes, on a RHCS cluster the msgrv2 port can be disabled if the customer wants, so technically a RHCS cluster can be v1-only.
> Yes, existing volumes would continue to use the legacy v1 port without any issues. The mount option is utilised only during volume mounting process in the beginning.
>
> Does that means v2 port can still mount/remount the PVC that is created by v1 port??
Yes, if remounting happens, the volume will then mount using the v2 port.
> @malayparida2000 are we sure that RHCS doesn't support running only on v1 ports? if its supported we also need to support the same, if not we are good, can we get this confirmation?
>
> By support, I have to check with the support team may be. But yes on a RHCS cluster msgrv2 port can be disabled if wanted by the customer, so technically a RHCS cluster can be of v1 port only.
we need to check on this one before making this default.
Just an update on the discussion here: Travis and I got in touch with Federico (Ceph PM) & Orit. Their opinion is that on RHCS clusters, since Nautilus (14.2), the msgrv2 port is available and in use by default, and there is no general reason for the v2 port to be turned off or unavailable on a RHCS cluster. So we should be safe in making the v2 port the default for ODF. We are also in touch with Illya about this and will soon make a decision.
Agreed, let's go ahead with this requirement to always use the v2 port for downstream, including for external clusters. We just need to ensure the mon endpoints that are imported also set v2 ports. I believe this is already something that Rook enforces during the cluster reconcile. @parth-gr Could you test to confirm that if `RequireMsgr2` is set in the CephCluster CR for the external cluster scenario, we replace any v1 ports with v2 ports as part of the Rook cluster reconcile?
For the external cluster, we check if the leader endpoint provided by the Python script has v2; if it contains a v2 port, Rook internally sets `RequireMsgr2: true` (https://github.com/rook/rook/blob/58a0c834bf5bce6782de5c387f3dac732393ef33/pkg/operator/ceph/cluster/cluster_external.go#L119). So we should first default the collection of the v2 port in the Python script, and we may also set the CR spec `RequireMsgr2` to `true`.
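As a rough sketch of the enforcement being discussed (hypothetical helper name and ports per Ceph convention; this is not the actual Rook reconcile code), rewriting imported external mon endpoints onto the msgr2 port could look like:

```go
package main

import (
	"fmt"
	"net"
)

// Default Ceph messenger ports: 6789 for msgr1 (legacy), 3300 for msgr2.
const (
	msgr1Port = "6789"
	msgr2Port = "3300"
)

// forceMsgr2 rewrites a host:port mon endpoint to the msgr2 port (3300)
// if it currently uses the legacy msgr1 port (6789). Hypothetical helper,
// not the actual Rook code.
func forceMsgr2(endpoint string) (string, error) {
	host, port, err := net.SplitHostPort(endpoint)
	if err != nil {
		return "", err
	}
	if port == msgr1Port {
		port = msgr2Port
	}
	return net.JoinHostPort(host, port), nil
}

func main() {
	for _, ep := range []string{"10.0.0.1:6789", "10.0.0.2:3300"} {
		out, err := forceMsgr2(ep)
		if err != nil {
			panic(err)
		}
		fmt.Println(out) // both print with port 3300
	}
}
```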
@parth-gr My question is a bit different; here are a few points to consider:
- Upstream we don't want to default to v2, only for downstream.
- OCS operator (with this PR) is planning to set `RequireMsgr2: true` for external mode (it was already set for internal mode), so there is no need for Rook to automatically set `requireMsgr2`.
- The code you linked is perfect for upstream, but I think we need to add another check such that if `RequireMsgr2: true`, we force the v2 port for the external mon endpoints.
To add, we also might need to check:
For downstream we can use the `RequireMsgr2: true` info, but the problem is with extracting the data: how do we make sure the downstream Python script always scrapes the v2 port?
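For reference, the selection logic in question could be sketched like this (in Go rather than the script's Python, with a hypothetical helper name): given a mon address vector in the form printed by `ceph mon dump`, prefer the v2 address when present:

```go
package main

import (
	"fmt"
	"strings"
)

// pickV2Addr extracts the msgr2 address from a Ceph mon address vector such
// as "[v2:10.0.0.1:3300/0,v1:10.0.0.1:6789/0]". This is only a sketch of
// what the downstream script's endpoint selection could do, not its actual
// code; a missing v2 entry would indicate an msgr1-only cluster.
func pickV2Addr(addrvec string) (string, bool) {
	trimmed := strings.Trim(addrvec, "[]")
	for _, part := range strings.Split(trimmed, ",") {
		if strings.HasPrefix(part, "v2:") {
			addr := strings.TrimPrefix(part, "v2:")
			// Drop the trailing "/nonce" suffix.
			if i := strings.LastIndex(addr, "/"); i >= 0 {
				addr = addr[:i]
			}
			return addr, true
		}
	}
	return "", false
}

func main() {
	addr, ok := pickV2Addr("[v2:10.0.0.1:3300/0,v1:10.0.0.1:6789/0]")
	fmt.Println(addr, ok) // prints "10.0.0.1:3300 true"
}
```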
To capture our discussion:
- We will make the change in Rook to enforce the v2 port for external clusters if `RequireMsgr2` is true.
- The external Python script will be changed in the downstream-only branch to only use the v2 port.
From Slack, by Parth:
Of the two changes we discussed, I think we don't need the Rook upstream change, as we always check `cluster.Spec.RequireMsgr2()` (https://github.com/rook/rook/blob/master/pkg/operator/ceph/cluster/cluster_external.go#L126). If it is set from the CR, it will still work with the same check.
@parth-gr this should mean this PR is now no longer blocked from the external mode perspective, right?
@malayparida2000 can we do a quick end-to-end test? And after this change, it surely won't block https://github.com/red-hat-storage/rook/pull/748
Sounds good to do a test, will do it.
Currently there is an infra issue with the availability of external mode setups from QE. In any case, the scenario we are concerned about is the behavior of the cluster upon upgrade from 4.17 to 4.18, which will be easier to test once this change is officially part of the build. I also have to document the whole process, so I will do the e2e testing of the flow after the PR is merged and part of the build.
/unhold
/lgtm
@parth-gr: changing LGTM is restricted to collaborators
The PR now has LGTM from both the internal mode & external mode perspectives. @nb-ohad / @leelavg Can you please ack LGTM from the provider-client mode perspective so we can finally merge this?
LGTM; since Ohad has an unresolved comment, we need to wait for that to be resolved.
Ohad will be on PTO till the 21st of Oct. This is an important change, and with further delay to its merge we are losing precious time we might need for follow-up changes based on its results. This PR also blocks other stories, thereby blocking the Epic's progress. As we have acks from all PoVs, including provider mode from Leela, we should go ahead and merge this. /cc @travisn
/lgtm /approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: malayparida2000, parth-gr, travisn
The full list of commands accepted by this bot can be found here.
The pull request process is described here
/retest
/retest
/retest
/test ocs-operator-bundle-e2e-aws
/retest
/override ci/prow/ocs-operator-bundle-e2e-aws
@iamniting: Overrode contexts on behalf of iamniting: ci/prow/ocs-operator-bundle-e2e-aws
@malayparida2000 We missed rebasing it against main and running `make deps-update`; there are changes in metrics/vendor/ from this PR in my local checkout.
I didn't understand; how does this PR affect the vendor folder?
Initially in 4.13, when encryption in-transit was implemented, we started to use msgrv2 for internal mode only; external mode & provider mode were excluded.
For external mode, the concern was whether the msgrv2 port would always be available on the RHCS cluster. But we have now reached Ceph v18, and msgrv2 has been on by default since Ceph Nautilus (14.2), so we can now use msgr2 for external mode safely.
For provider mode, the concern was the availability of the required kernel versions on old client clusters. That concern also no longer seems valid, so we can now use msgrv2 for provider mode as well.
Also, as we would always use msgrv2, we can now simplify the GetCephFSKernelMountOptions function to remove the conditional checks: if encryption is enabled use "secure", otherwise "prefer-crc".
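A minimal sketch of that simplification (assuming a boolean encryption flag; the real ocs-operator function takes different parameters and may differ in detail):

```go
package main

import "fmt"

// getCephFSKernelMountOptions returns the kernel CephFS ms_mode mount option
// once msgrv2 is always in use: "secure" when encryption in-transit is
// enabled, "prefer-crc" otherwise. Sketch only; the actual ocs-operator
// function signature differs.
func getCephFSKernelMountOptions(encryptionEnabled bool) string {
	if encryptionEnabled {
		return "ms_mode=secure"
	}
	return "ms_mode=prefer-crc"
}

func main() {
	fmt.Println(getCephFSKernelMountOptions(true))  // ms_mode=secure
	fmt.Println(getCephFSKernelMountOptions(false)) // ms_mode=prefer-crc
}
```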