Open mohitrajain opened 2 months ago
@mohitrajain Since the issue is fixed in the latest version, as you mentioned, why not upgrade to the fixed version?
@subhamkrai We need to evaluate the new version in our staging cluster. However, before applying changes to the production cluster, having this feature would be beneficial. It would enable us to preserve manual adjustments on specific pools.
@subhamkrai It is indeed nice that this has been fixed. One open question is how many Rook releases one can jump at a time.
But as to why we ask for this setting -- this particular situation may be addressed, but what about the ones we haven't hit yet? Especially if future Rook releases grow additional actions on existing pools or other resources.
Why had we done manual changes? The .mgr pool gets created automatically without a device class specification, so when data pools are created constrained to a device class, the pg autoscaler can't cope because of the shadow CRUSH root. Mind you, we disable the autoscaler anyway, or at least set it to warn. So we had manually changed the CRUSH rule for the .mgr pool to a different rule that specified a device class -- at the time, it was my understanding that Rook never modified resources already created. So this new action on Rook's part is a change from prior behavior, one that has the potential to be impactful.
OBTW, Mohit is my protege; we work together.
@anthonyeleven So you're hitting this issue in v1.12.8, and you would prefer to get the fix in a v1.12.x release before you upgrade to v1.13? We could likely backport #13772 to v1.12 and make that happen.
No, not asking for a backport. The 1.13.x fix should suffice if we can skip 1.12.8. The cluster where this matters for us currently runs Rook v1.10.5. Can we go straight to 1.13.x from this release?
The RFE to completely disable modifying existing pools is intended to forestall any future rude awakenings when new functionality is added.
In the upgrade guide, it is recommended not to skip minor Rook releases. It might work, but I'd recommend testing the upgrade yourself. Our testing does not cover skipping minor releases. Instead, you might consider: v1.10.5 --> v1.11.x --> v1.12.7 --> v1.13.x
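For anyone scripting the stepwise path above, here is a minimal Python sketch that enumerates one hop per minor release. The function names and the pins mapping are hypothetical and not part of Rook. It assumes versions are written as vMAJOR.MINOR.PATCH, with "x" allowed as a patch placeholder.

```python
from __future__ import annotations

import re


def parse_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from strings such as 'v1.10.5' or 'v1.13.x'."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)(?:\.\w+)?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def stepwise_path(current: str, target: str,
                  pins: dict[str, str] | None = None) -> list[str]:
    """List every minor release between current and target, one hop each.

    pins maps a minor series (e.g. 'v1.12') to the exact release to use
    for that hop; unpinned hops are written as '<series>.x'.
    """
    pins = pins or {}
    cur_major, cur_minor = parse_minor(current)
    tgt_major, tgt_minor = parse_minor(target)
    if cur_major != tgt_major:
        raise ValueError("this sketch only handles upgrades within one major version")
    if tgt_minor < cur_minor:
        raise ValueError("target is older than the current version")

    path = [current]
    for minor in range(cur_minor + 1, tgt_minor + 1):
        series = f"v{cur_major}.{minor}"
        path.append(pins.get(series, f"{series}.x"))
    return path


if __name__ == "__main__":
    # Pin the v1.12 hop to v1.12.7, as suggested in the comment above.
    print(" --> ".join(stepwise_path("v1.10.5", "v1.13.x",
                                     pins={"v1.12": "v1.12.7"})))
```

This only lists the hops. Each one still needs the checks from the upgrade guide for that release before moving on.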
Thanks!
We're testing upgrades in a small lab cluster, but we can only align that so closely with prod. v1.10.5 --> v1.11.x --> v1.12.7 --> v1.13.x should be feasible for us.
I didn't want to make assumptions about the .. releases
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.
Is this a bug report or feature request?
What should the feature do: In v1.12.8, a feature was released to update the pool deviceClass: "pool: allow updating deviceClass on existing pool."
This feature request proposes an additional flag that lets administrators choose whether Rook applies the changes proposed by that feature to existing pools. Optionally, Rook could log the recommended state changes for pools upon each evaluation, providing insight into the adjustments it would make. Our clusters are dynamic, with deviceClasses added at later stages. Since the aforementioned feature was not available previously, manual adjustments were made. We aim to avoid disrupting the production cluster by preventing re-mapping of objects due to new CRUSH rules.
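To make the requested behavior concrete, here is a minimal, purely illustrative Python sketch of the decision logic. Rook is written in Go and nothing below exists in Rook. PoolState, reconcile_crush_rule, the allow_updates flag and the rule name mgr_ssd are all hypothetical names chosen for this example.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("pool-reconcile-sketch")


@dataclass
class PoolState:
    """Hypothetical snapshot of an existing pool."""
    name: str
    crush_rule: str


def reconcile_crush_rule(pool: PoolState, desired_rule: str,
                         allow_updates: bool) -> str:
    """Return the CRUSH rule the pool should have after this evaluation.

    With allow_updates=False (the proposed opt-out), the recommended
    change is only logged and the existing rule is left untouched.
    """
    if pool.crush_rule == desired_rule:
        return pool.crush_rule  # nothing to do, nothing to log

    if not allow_updates:
        logger.warning(
            "pool %r: recommended CRUSH rule change %r -> %r was NOT applied "
            "(updates to existing pools are disabled)",
            pool.name, pool.crush_rule, desired_rule,
        )
        return pool.crush_rule

    logger.info("pool %r: changing CRUSH rule %r -> %r",
                pool.name, pool.crush_rule, desired_rule)
    return desired_rule


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    mgr = PoolState(name=".mgr", crush_rule="mgr_ssd")  # manually adjusted rule
    print(reconcile_crush_rule(mgr, "replicated_rule", allow_updates=False))
```

With the flag off, the manually assigned rule survives and the log records what would have changed, so the decision to re-map data stays with the administrator.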
What is the use case behind this feature: Our staging Rook cluster had undergone manual pool modifications in the past. After updating it to v1.12.8, we encountered some unexpected changes:
The message described above was observed for all the pools in our cluster. Initially, we had omitted specifying the device class for some pools, which subsequently were affected, as indicated by the logs.
The above crush rule changes led to data being re-mapped across the cluster. It's noteworthy that this issue has been addressed in version 1.13.5 with the implementation of "pool: Skip crush rule update when not needed."
We aim to mitigate any unintended effects on existing pools without a comprehensive understanding of the implications on crush rules. Implementing logging and a toggle flag would facilitate more informed decisions.
Environment: Mixed device class environment.