openservicemesh / osm

Open Service Mesh (OSM) is a lightweight, extensible, cloud native service mesh that allows users to uniformly manage, secure, and get out-of-the-box observability features for highly dynamic microservice environments.
https://openservicemesh.io/
Apache License 2.0
2.59k stars 277 forks source link

Add more robust CRD conversion patching #5300

Closed keithmattix closed 1 year ago

keithmattix commented 1 year ago

Change the reoncile label to refer to the version of OSM that is in charge of reconciling the CRDs. This ensures that on upgrade, the new version is the only version managing the CRDs

Description: Fixes the upgrade bug from pre-v1.x.3 patches

Testing done: E2E tests

Affected area: Functional Area
Control Plane [x]
Install [x]
Upgrade [x]

Please answer the following questions with yes/no.

  1. Does this change contain code from or inspired by another project? no

    • Did you notify the maintainers and provide attribution? N/A
  2. Is this a breaking change? no

  3. Has documentation corresponding to this change been updated in the osm-docs repo (if applicable)? N/A

codecov-commenter commented 1 year ago

Codecov Report

Merging #5300 (4e6da88) into main (dc3f841) will decrease coverage by 0.02%. The diff coverage is 0.00%.

:exclamation: Current head 4e6da88 differs from pull request most recent head b3b0f9a. Consider uploading reports for the commit b3b0f9a to get more accurate results

:mega: This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

@@            Coverage Diff             @@
##             main    #5300      +/-   ##
==========================================
- Coverage   69.53%   69.52%   -0.02%     
==========================================
  Files         197      197              
  Lines       16070    16071       +1     
==========================================
- Hits        11174    11173       -1     
- Misses       4839     4841       +2     
  Partials       57       57              
Flag Coverage Δ
unittests 69.52% <0.00%> (-0.02%) :arrow_down:

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
cmd/cli/version.go 43.96% <0.00%> (ø)
cmd/osm-bootstrap/osm-bootstrap.go 45.42% <0.00%> (ø)
pkg/reconciler/client.go 0.00% <0.00%> (ø)

... and 1 file with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

steeling commented 1 year ago

Can we also set rollout strategy on the bootstrap deployment to replace?

keithmattix commented 1 year ago

Can we also set rollout strategy on the bootstrap deployment to replace?

I'm not a huge fan of that change honestly. Replace effectively guarantees downtime in the gap between the old replicaset going down and the new one becoming Ready

steeling commented 1 year ago

Can we also set rollout strategy on the bootstrap deployment to replace?

I'm not a huge fan of that change honestly. Replace effectively guarantees downtime in the gap between the old replicaset going down and the new one becoming Ready

The only thing the bootstrap pod does after it starts up (after it creates/updates CRD's on startup), is the reconciler. The bootstrap pod should really be a job anyways. IMO. Not blocking on this though

Technically it has a webhook for:

  1. metrics
  2. version reporting
  3. health checks
keithmattix commented 1 year ago

@steeling it's those webhooks I'm worried about, especially during upgrades. Any client or operator of an OSM resource that's using an older CRD version will end up triggering a conversion request to OSM bootstrap which, for a time, wouldn't have any replicas able to serve traffic

steeling commented 1 year ago

@steeling it's those webhooks I'm worried about, especially during upgrades. Any client or operator of an OSM resource that's using an older CRD version will end up triggering a conversion request to OSM bootstrap which, for a time, wouldn't have any replicas able to serve traffic

I'm not sure a couple seconds of downtime on a small subset of operations is necessarily a bad thing during an upgrade, given the issues we've had here. Adding further complexity to the code, and banking on allowing an old reconciler to flip/flop back to prior states is undesirable IMO

keithmattix commented 1 year ago

I'm not sure a couple seconds of downtime on a small subset of operations is necessarily a bad thing during an upgrade, given the issues we've had here. Adding further complexity to the code, and banking on allowing an old reconciler to flip/flop back to prior states is undesirable IMO

If anyone builds on top of OSM, we've now introduced errors/downtime in that upgrade path. Downtime during upgrade can be ok if all parties are aware and plan for it. To me, making that kind of change in a patch release is unexpected and error prone

jaellio commented 1 year ago

Hey @keithmattix, could you update the PR description to describe the osm version label approach? Thanks!

keithmattix commented 1 year ago

Good catch @jaellio! Changed it