openservicemesh / osm

Open Service Mesh (OSM) is a lightweight, extensible, cloud native service mesh that allows users to uniformly manage, secure, and get out-of-the-box observability features for highly dynamic microservice environments.
https://openservicemesh.io/
Apache License 2.0
2.59k stars 277 forks source link

osm-controller should set initial status/conditions #5045

Closed keithmattix closed 2 years ago

keithmattix commented 2 years ago

Based on Step 2 in the design doc, osm-controller should monitor the MRCAdded event and initiate the status fields (including conditions). Furthermore, osm-controller should confirm the connection to the issuer and update the Accepted condition accordingly (error or success).

Be aware that conflicts may occur when multiple osm-controller replicas attempt to update status at the same time; consider using RetryOnConflict to query k8s for the latest version of the resource, add/update the status if it hasn't been done already (return early if it has), and retry with backoff if a conflict error occurs.

Refer to #4848 for examples of where (handleMRCEvent in manager.go) and how to respond to MRC events. Note that #4848 is based on an old version of MRC statuses and state changes, but the general approach is still applicable.

An MRCAdded event will be received in the following scenarios:

  1. On osm install, an MRC resource is created by OSM. The status fields have not been set. When a CRD is created, the status fields cannot be set since the status is a subresource of the CRD. If there is only 1 MRC and the intent is spec.intent=active
    • Attempt to set the following. If a conflict occurs, check if there is an error on status.state or if it is already set. Retry if not set and no error
      • status.state should be set to Pending
      • status.certificateStatuses should all be set to Unknown
      • status.conditions Accepted should be set to status=False and reason=Pending
    • (OUT OF SCOPE FOR THIS ISSUE) Once issuer has been successfully created, attempt to update the following and retry similar to above
      • status.state should be set to Issuing
      • status.certificateStatuses should all be set to Issuing
      • status.conditions Accepted should be set to status=True and reason=CertificateAccepted
      • status.conditions IssuingRollout should be set to status=True and reason=CertificateInUseForIssuing
      • status.conditions ValidatingRollout should be set to status=True and reason=CertificateInUseForValidating
      • status.conditions Ready should be set to status=True and reason=RotationComplete
  2. On a control plane component restart, the component will receive an MRCAdded event. From the components perspective, the informer cache for the MRC was just populated and the MRC was just added to the cache. The status fields have already been set on the MRC. If a root certificate rotation is in progress, the component will receive MRCAdded events for each MRC. - No action is required if the status field is populated on the MRC resource
    • If there is only one MRC and the status field is not set follow the steps outlined in scenario 1
    • If there are 2 MRCs and the MRCAdded event is for the MRC with spec.intent=passive,
      • status.state should be set to Pending
      • status.certificateStatuses should all be set to Unknown
      • status.conditions Accepted should be set to status=False and reason=Pending
        1. To initiate root certificate rotation, a new MRC is created and the control plane components will receive an MRCAdded event. The status fields have not been set and there is another MRC with spec.intent=active
    • status.state should be set to Pending
    • status.certificateStatuses should all be set to Unknown
    • status.conditions Accepted should be set to status=False and reason=Pending

Implementation

NOTE: For the in-scope scenarios, all status updates are the same. The implementation could be something like the following:

  1. Receive MRC add event
  2. Check there are no more than 2 MRCs (mrcClient.ListMeshRootCertificates()). If there are more than 2 the add event should be ignored. This might happen if multiple MRCs are created at the same time and therefore all pass the validating webhook
  3. Call RetryOnConflict. Define retry func
    1. Get the MRC
    2. Check that the status is nil, and if it is not nil return nil
    3. Set the status
    4. Update the MRC status (mrcClient.UpdateMeshRootCertificateStatus)
    5. Return err value (either error or nil)
  4. If RetryOnConflict failed, log the error (this is a fatal error - not sure how this should be handled?)
  5. If RetryOnConflict exceeded continue execution
jaellio commented 2 years ago

Dependent on #5044

jaellio commented 2 years ago

Implementation

NOTE: For the in-scope scenarios, all status updates are the same. The implementation could be something like the following:

  1. Receive MRC add event
  2. Check there are no more than 2 MRCs (mrcClient.ListMeshRootCertificates()). If there are more than 2 the add event should be ignored. This might happen if multiple MRCs are created at the same time and therefore all pass the validating webhook
  3. Call RetryOnConflict. Define retry func

    1. Get the MRC
    2. Check that the status is nil, and if it is not nil return nil
    3. Set the status
    4. Update the MRC status (mrcClient.UpdateMeshRootCertificateStatus)
    5. Return err value (either error or nil)
  4. If RetryOnConflict failed, log the error (this is a fatal error - not sure how this should be handled?)
  5. If RetryOnConflict exceeded continue execution

@keithmattix What are your thoughts on how we should handle step 2? Should the validating webhook be responsible for limiting the number of active/passive MRCs that are added? If somehow more than 2 MRCs exist, how should the controller respond? Should it set a special condition on the MRC, or should the MRC be in an error state?

keithmattix commented 2 years ago

The validating webhook should limit MRCs to 2. If by some chance, an MRC gets past the webhook, the controller should ignore add events if there are 2 MRCs with status/conditions set

steeling commented 2 years ago

With this + upgrades we have more reasons for an internal state object. With 2 phase commit we can completely avoid the scenario of having multiple mesh certs in the wrong state

keithmattix commented 2 years ago

With this + upgrades we have more reasons for an internal state object. With 2 phase commit we can completely avoid the scenario of having multiple mesh certs in the wrong state

I'm not sold on 2pc honestly; from my light reading, I have concerns about reads during the "promise" stage of the commit. The saga pattern seems more widely adopted, and something like nats would let us implement that, but honestly, the more prevalent pattern in Kubernetes is a lock/lease for each control plane component (each of which could have multiple replicas) so that there's only one writer. I'd much prefer that approach because we could guarantee that the writer's view of the world is correct.

keithmattix commented 2 years ago

Closing in favor of https://github.com/openservicemesh/osm/issues/5179