operator-framework / java-operator-sdk

Java SDK for building Kubernetes Operators
https://javaoperatorsdk.io/
Apache License 2.0

Ready postcondition doesn't receive status updates for managed dependent resources in managed workflow #1565

Closed grossws closed 1 year ago

grossws commented 2 years ago

Bug Report

A reconciler with a managed workflow and managed dependent resources with a ready post-condition doesn't progress after reconciling the first dependent resource.

The resource with readyPostcondition is reconciled successfully, but the condition based on the secondary is never met, since it always receives the same secondary resource that was cached at the time of its reconciliation.

Tried it with both WATCH_ALL_NAMESPACES and WATCH_CURRENT_NAMESPACE.

What did you do?

Full reproducer: https://github.com/grossws/operatorsdk-es-issue

@ControllerConfiguration(
        name = "project-operator",
        dependents = {
                @Dependent(name = "first-svc", type = FirstService.class),
                @Dependent(name = "second-svc", type = SecondService.class),
                @Dependent(name = "first", type = FirstStatefulSet.class,
                        dependsOn = {"first-svc"},
                        readyPostcondition = MyReconciler.FirstReadyCondition.class),
                @Dependent(name = "second", type = SecondStatefulSet.class,
                        dependsOn = {"second-svc", "first"}),
        }
)
public class MyReconciler implements Reconciler<Project>, ContextInitializer<Project> {
    static final Logger log = LoggerFactory.getLogger(MyReconciler.class);

    @Inject 
    KubernetesClient client;

    @Override
    public void initContext(Project primary, Context<Project> context) {
        context.managedDependentResourceContext().put("client", client);
    }

    @Override
    public UpdateControl<Project> reconcile(Project resource, Context<Project> context) throws Exception {
        var ready = context.managedDependentResourceContext().getWorkflowReconcileResult().orElseThrow().allDependentResourcesReady();

        var status = Objects.requireNonNullElseGet(resource.getStatus(), ProjectStatus::new);
        status.setStatus(ready ? "ready" : "not-ready");
        resource.setStatus(status);

        // manually reschedule to force a call to `FirstReadyCondition#isMet`
        // even when no new events are received from the informer
        return UpdateControl.updateStatus(resource)
                .rescheduleAfter(Duration.ofSeconds(10));
    }

    public static class FirstReadyCondition implements Condition<StatefulSet, Project> {
        @Override
        public boolean isMet(Project primary, StatefulSet secondary, Context<Project> context) {
            var client = context.managedDependentResourceContext().getMandatory("client", KubernetesClient.class);

            var options = new ListOptionsBuilder().withLabelSelector("app.kubernetes.io/name=" + secondary.getMetadata().getName()).build();
            var statefulSets = client.resources(StatefulSet.class).list(options);
            if (!statefulSets.getItems().isEmpty()) {
                log.info("secondary status: {}", secondary.getStatus());
                log.info("fetched status: {}", statefulSets.getItems().get(0).getStatus());
            }

            var readyReplicas = secondary.getStatus().getReadyReplicas();
            return readyReplicas != null && readyReplicas > 0;
        }
    }
}

Managed dependent resources are discriminated based on labelSelector:

@KubernetesDependent(labelSelector = FirstStatefulSet.SELECTOR)
public class FirstStatefulSet extends BaseStatefulSet {
    public static final String SELECTOR = "app.kubernetes.io/managed-by=project-operator," +
                                          "app.kubernetes.io/component=first";
    // ...
}

What did you expect to see?

  1. The ready post-condition's isMet eventually returns true once the StatefulSet's readyReplicas becomes 1.
  2. Both StatefulSets are reconciled and the CR status is updated based on WorkflowReconcileResult.

What did you see instead? Under which circumstances?

  1. The ready post-condition's isMet, which is based on the secondary resource status, always returns false, since it receives the same cached secondary resource from the moment it was reconciled.
  2. Workflow hangs after reconciling the first StatefulSet.

The logs show that the actual StatefulSet status is updated, but the secondary passed to isMet stays the same:

secondary status: StatefulSetStatus(availableReplicas=0, collisionCount=null, conditions=[], currentReplicas=null, currentRevision=null, observedGeneration=null, readyReplicas=null, replicas=0, updateRevision=null, updatedReplicas=null, additionalProperties={})
fetched status: StatefulSetStatus(availableReplicas=0, collisionCount=0, conditions=[], currentReplicas=1, currentRevision=first-p1-6dc67d5df7, observedGeneration=1, readyReplicas=null, replicas=1, updateRevision=first-p1-6dc67d5df7, updatedReplicas=1, additionalProperties={})

Environment

Additional context

I'm implementing an operator for a legacy system consisting of a bunch of stateful and stateless microservices, some of which require a strict startup order, so I tried the workflow feature.

dependsOn alone is not enough, since the reconciler starts reconciling the second dependent service right after the first one is reconciled (but not ready yet). Hence the readyPostcondition.
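The difference between plain dependsOn and a ready post-condition can be illustrated with a simplified, self-contained model. This is only a sketch and not the SDK's workflow implementation: `WorkflowOrderSketch`, `Dependent` and `reconcilePass` are hypothetical names, and the model assumes dependents are listed after the ones they depend on.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BooleanSupplier;

// Simplified model only; all names here are hypothetical, not Java Operator SDK API.
class WorkflowOrderSketch {

    // Stand-in for a dependent resource: its name, the names it depends on,
    // and an optional readiness check (null means "ready once reconciled").
    static final class Dependent {
        final String name;
        final List<String> dependsOn;
        final BooleanSupplier ready;

        Dependent(String name, List<String> dependsOn, BooleanSupplier ready) {
            this.name = name;
            this.dependsOn = dependsOn;
            this.ready = ready;
        }
    }

    // One pass over the dependents in declaration order. A dependent is reconciled
    // only if each of its dependencies was reconciled in this pass and reports ready.
    static List<String> reconcilePass(List<Dependent> dependents) {
        Map<String, Boolean> readyByName = new HashMap<>();
        List<String> reconciled = new ArrayList<>();
        for (Dependent d : dependents) {
            boolean depsReady = d.dependsOn.stream()
                    .allMatch(dep -> readyByName.getOrDefault(dep, false));
            if (!depsReady) {
                continue; // blocked until a later pass
            }
            reconciled.add(d.name);
            readyByName.put(d.name, d.ready == null || d.ready.getAsBoolean());
        }
        return reconciled;
    }

    public static void main(String[] args) {
        boolean[] firstReady = {false}; // flipped once the first StatefulSet has ready replicas
        List<Dependent> dependents = List.of(
                new Dependent("first-svc", List.of(), null),
                new Dependent("second-svc", List.of(), null),
                new Dependent("first", List.of("first-svc"), () -> firstReady[0]),
                new Dependent("second", List.of("second-svc", "first"), null));

        System.out.println(reconcilePass(dependents)); // [first-svc, second-svc, first]
        firstReady[0] = true;
        System.out.println(reconcilePass(dependents)); // [first-svc, second-svc, first, second]
    }
}
```

In this model, "second" stays blocked for as long as the readiness check of "first" keeps returning false, which matches the hang described above when the check only ever sees a stale status.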

To manage several dependent resources of the same type, I discriminated them by label selector, e.g. app.kubernetes.io/managed-by=...,app.kubernetes.io/component=..., where the component is unique for the resource type among the resources managed by this operator.
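As a rough illustration of how such a selector tells two resources of the same type apart, here is a minimal, self-contained sketch of equality-based label matching. It supports only comma-separated `key=value` terms, `LabelSelectorSketch` and its methods are hypothetical names, and the `component=second` label is an assumption, since the reproducer snippet above only shows the selector of the first StatefulSet.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified illustration only: handles just comma-separated `key=value` terms,
// not the full Kubernetes selector syntax. All names are hypothetical.
class LabelSelectorSketch {

    // Parses a selector such as "a=b,c=d" into the labels it requires.
    static Map<String, String> parse(String selector) {
        Map<String, String> required = new LinkedHashMap<>();
        for (String term : selector.split(",")) {
            String[] kv = term.trim().split("=", 2);
            if (kv.length != 2 || kv[0].isEmpty()) {
                throw new IllegalArgumentException("Unsupported selector term: " + term);
            }
            required.put(kv[0], kv[1]);
        }
        return required;
    }

    // A resource matches when it carries every required label with the required value.
    static boolean matches(String selector, Map<String, String> labels) {
        return parse(selector).entrySet().stream()
                .allMatch(e -> e.getValue().equals(labels.get(e.getKey())));
    }

    public static void main(String[] args) {
        String firstSelector = "app.kubernetes.io/managed-by=project-operator,"
                + "app.kubernetes.io/component=first";
        Map<String, String> firstLabels = Map.of(
                "app.kubernetes.io/managed-by", "project-operator",
                "app.kubernetes.io/component", "first");
        // Assumed labels for the second StatefulSet (not shown in the report).
        Map<String, String> secondLabels = Map.of(
                "app.kubernetes.io/managed-by", "project-operator",
                "app.kubernetes.io/component", "second");

        System.out.println(matches(firstSelector, firstLabels));  // true
        System.out.println(matches(firstSelector, secondLabels)); // false
    }
}
```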

See also: https://discord.com/channels/723455000604573736/780769121305493544/1032712459200503829

csviri commented 2 years ago

Thanks @grossws, we will take a look at this soon.

csviri commented 2 years ago

@grossws, this use case is not supported in 3.x but will be in 4.1, which will hopefully be released next week. I reproduced it based on the sample in this PR: https://github.com/java-operator-sdk/java-operator-sdk/pull/1581

It works flawlessly. I hope it is OK for you to upgrade to this version (the Quarkus extension might come a little later).

csviri commented 1 year ago

(If there are no objections, I will close this issue for now.)

grossws commented 1 year ago

Thanks for investigating, @csviri.

I hope the Quarkus folks will integrate it soon after your release. They already have 5.0.0.Beta1 with Java Operator SDK 4.0.3.