Closed — tnozicka closed this issue 1 week ago
Ideally add a test with broken NodeConfig a) on RAID b) on mounts so we know the final conditions also work
when we have the aggregated conditions we should update printers as well, see #1730
> Ideally add a test with broken NodeConfig a) on RAID b) on mounts so we know the final conditions also work
@tnozicka do you know any simple reproducers of the two cases?
I suppose a) reference a nonexistent device or a broken one b) mounting over an existing mount
@tnozicka I need some insight into the intended semantics of the current conditions.
`NodeConfigReconciledConditionType` now:
https://github.com/scylladb/scylla-operator/blob/c3e79162e1d6f24f2eb392bb1d26ab81c0876426/pkg/api/scylla/v1alpha1/types_nodeconfig.go#L29
What is the exact meaning of it? The comment says "Reconciled indicates that the NodeConfig is fully deployed and available" - what does "fully deployed" mean exactly?
https://github.com/scylladb/scylla-operator/blob/c3e79162e1d6f24f2eb392bb1d26ab81c0876426/pkg/controller/nodeconfig/status.go#L19
The status calculation suggests that it refers to the required DaemonSets being deployed (updated+available) - I assume that with DaemonSet controller conditions this will no longer be needed? What does the `Available` condition mean in the context of NodeConfig? It's not really intuitive and I don't see any of the controllers setting it in the current implementation.
The thing with NodeConfig is that it technically may not have a long-running workload to call it "available", so I chose "Reconciled" initially, before we had the generic sync functions or the other patterns. I suppose we can use Available as well here in a broader sense. I guess implementation-wise we'll always have to have something long-running to make it level-based, unless Kubernetes exposes those through an API.
Should the aggregated node conditions be taken into account when aggregating the general conditions? In other words, should the aggregated NodeConfig conditions also reflect conditions of particular node setups?
Can you elaborate here or give me an example? I am not sure I got the point.
> I suppose we can use the available as well here in a broader sense.
Wouldn't Progressing make more sense? It seems better aligned with ScyllaCluster and StatefulSets, for example.
> Can you elaborate here or give me an example? I am not sure I got the point.
For example, should `RaidControllerNodeubuntu-2204Degraded` be aggregated into the general `Degraded`? Same for available/progressing.
> Wouldn't Progressing make more sense? It seems better aligned with ScyllaCluster and StatefulSets, for example.
Progressing and Available are given. I thought we were talking Reconciled vs. Available.
> For example, should `RaidControllerNodeubuntu-2204Degraded` be aggregated into general `Degraded`?
`RaidControllerNodeubuntu-2204Degraded => RaidControllerDegraded => Degraded`
But there is also the aggregation per node. It's all a bit weird since some conditions come from the workload, but for the aggregation purposes, I'd just do:
`*Available` => `Available`
`*Progressing` => `Progressing`
`*Degraded` => `Degraded`
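This wildcard aggregation could be sketched roughly as below. `Condition` here is a simplified stand-in for `metav1.Condition`, and `aggregateWithSuffix` is a hypothetical helper for illustration, not the operator's actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// Condition is a simplified stand-in for metav1.Condition, just for this sketch.
type Condition struct {
	Type   string
	Status string // "True" / "False" / "Unknown"
}

// aggregateWithSuffix folds every condition whose Type ends with suffix into a
// single top-level condition of that Type. For Degraded-style conditions, any
// True source makes the aggregate True, and Unknown wins over False.
func aggregateWithSuffix(conditions []Condition, suffix string) Condition {
	aggregated := Condition{Type: suffix, Status: "False"}
	for _, c := range conditions {
		// Skip non-matching conditions and the aggregate itself.
		if !strings.HasSuffix(c.Type, suffix) || c.Type == suffix {
			continue
		}
		switch c.Status {
		case "True":
			return Condition{Type: suffix, Status: "True"}
		case "Unknown":
			aggregated.Status = "Unknown"
		}
	}
	return aggregated
}

func main() {
	conditions := []Condition{
		{Type: "RaidControllerNodeubuntu-2204Degraded", Status: "False"},
		{Type: "MountControllerNodeubuntu-2204Degraded", Status: "True"},
	}
	fmt.Println(aggregateWithSuffix(conditions, "Degraded").Status)
}
```

The key property is that a single degraded source condition is enough to degrade the top-level aggregate, with no intermediate steps.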
Thanks for all the responses so far.
> `RaidControllerNodeubuntu-2204Degraded => RaidControllerDegraded => Degraded`
With how this is implemented now, I was thinking this should be more of a `RaidControllerNodeubuntu-2204Degraded => Nodeubuntu-2204Degraded => Degraded` flow? Because the node conditions are aggregated per node now (`RaidControllerNodeubuntu-2204Degraded => Nodeubuntu-2204Degraded`), so a more natural approach to me would be to aggregate all the NodeConfig controller conditions (like `DaemonSetControllerDegraded`, which are not set up now at all) + all the node aggregates (`Nodeubuntu-2204Degraded` etc.). So the NodeSetup controller would do the per-node aggregation, and the NodeConfig controller would do the top-level aggregation. Otherwise the NodeConfig controller would have to aggregate partial conditions coming from independent controllers.
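The per-node naming convention in that flow can be illustrated with a small sketch. `nodeAggregateType` is a hypothetical helper (illustrative parsing, not the actual implementation) that derives the per-node aggregate type from a controller condition type:

```go
package main

import (
	"fmt"
	"strings"
)

// nodeAggregateType turns e.g. "RaidControllerNodeubuntu-2204Degraded" into
// "Nodeubuntu-2204Degraded" by stripping the controller prefix. Condition
// types without a "Node" segment (controller-level conditions like
// "DaemonSetControllerDegraded") do not participate in per-node aggregation.
func nodeAggregateType(condType string) (string, bool) {
	i := strings.Index(condType, "Node")
	if i <= 0 || !strings.HasSuffix(condType, "Degraded") {
		return "", false
	}
	return condType[i:], true
}

func main() {
	for _, t := range []string{
		"RaidControllerNodeubuntu-2204Degraded",
		"MountControllerNodeubuntu-2204Degraded",
		"DaemonSetControllerDegraded",
	} {
		if agg, ok := nodeAggregateType(t); ok {
			// Two-stage flow: controller condition => per-node aggregate => top level.
			fmt.Printf("%s => %s => Degraded\n", t, agg)
		} else {
			// Controller-level conditions go straight into the top-level aggregate.
			fmt.Printf("%s => Degraded\n", t)
		}
	}
}
```

This only demonstrates the naming scheme; the actual aggregation status logic would live in the NodeSetup and NodeConfig controllers respectively.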
I don't find the "mid" aggregations particularly useful (and there are a lot of options), so I'd likely only generically aggregate the `*Degraded` => `Degraded` and leave the rest for when we need it. But the only aggregation case I see useful at this point is to reasonably wait for it to be successfully done, which is covered by the wildcard aggregation without any intermediate steps.
> I don't find the "mid" aggregations particularly useful (and there are a lot of options)
I think it's useful because the NodeConfig controller has to read these conditions from the status, since they come from different controllers (NodeSetup), so if they're not there, we have to assume a worst-case default. I guess it's easier to do per matching node than per matching node AND per each controller of NodeSetup. But OK, I understand the conclusion is that the intermediate conditions are an implementation detail.
> `*Degraded` => `Degraded`
I don't think I follow what this is about.
I was thinking about just taking the existing conditions on the status and doing
```go
degradedCondition, err := AggregateStatusConditions(
	FindStatusConditionsWithSuffix(*allConditions, "Degraded"),
	metav1.Condition{
		Type:               "Degraded",
		Status:             metav1.ConditionFalse,
		Reason:             internalapi.AsExpectedReason,
		Message:            "",
		ObservedGeneration: generation,
	},
)
```
probably fixing the logic to check observed generations, since the conditions come from different places
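A minimal sketch of what that observed-generation check could look like. `Condition` is a simplified stand-in for `metav1.Condition`, and `aggregateDegraded` is a hypothetical helper, not the operator's real aggregation function:

```go
package main

import "fmt"

// Condition is a simplified stand-in for metav1.Condition.
type Condition struct {
	Type               string
	Status             string // "True" / "False" / "Unknown"
	ObservedGeneration int64
}

// aggregateDegraded aggregates *Degraded conditions into a single Degraded
// condition, treating any condition whose ObservedGeneration lags behind the
// object's Generation as not-yet-reported: instead of trusting stale data,
// the aggregate falls back to "Unknown".
func aggregateDegraded(generation int64, conditions []Condition) Condition {
	out := Condition{Type: "Degraded", Status: "False", ObservedGeneration: generation}
	for _, c := range conditions {
		if c.ObservedGeneration != generation {
			// Stale condition from a previous generation: we cannot trust it.
			out.Status = "Unknown"
			continue
		}
		if c.Status == "True" {
			return Condition{Type: "Degraded", Status: "True", ObservedGeneration: generation}
		}
	}
	return out
}

func main() {
	conds := []Condition{
		{Type: "RaidControllerNodeubuntu-2204Degraded", Status: "False", ObservedGeneration: 1},
		{Type: "MountControllerNodeubuntu-2204Degraded", Status: "False", ObservedGeneration: 2},
	}
	// The first condition is stale, so the aggregate cannot claim "False".
	fmt.Println(aggregateDegraded(2, conds).Status)
}
```

This captures the point above: because the conditions come from independent controllers, a missing or outdated report has to degrade the aggregate to a worst-case default rather than being silently ignored.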
The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After a period of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After a period of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- /remove-lifecycle stale
- /close

/lifecycle stale
/remove-lifecycle stale
/triage accepted
we also need a way of figuring out if all the asynchronous statuses have been reported already (from node setup pods)
What should the feature do?
There needs to be a way to assess whether a NodeConfig has succeeded, but we only have conditions like `RaidControllerNodeubuntu-2204Degraded`, where the node name is variable and impractical/impossible to query.

What is the use case behind this feature?
Visibility and debuggability. When, say, a RAID setup or a mount fails, it needs to fail at the correct place, where we wait for the NodeConfig to apply, and not 5 commands later when the manager or ScyllaCluster breaks.
Anything else we need to know?
No response