Make autoTls certificate lifetime configurable #673

Open razvan opened 1 day ago

razvan commented 1 day ago

Problem

Some data products are fragile when their services are restarted: restarts can cause long response times or even data loss, leading to major disruptions in production environments.

When the secret operator generates self-signed certificates, consumers cannot easily request a validity lifetime. Pod overrides can be used as a workaround, but, as usual with this mechanism, the user experience is poor and far more platform internals are exposed than necessary.

If not specifically requested, the secret operator uses a default value of 24 hours for the lifetime of the mounted TLS certificates.
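
For illustration, this is roughly how a consumer requests a TLS secret volume from the secret operator today, and what a pod-overrides workaround would have to inject. This is a hedged sketch, not taken from the original issue: the lifetime annotation secrets.stackable.tech/backend.autotls.cert.lifetime and the concrete values are assumptions and may differ between secret operator versions.

volumes:
  - name: tls
    ephemeral:
      volumeClaimTemplate:
        metadata:
          annotations:
            secrets.stackable.tech/class: tls
            # Assumed annotation for requesting a specific certificate lifetime
            secrets.stackable.tech/backend.autotls.cert.lifetime: "7d"
        spec:
          storageClassName: secrets.stackable.tech
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: "1"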

Before the certificate lifetime expires, the requesting Pods are restarted so that they receive fresh certificates. In production environments this default has proven to be very disruptive, and customers have asked for a way to increase it.

Proposal

We propose to make the certificate lifetime configurable via a new field called requestedSecretLifetime at group level.

An example snippet from the HDFS operator:

---
apiVersion: hdfs.stackable.tech/v1alpha1
kind: HdfsCluster
metadata:
  name: hdfs
spec:
  nameNodes:
    config:
      requestedSecretLifetime: "14d"
  journalNodes:
    config:
      requestedSecretLifetime: "7d"
  dataNodes:
    roleGroups:
      ssdMachines:
        config:
          requestedSecretLifetime: "1d"
      hddMachines:
        config:
          requestedSecretLifetime: "3d"

Each operator will define a sensible set of default values at role level as appropriate.

This would allow longer restart intervals for critical services that may experience data loss, and shorter restart intervals for highly available services such as HDFS data nodes.

Alternatives

The following alternatives have been considered and dismissed.

Cluster wide

Alternatively, a configuration property could be added to the individual cluster configuration types for each operator like this:

---
apiVersion: hdfs.stackable.tech/v1alpha1
kind: HdfsCluster
metadata:
  name: hdfs
spec:
  clusterConfig:
    requestedSecretLifetime: "7d"

This was considered too restrictive, since it does not allow different lifetimes per role or role group.

Role config

TODO: is this rejected only on syntactic grounds, or is there a semantic problem with it? Is it too restrictive? Are there valid cases where different groups of the same role need different values?

---
apiVersion: hdfs.stackable.tech/v1alpha1
kind: HdfsCluster
metadata:
  name: hdfs
spec:
  nameNodes:
    requestedSecretLifetime: "7d"
  journalNodes:
    requestedSecretLifetime: "7d"
  dataNodes:
    requestedSecretLifetime: "7d"

Further consideration

We can see a future where this property manages not only the lifetime of self-signed certificates but also that of other secret types, such as keytabs.

Therefore the property name was chosen to be general enough for that, while also serving as a counterpart to the maxCertificateLifetime property of the SecretClass.
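
For reference, a hedged sketch of the SecretClass side: the autoTls backend and its maxCertificateLifetime field follow the secret operator documentation, while the CA details and names below are purely illustrative.

---
apiVersion: secrets.stackable.tech/v1alpha1
kind: SecretClass
metadata:
  name: tls
spec:
  backend:
    autoTls:
      ca:
        secret:
          name: tls-ca
          namespace: default
        autoGenerate: true
      # Upper bound enforced by the cluster admin; a higher requestedSecretLifetime
      # would be capped to this value.
      maxCertificateLifetime: 15d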

POC (outdated)

This POC was made for the role-level configuration.

A POC implementation for the HDFS operator is linked below.

Reference

Techassi commented 19 hours ago

Posting this discussion over here to increase visibility.

Outcome: TBD.

sbernauer commented 13 hours ago

Thanks for the very well-written proposal! With motivation and everything 👍 I have some additional remarks:

Regarding the naming: minSecretLifetime implies this is the minimum value. However, the effective lifetime can be less when the SecretClass's max lifetime is lower (see https://docs.stackable.tech/home/nightly/secret-operator/secretclass/#_certificate_lifetime). So I don't think we can make that assumption here, and should probably call it e.g. requestedSecretLifetime to express that cluster admins (controlling the SecretClass) can still lower it.

Regarding the default: I think we should have different defaults for different roles (so no single catch-all default in op-rs). The reason being that it's totally fine to frequently restart HA-able tools (e.g. HDFS datanodes, Superset, etc.), so let's default those to 24h, but it can be fatal e.g. for Trino coordinators, so let's default to e.g. 15d there.

Regarding the place: If we put it on the role, IMHO we should put it below .spec.myrole.roleConfig.requestedSecretLifetime. However, I would prefer to go the normal config way (basically putting it alongside graceful_shutdown_timeout), as this feels more natural and aligns better with the other configs in daily use. I see the argument that it might take longer to implement, so I quickly spiked it in Trino (see below - that's it!). So IMHO we should do it the "proper" way and not save development effort here.

diff --git a/rust/crd/src/lib.rs b/rust/crd/src/lib.rs
index 7e6859f..667ff6d 100644
--- a/rust/crd/src/lib.rs
+++ b/rust/crd/src/lib.rs
@@ -432,6 +432,10 @@ pub struct TrinoConfig {
     /// Time period Pods have to gracefully shut down, e.g. `30m`, `1h` or `2d`. Consult the operator documentation for details.
     #[fragment_attrs(serde(default))]
     pub graceful_shutdown_timeout: Option<Duration>,
+
+    /// Lifetime requested for the secrets (e.g. TLS certificates) mounted into the Pods, e.g. `7d`.
+    #[fragment_attrs(serde(default))]
+    pub requested_secret_lifetime: Option<Duration>,
 }

 impl TrinoConfig {
@@ -448,6 +452,10 @@ impl TrinoConfig {
             TrinoRole::Coordinator => DEFAULT_COORDINATOR_GRACEFUL_SHUTDOWN_TIMEOUT,
             TrinoRole::Worker => DEFAULT_WORKER_GRACEFUL_SHUTDOWN_TIMEOUT,
         };
+        let requested_secret_lifetime = match role {
+            TrinoRole::Coordinator => Duration::from_days_unchecked(15),
+            TrinoRole::Worker => Duration::from_hours_unchecked(24),
+        };

         TrinoConfigFragment {
             logging: product_logging::spec::default_logging(),
@@ -472,6 +480,7 @@ impl TrinoConfig {
             query_max_memory: None,
             query_max_memory_per_node: None,
             graceful_shutdown_timeout: Some(graceful_shutdown_timeout),
+            requested_secret_lifetime: Some(requested_secret_lifetime),
         }
     }
 }
diff --git a/rust/operator-binary/src/controller.rs b/rust/operator-binary/src/controller.rs
index 6ca1643..f2830eb 100644
--- a/rust/operator-binary/src/controller.rs
+++ b/rust/operator-binary/src/controller.rs
@@ -877,6 +877,10 @@ fn build_rolegroup_statefulset(
     catalogs: &[CatalogConfig],
     sa_name: &str,
 ) -> Result<StatefulSet> {
+    if let Some(requested_secret_lifetime) = merged_config.requested_secret_lifetime {
+        tracing::debug!(?requested_secret_lifetime, "The requested cert lifetime is");
+    }
+
     let role = trino
         .role(trino_role)
         .context(InternalOperatorFailureSnafu)?;

nightkr commented 11 hours ago

The framing in the OP makes me worried that this RFC is getting the issue backwards. The end goal shouldn't be to make this easy to configure; it should be to make it unnecessary to do so.

Naming

Agreed with @sbernauer that requested makes more sense than minimum. Even maximum would be preferable.

> We can see a future where not just the lifetime of self signed certificates can be managed by this property but also other types of secrets like keytabs.

This is also a question of scope: should it be implied that these are always configured together? Would it ever make sense to request a different keytab lifetime than the certificate lifetime? (On second thought: probably not; where relevant, that should probably be the domain of a separate policy concept of some kind instead.)

CRD Organization

I agree with @sbernauer that keeping it at the rolegroup makes more sense; role-only fields should be reserved for things that absolutely cannot be configured at the rolegroup level. You can still configure your default at the role level and let all rolegroups bubble up to that (see the sketch below).
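
A minimal sketch of that layering, assuming the proposed requestedSecretLifetime field and the usual role/roleGroups config merge (rolegroup values override the role-level default; rolegroup names are illustrative):

---
apiVersion: hdfs.stackable.tech/v1alpha1
kind: HdfsCluster
metadata:
  name: hdfs
spec:
  dataNodes:
    config:
      # role-level default for all dataNode rolegroups
      requestedSecretLifetime: "1d"
    roleGroups:
      default: {}
      fragileMachines:
        config:
          # per-rolegroup override
          requestedSecretLifetime: "7d"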

Rationale/Defaults

Ultimately, this is a workaround for buggy products, not an actual solution for anything. I'm okay with adding targeted exceptions that get higher defaults, but:

  1. There should be a clear and narrow rationale for every exception (good: "Druid experiences data loss when restarting due to #bug-1234", bad: "namenodes are important so they should be long-lived")
  2. There should be a path for removing every exception (do we know the root cause? have we filed a bug for it? are we working on fixing that bug?)
  3. There should be scrutiny, tracking, and followup for every exception, in order to enforce 1 and 2

razvan commented 10 hours ago

I updated the proposal with:

I left a TODO for the rejected role-level definition. Maybe you can expand more on that.