karmada-io / karmada

Open, Multi-Cloud, Multi-Cluster Kubernetes Orchestration
https://karmada.io
Apache License 2.0

[Feature] Stateful Application Failover Support #5788

Open RainbowMango opened 2 weeks ago

RainbowMango commented 2 weeks ago

Summary

Karmada’s scheduling logic assumes that the resources it schedules and reschedules are stateless. In some cases, users may want to preserve certain state so that applications can resume from where they left off in the previous cluster.

For CRDs that deal with data processing (such as Flink or Spark), it can be particularly useful to restart applications from a previous checkpoint. That way, applications can seamlessly resume processing data while avoiding double processing.

This feature aims to introduce a generalized way for users to define application state preservation in the context of cluster-to-cluster failovers.

Proposal

Iteration Tasks -- Part-1: Ensure the scheduler skips the cluster where the failover was triggered

Iteration Tasks -- Part-2: state preservation and feed

Iteration Tasks -- Part-3: failover history

The failover history might be optional as we don't rely on it. TBD: based on #5251
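As an illustration of the Part-1 goal, here is a minimal sketch of filtering out clusters that triggered a failover before picking a new target. The function name `filterCandidates`, the cluster names, and the `failedOver` set are hypothetical and are not taken from the Karmada codebase; the real scheduler works on richer types.

```go
package main

import "fmt"

// filterCandidates returns the clusters that remain eligible for scheduling.
// Any cluster recorded in failedOver (hypothetical bookkeeping of clusters
// that triggered a failover) is skipped.
func filterCandidates(all []string, failedOver map[string]bool) []string {
	eligible := make([]string, 0, len(all))
	for _, name := range all {
		if failedOver[name] {
			continue // do not reschedule onto the cluster we just failed away from
		}
		eligible = append(eligible, name)
	}
	return eligible
}

func main() {
	clusters := []string{"member1", "member2", "member3"}
	failed := map[string]bool{"member2": true}
	fmt.Println(filterCandidates(clusters, failed))
}
```

The sketch keeps the original order of the candidate list, so any later scoring step would see the same ordering minus the skipped clusters.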

mszacillo commented 2 weeks ago

Looks great, thank you!

Could we add a checklist item to include a default failoverType label onto the resource that has been failed over?

RainbowMango commented 2 weeks ago

Could we add a checklist item to include a default failoverType label onto the resource that has been failed over?

I don't have a strong feeling that we need it, because according to the draft design, you can set the label name to whatever you want. For instance, you could name the label karmada.io/failover-flink-checkpoint and then configure Kyverno to match on that label. Am I right?
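To make the idea concrete, here is a rough sketch of how a preserved value could be fed to the rescheduled resource under a user-declared label name. This is only an assumption about the shape of the mechanism, not the draft design itself: `injectStateLabel` and the checkpoint value "1001" are hypothetical, and only the label name comes from the comment above.

```go
package main

import "fmt"

// injectStateLabel returns a copy of labels with the preserved state added
// under the user-declared label name. The input map is left unmodified.
func injectStateLabel(labels map[string]string, labelName, value string) map[string]string {
	out := make(map[string]string, len(labels)+1)
	for k, v := range labels {
		out[k] = v
	}
	out[labelName] = value
	return out
}

func main() {
	original := map[string]string{"app": "flink-job"}
	// "1001" stands in for a checkpoint identifier (hypothetical value).
	updated := injectStateLabel(original, "karmada.io/failover-flink-checkpoint", "1001")
	fmt.Println(updated["karmada.io/failover-flink-checkpoint"])
	fmt.Println(len(original))
}
```

A policy engine such as Kyverno could then match on the declared label, which is why a fixed default label name is not strictly required.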

RainbowMango commented 2 weeks ago

@mszacillo I'm trying to split the whole feature into small pieces, hoping more people could get involved and accelerate development.

For now, it's a work in progress, but I'm glad you noticed it. Let me know if you have any comments or questions.

mszacillo commented 2 weeks ago

@RainbowMango I think that's a good idea, and having this feature available faster would be great. :)

Do you have a preference on who will be working on which task? If not I can pick up the introduction of PurgeMode to the GracefulEvictionTask today.

In addition, could we start a Slack working group channel? Given the time differences, I think being able to have quicker conversations on Slack would improve the implementation pace.

mszacillo commented 2 weeks ago

I don't have a strong feeling that we need it, because according to the draft design, you can set the label name to whatever you want.

That's true, we can simply declare our own label name for the use-case. In the case of a failover, it might be helpful to distinguish between cluster + application failovers, and only Karmada has the context. But perhaps I'm creating a use-case before it's even appeared.

RainbowMango commented 2 weeks ago

Do you have a preference on who will be working on which task? If not I can pick up the introduction of PurgeMode to the GracefulEvictionTask today.

Sure, go for it! I've assigned this task to you. I think you are the feature owner, so it would be great if you could work on it :) Generally speaking, anyone can take a task without an assignment by leaving a comment here. The issue owner (me, in this case) will assign it by adding the name to the end of the task.

RainbowMango commented 2 weeks ago

In the case of a failover, it might be helpful to distinguish between cluster + application failovers, and only Karmada has the context. But perhaps I'm creating a use-case before it's even appeared.

Yeah, the only benefit I can see is that it might help to distinguish failover types, but I think there is no rush to do it until there is a solid use case. I added a checklist item for this; we can revisit it later.

Confirm whether we need to introduce a default label to distinguish the failover type. (Waiting for a real-world use case.)

RainbowMango commented 1 week ago

Make changes to the RB application failover controller and the CRB application failover controller to build the eviction task for PurgeMode Immediately. (@mszacillo)

@mszacillo I assigned this task to you according to the discussion on https://github.com/karmada-io/karmada/pull/5821#pullrequestreview-2438835388.
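For readers following the task, a simplified sketch of what building an eviction task with a purge mode might look like. The types below are hypothetical stand-ins, not Karmada's actual GracefulEvictionTask definition, and the interpretation that Immediately means "remove from the old cluster before rescheduling" is an assumption made for illustration.

```go
package main

import "fmt"

// PurgeMode is a local, simplified type for this sketch only.
type PurgeMode string

// PurgeModeImmediately mirrors the "Immediately" mode named in the task.
const PurgeModeImmediately PurgeMode = "Immediately"

// evictionTask is a hypothetical stand-in for GracefulEvictionTask.
type evictionTask struct {
	FromCluster string
	PurgeMode   PurgeMode
}

// buildEvictionTask records which cluster is being evicted and how.
func buildEvictionTask(cluster string, mode PurgeMode) evictionTask {
	return evictionTask{FromCluster: cluster, PurgeMode: mode}
}

// purgeBeforeReschedule reports whether the old copy should be removed
// right away (assumed meaning of Immediately in this sketch).
func purgeBeforeReschedule(t evictionTask) bool {
	return t.PurgeMode == PurgeModeImmediately
}

func main() {
	task := buildEvictionTask("member1", PurgeModeImmediately)
	fmt.Println(task.FromCluster, purgeBeforeReschedule(task))
}
```

Removing the old copy first is what would let a stateful job such as Flink avoid two instances processing the same data during failover.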