giantswarm / roadmap

Giant Swarm Product Roadmap
https://github.com/orgs/giantswarm/projects/273

Application layering in app-platform #1489

Closed: mproffitt closed this issue 1 year ago

mproffitt commented 2 years ago

Is your feature request related to a problem? Please describe.

When deploying clusters, customers often see failures in services and applications due to reliance on core components which have not completed their deployment.

Describe the solution you'd like

The proposed solution is to implement an ordering of applications inside app-platform that delays application deployments across three levels:

  1. core services
  2. default apps
  3. customer deployments

This solution must not enable application ordering inside any single layer; instead, it must introduce the ability to control how and when any given layer is deployed, based on the status of the previous layer plus any required stabilisation time (e.g. cache warming).

This would increase confidence in cluster deployment and reduce the overall failure rate of deployments.
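To make the idea concrete, here is a minimal, purely illustrative sketch of how such layering could be declared. None of these fields or names exist in app-platform today; they only illustrate the requested behaviour:

# Hypothetical layering configuration (no such API exists yet)
layers:
  - name: core-services
    stabilisation: 5m        # wait after all apps in the layer report Deployed (e.g. cache warming)
  - name: default-apps
    requiresLayer: core-services
    stabilisation: 2m
  - name: customer-deployments
    requiresLayer: default-apps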

Additional context

Related tickets:

gianfranco-l commented 2 years ago

The effort to design and implement this is quite high, and the team doesn't have capacity for it right now. Tentative ETA: > 4 months

gianfranco-l commented 2 years ago

@giantswarm/team-honeybadger please leave your input on how you envision this being tackled, with the pros and cons of each scenario, in preparation for bringing visibility to SIG Product

piontec commented 2 years ago

There are 2 main ways of implementing this: explicit and hidden.

By "hidden" I mean we add some internal logic to app-operator that makes it handle some App CRs differently than others, choosing which behavior to run based on things like the names of App CRs. For example, an App CR named nginx-ingress won't be reconciled until the App CR named openstack-default-apps is ready. I am strongly against this approach. The main problem I see is that it makes the behavior completely non-intuitive, based on some hidden naming scheme or other properties. In general, it introduces new functionality but keeps it hidden and not explicitly accessible in our API. It creates a "surprise effect", where it is suddenly hard for someone without that knowledge to explain why the controller behaves differently for two similar objects.

The "explicit" solution would introduce an API change, extending the current definition of the App CR with something like:

metadata:
  name: nginx-ingress-controller
...
dependencies:
  requiresDeployed:
    - openstack-default-apps
...

That way we can describe in the object itself that it doesn't make any sense to act on it if other objects are not deployed. For me, this use case is a subset of the full 'dependency management' story and probably makes a good first step towards implementing it. Still, I would like to tackle it in this context: as part of a bigger 'app dependencies' story, which - in the big picture - should also support things like required versions, negative dependencies (can't be deployed with) or alternatives (requires any implementation of the ingress API).
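To illustrate where that bigger story could go, here is a hypothetical extension of the snippet above; none of these fields exist in the App CRD, and the names are made up for illustration:

metadata:
  name: nginx-ingress-controller
...
dependencies:
  requiresDeployed:
    - openstack-default-apps
  requiresVersions:
    - name: cert-manager
      version: ">= 1.0.0"          # required version constraint
  conflictsWith:
    - other-ingress-controller     # negative dependency (can't be deployed with)
  requiresAnyOf:
    - capability: ingress          # any implementation of the ingress API
...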

ljakimczuk commented 2 years ago

My 2 cents.

About the "hidden" solution, I think the example given against it is a bit exaggerated. It is not about hiding complicated logic based on naming; it is about App Operator acknowledging that it takes part in reconciling the workload cluster, because it reconciles the default apps without which the cluster is not ready, and accounting for this. Default apps are an integral part of the cluster; it just so happens App Operator is tasked with reconciling them, in other words, with making the cluster ready. So it already has the knowledge needed to postpone reconciliation of user apps until the cluster is ready, it just ignores that knowledge right now. It may be said this logic is hidden, but I believe that's not entirely true, because we all know this logic exists right now, even the users; we even acknowledge it by telling users to stop certain apps when creating / re-creating the cluster, or to do things in a specific order. But I agree with the part that it would be nicer to have an external way of defining priorities and order, as a new feature, that would allow us to capture this dependency and open the door to defining others.

About extending the App CR, I strongly disagree with this as the solution for the particular problem we had. It is a nice feature for the user to have, so that they can define priorities between some apps, but I find it very cumbersome for a user to set a dependency on every single default app on their app's App CR. I mean, with dependencies defined this way we would need to list all the default apps there if we wanted to postpone reconciliation of some App CR until the cluster is created and ready. It can be improved slightly by setting the dependency on the default catalog, but it is still cumbersome to set it on every non-default app. It especially feels like pushing responsibility for implementing logic we all know exists onto the user. I mean, we know it is better to reconcile user apps when the cluster is ready, but in this scenario we force the user to configure it by listing, on their apps, all or some of the default apps contributing to the cluster being ready. The user obviously does not need to do it right away for all their apps, it may be done ad hoc whenever the problem is spotted, but it is still a cumbersome way of dealing with an issue we know exists: hitting it from time to time is frustrating, and each time somebody hits it, they or somebody else must spend some time troubleshooting before confirming that ordering is the problem. That is why IMO it would be better to have something pre-defined.
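For illustration only, a catalog-level dependency as mentioned above could look something like this, assuming a hypothetical catalog matcher on the same made-up dependencies field from the earlier snippet:

metadata:
  name: customer-app
...
dependencies:
  requiresDeployed:
    - catalog: default   # hypothetical: wait for everything coming from the default catalog
...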

From my point of view, the solution that should satisfy both use cases would be similar to Kubernetes API Priority and Fairness. I imagine having an AppPriority resource that would assign apps to priority levels the App Operator will respect. It could assign apps coming from the default catalog (default apps) to priority 100, the Giant Swarm non-default apps to priority 200, and all the other apps to 1000, etc.:

apiVersion: apppriority.application.giantswarm.io/v1alpha1
kind: AppPriority
metadata:
  name: default-apps
...
spec:
  match:
  - catalog: default
  priority: 100
---
apiVersion: apppriority.application.giantswarm.io/v1alpha1
kind: AppPriority
metadata:
  name: giantswarm-apps
...
spec:
  notMatch:
  - catalog: default
  priority: 200
---
apiVersion: apppriority.application.giantswarm.io/v1alpha1
kind: AppPriority
metadata:
  name: catch-all
...
spec:
  priority: 1000

This way we can capture the logic we all know exists, but also leave a way for users to define priorities or dependencies between apps, without forcing them to update the N App CRs they may have just to wait until the cluster is ready.
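As a usage example, and still assuming the sketched AppPriority API above (with lower numbers reconciled first), a user could slot their own apps between the built-in levels with something like:

apiVersion: apppriority.application.giantswarm.io/v1alpha1
kind: AppPriority
metadata:
  name: my-platform-services
...
spec:
  match:
  - catalog: my-company-catalog   # hypothetical customer catalog
  priority: 500                   # after the Giant Swarm apps, before the catch-all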

uvegla commented 2 years ago

I like the AppPriority approach the best. It is Kubernetes-like and decouples the declaration from the App CRs.

piontec commented 2 years ago

Nice comments :) I agree that explicit dependencies between App CRs would be cumbersome in this case, like you said, because it means every App needs to be altered and configured (listing the dependencies) to make it work.

Right now, I see the following options:

  1. Do nothing in app-operator. This is still (IMHO) a valid solution. The Kubernetes way is that stuff is expected to fail; it is only expected to work eventually. So, if a specific app can't be installed because it's crapping out before the cluster is usable, it's really a problem with that app, not with how Kubernetes works. This also leaves the app-operator contract very clean: by creating an App CR, I express my will to try and install this app now.
  2. Extend app-operator
    1. The proper way: introduce something like AppPriority, as proposed above. Still, what are the semantics of AppPriority? Does it only define the order in which apps are reconciled? That doesn't solve anything. So it should probably block reconciliation of lower-priority apps until the higher-priority apps are Deployed. But there are a ton of corner cases. What if we reconciled and deployed prio 100 and prio 200 apps, and then one of the prio 100 apps fails an upgrade? Do we uninstall all prio 200 apps? What if a customer decides to deploy a new prio 200 app when one of the prio 100 apps is down, and other prio 200 apps are already installed?
    2. A more hacky way: we introduce a config option to app-operator that configures an annotation on App CRs, saying "This is a cluster bootstrap app (default app). If you see this, reconcile it until it's Deployed, then screw it, reconcile everything as it is". (A rough sketch of such an annotation follows below.)
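As a rough sketch of option 2.2, with a made-up annotation key that does not exist in app-operator today and an abbreviated App spec:

apiVersion: application.giantswarm.io/v1alpha1
kind: App
metadata:
  name: cert-manager
  annotations:
    app-operator.giantswarm.io/cluster-bootstrap: "true"   # hypothetical annotation key
spec:
  catalog: default
...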

WDYT? Also, please move this discussion to https://github.com/giantswarm/giantswarm/issues/24081, as we're getting very much into implementation ideas here (so potentially long thread), while missing input from other teams about the whole "cluster bootstrap" problem.

ljakimczuk commented 2 years ago

Regarding the annotation option (2.2): this is basically what the @mproffitt proposition was about, but instead of adding another annotation, this decision can already be made on the data we currently have, like the default catalog, because the apps contributing to the cluster being ready come from there. This is in turn what I meant by saying we currently have the knowledge in our hands but are not using it, and that we do not need any complicated logic.

Regarding the AppPriority option (2.1), well...

Does it only tell the order in which apps are reconciled?

One of the things it should do, I believe, is define the order between groups of apps. But it should also block reconciliation of the next group if the first one is not properly reconciled, because, as noted, it otherwise won't be of much help. In my mind, this should behave like ordering and dependencies at the same time. But this is an implementation detail we must work out if we decide to take this path. I wasn't trying to come up with a complete solution for now, but merely trying to decide on the right place to implement it.

What if we reconciled and deployed prio 100 and prio 200 apps, then 1 of the prio 100 fails an upgrade? Do we uninstall all prio 200?

The App Operator should never uninstall anything if not requested by the customer.

What if a customer decides to deploy new prio 200 when one of the prio 100 is down? And other prio 200 are already installed?

Again, this is an implementation detail we should work out if we decide to go down this path. On one hand, I can imagine blocking prio 200 being too much, but on the other hand I can also imagine it may be needed, if for example an upgraded version of this prio 200 app introduces something that may block prio 100 indefinitely.

gianfranco-l commented 1 year ago

closing in favor of https://github.com/giantswarm/roadmap/issues/1965