giantswarm / roadmap

Giant Swarm Product Roadmap
https://github.com/orgs/giantswarm/projects/273
Apache License 2.0

Let customers install apps on the management cluster #1073

Open teemow opened 2 years ago

teemow commented 2 years ago

User Story

Examples:

Details, Background

Some automation/management functionality might be optional from our perspective. It would be good to let customers install (managed) apps so that they don't have to create second-class management clusters (e.g. use a workload cluster as an "automation" cluster), which adds a lot of complexity in terms of security and networking (it needs peering to all other workload clusters).

This would also allow us to let customers already play with tools that are on our roadmap but not yet ready to be fully managed by us on the management cluster.

We need to ensure that there is isolation from sensitive functionality and secrets. But enabling the platform teams at the customer is more important than preventing any problem in the first place. Customers can also destroy workload clusters but we've decided to value freedom and flexibility more.

Related: https://github.com/giantswarm/giantswarm/issues/23419

### Tasks
- [x] Let customers install apps on the MC
- [ ] https://github.com/giantswarm/roadmap/issues/2456
- [ ] https://github.com/giantswarm/roadmap/issues/2457

gianfranco-l commented 2 years ago

@giantswarm/team-honeybadger We are doing careful (baby) steps towards this big scope with discovery work for a specific customer case. cc @MarcelMue can add more details if needed. We will keep an eye on this as a larger story while we continue doing discovery work for this specific customer and future customers that might be interested in this as well

Update June 9th: the team cannot progress on this because of other, higher priorities (and no bandwidth), so this has been put on hold (roadmap status --> future > 6 months).

JosephSalisbury commented 2 years ago

We chatted in SIG Product WH, @cornelius-keller brought up that this is getting important for THG, they want to run their own operators to create OpenStack infrastructure.

There's a discussion about using Crossplane, and whether it's okay if they're running their own operators.

Let's chat about this one further next week and get to a decision on this one.

piontec commented 2 years ago

From my point of view, allowing customers to run any app they want on the MC will be really risky. We just spent 1 month trying to patch security holes in app-op and chart-op. And these are just 2 operators, and ones we know and have created ourselves. Preparing MCs so that customers can safely run anything there is going to be much harder, I think. This might also be a serious problem or even a blocker for multi-tenant clusters. And even when we solve the security of it perfectly, we're risking noisy neighbor problems that can DoS our basic services.

On the other hand, right now, nothing really blocks customers from deploying any app (with App CR) they want to a MC. That's what I was discussing today: should we actively block (for now) such attempts?
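For reference, a minimal sketch of what "deploying an app with an App CR to the MC" amounts to. The field names follow my understanding of the `application.giantswarm.io/v1alpha1` App CRD and should be checked against the installed CRD; the app, catalog, and namespace names are hypothetical. The relevant bit is `kubeConfig.inCluster: true`, which targets the cluster the CR lives in (the MC) rather than a WC.

```python
import json


def build_app_cr(app_name, cr_namespace, target_namespace, catalog, version):
    """Return an App CR manifest as a plain dict.

    Assumption: field names mirror the Giant Swarm App CRD
    (application.giantswarm.io/v1alpha1); verify before use.
    """
    return {
        "apiVersion": "application.giantswarm.io/v1alpha1",
        "kind": "App",
        # Namespace in which the App CR itself is stored.
        "metadata": {"name": app_name, "namespace": cr_namespace},
        "spec": {
            "catalog": catalog,
            "name": app_name,
            # Namespace the chart's workloads are installed into.
            "namespace": target_namespace,
            "version": version,
            # inCluster=True means "install into the cluster hosting this CR".
            "kubeConfig": {"inCluster": True},
        },
    }


# All values below are hypothetical placeholders.
app = build_app_cr(
    app_name="example-operator",
    cr_namespace="org-example",
    target_namespace="example-operator",
    catalog="example-catalog",
    version="1.0.0",
)
print(json.dumps(app, indent=2))
```

Blocking such attempts would then mean rejecting or not reconciling customer-created CRs of this shape, which is a policy decision rather than something this sketch covers.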

My idea so far was that we offer some commonly requested apps on MC, but deployed and managed entirely by us, like we do now with Flux. I was expecting we will do the same for Crossplane or Harbor.

What is blocking a customer from running Crossplane on a standard WC? The only problem I know of so far is when a customer's app needs to know the kubeconfigs for all the WCs - that's something they can get from the MC only.

cornelius-keller commented 2 years ago

The customer wants to provision infrastructure together with a cluster that is being created. They want to have the CRs in the same commit and repo as the cluster definition. They are currently developing operators to do this, for now without Crossplane.

teemow commented 2 years ago

I am sure we will see more use cases in the future where it makes sense to extend the management api. And I think in the first place we should not stand in the way of our customers and allow them to install apps. Otherwise we'll always be the bottleneck.

There are many examples. Customers want to install Harbor, Crossplane, ArgoCD or Backstage on the management cluster. Let's show them how to do this and enable them. Once we have capacity and time we can take over helm chart management and operations. But let's not block the customers in the first place.

And yes, we need to make sure this is secure and robust.

MarcelMue commented 2 years ago

I think by far the biggest concern we currently have is how to do it securely and robustly. It feels to me like RBAC will not be enough to make it work because most of the apps which were mentioned by Timo expect some kind of elevated privilege to run. For example ArgoCD will not work at all without giving it very high privileges. Crossplane needs cluster-wide access to all the CRs it manages.

So what could even be a realistic approach to make it possible for customers to install these apps? I feel currently that this would not work in a multi tenant MC.

teemow commented 2 years ago

I think we are fine disabling this for multi-tenant MCs.

What we need to find out then is:

  1. How to create a user role that is allowed to install apps on MCs and how to create roles that are not allowed to do this and don't need such high privileges?
  2. In case of high privileges which (giant swarm) secrets can be read and how can we make sure that the scope of these secrets is exactly one customer? This way we don't need to worry about higher privileges on MCs imo.

Does that make sense?

MarcelMue commented 2 years ago

I think 1. is the real issue though: No matter how we create the roles, some of these apps simply do not work with lowered permissions. We could try to bring changes to the respective upstream projects but the effort would likely be huge.

To me 2. seems to be the more sensible approach then, it's not as nice but it seems more realistic to me in a reasonable amount of effort / timeframe. Would love to hear other @giantswarm/team-honeybadger opinions though.

teemow commented 2 years ago

@MarcelMue ok let's make sure to not mix up gitops and MAPI.

With 1. I was talking about MAPI. We just need to make sure that people can access the management cluster to install an app on a WC without being able to install something on the MC and without risk that secrets leak.

The gitops pipeline account/role is a completely different beast. This is where 2. comes in.

MarcelMue commented 2 years ago

You have lost me:

> 1. How to create a user role that is allowed to install apps on MCs and how to create roles that are not allowed to do this and don't need such high privileges?

This to me means installing an app on the MC where the pods of the app are running on the MC.

> With 1. I was talking about MAPI. We just need to make sure that people can access the management cluster to install an app on a WC without being able to install something on the MC and without risk that secrets leak.

This is something we already have IMO. People can install their own apps on their WC using MAPI.

I wasn't referring to GitOps at all. Simply put: if we have an app created by a customer on the MC, where the actual application is supposed to be running on the MC, then we have large permission issues (because most upstream apps/controllers in general never assume they could be running in a restricted permission environment).

teemow commented 2 years ago

I was not sure if we can distinguish between apps on MC and apps on WC already with RBAC.

puja108 commented 2 years ago

We talked about this today with Honeybadger, and decided we need to at least discuss this again in SIG Product as we are seeing more and more customer requests for running central management tools that we currently don't offer (yet). It is already scheduled for a future SWH, but as those recently got skipped a lot and vacation time is upcoming, I'd like to at least quickly discuss in the next sync and also kick-off async thoughts here.

We know the impact on not only Honeybadger, but also on KaaS teams that are responsible for the MCs.

We also know that there are some blockers in terms of security and MC setup, as well as maybe thoughts towards better isolation.

Discussing and thinking about this feature soon is critical so we keep its implications in mind while working on the new MC setup for CAPI (cc @cornelius-keller @alex-dabija), e.g. in the way we create and manage our secrets.

puja108 commented 2 years ago

Thinking about this a bit more we might need to also scope the different parts of this epic. Something like:

  1. MCs that enable customer deployments through a secure and scalable setup (KaaS)
  2. A safe and reliable way for customers to make such deployments (Honeybadger DX)

Having the parts clear makes it easier to think about the respective work and effort for each team.

piontec commented 2 years ago

THG has deployed their first operator. We also have a PDR that lists assumptions for customers running stuff on MCs using the automation Service Account.

Should we now close this ticket and replace it with something more specific about how to continue?

puja108 commented 2 years ago

Fine to close in favor of something specific which would represent the future coverage of the feature.

teemow commented 2 years ago

Do we have docs on how to do this? We have other customers that are interested in the feature too.

kubasobon commented 2 years ago

Please note some manual actions were needed on the GS side: we needed to add RBAC that allows customers to install needed CRDs. The RBAC has been limited to CRD names needed by their operators.
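To illustrate the kind of RBAC meant here, a minimal sketch of a ClusterRole that is limited to specific CRD names via `resourceNames`. The role name and CRD names are hypothetical, and the `allows` helper is a simplified stand-in for the real authorizer (no wildcards, no API group check). Note that Kubernetes RBAC cannot restrict `create` by resource name, so the initial creation of a CRD needs separate handling; how that was done for this customer is not stated in this thread.

```python
def crd_cluster_role(role_name, crd_names):
    """Build a ClusterRole manifest scoped to the given CRD names."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRole",
        "metadata": {"name": role_name},
        "rules": [
            {
                "apiGroups": ["apiextensions.k8s.io"],
                "resources": ["customresourcedefinitions"],
                # Only these named CRDs are covered by the rule.
                "resourceNames": sorted(crd_names),
                # "create" is left out: RBAC cannot limit it by name.
                "verbs": ["get", "update", "patch", "delete"],
            }
        ],
    }


def allows(role, verb, crd_name):
    """Simplified check: does any rule permit `verb` on the named CRD?"""
    for rule in role["rules"]:
        if "customresourcedefinitions" not in rule["resources"]:
            continue
        names = rule.get("resourceNames") or []
        # As in RBAC, an empty resourceNames list means "all names".
        if verb in rule["verbs"] and (not names or crd_name in names):
            return True
    return False


# Hypothetical role and CRD names.
role = crd_cluster_role(
    "customer-crd-access", ["widgets.example.io", "gadgets.example.io"]
)
print(allows(role, "patch", "widgets.example.io"))  # True
print(allows(role, "patch", "other.example.io"))    # False
```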

QuentinBisson commented 2 years ago

Monitoring of those apps also needs to be taken into account here, and so far it has not been discussed.

kubasobon commented 2 years ago

Yes, of course. The existing rules will also have to be updated (DeploymentNotSatisfied for example).

QuentinBisson commented 2 years ago

I am mentioning this because one customer wants to be able to create their own service monitor in the MC and maybe deploy a Prometheus and Alertmanager there. This will collide with our own setup, as we currently do not restrict where to look for rules and service monitors because it would be counterproductive.
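A small sketch of why an unrestricted setup collides: with an empty label selector every service monitor is picked up, including customer-created ones. This only models `matchLabels` semantics, where an empty selector matches everything; the label key and monitor names are hypothetical, and it is not a description of how our Prometheus is actually configured.

```python
def selector_matches(selector, labels):
    """matchLabels-only label selector; an empty selector matches everything."""
    wanted = selector.get("matchLabels", {})
    return all(labels.get(key) == value for key, value in wanted.items())


# Hypothetical service monitors living side by side on the MC.
monitors = [
    {"name": "platform-monitor", "labels": {"example.io/owner": "platform"}},
    {"name": "customer-monitor", "labels": {"example.io/owner": "customer"}},
]


def select(selector):
    """Names of all monitors the given selector would pick up."""
    return [m["name"] for m in monitors if selector_matches(selector, m["labels"])]


unrestricted = select({})
restricted = select({"matchLabels": {"example.io/owner": "platform"}})
print(unrestricted)  # ['platform-monitor', 'customer-monitor']
print(restricted)    # ['platform-monitor']
```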

teemow commented 2 years ago

Let's also discuss having those metrics in our MC prom and creating a section in Grafana where customers can create their own dashboards. Imo this would be much more convenient. Bonus points if the customer can also add alertmanager routing and prometheus rules for their own metrics :smile: .

teemow commented 2 years ago

@QuentinBisson can you discuss this in Atlas? Maybe it makes sense to split off an issue. Happy to help with the specs or to set up a meeting with the customer.

QuentinBisson commented 2 years ago

Definitely, I created this issue that we need to refine and prioritize. Did we announce an ETA on this feature to customers? Once we refine it, let's set up a meeting with the customer :)

teemow commented 2 years ago

No ETA yet. They might be fine for now with the workaround you proposed using remote write. We haven't done any research into how many customers would want this too.

Letting customers add their own dashboards apart from apps on the MC is definitely something to look at separately too. This might be interesting for special cases/clusters where our dashboards aren't enough.

QuentinBisson commented 2 years ago

You are right, and it definitely will be if we monitor their apps. It is actually possible today, but we probably forgot to communicate that, at least internally, as teams could provide the dashboards with their apps instead of using the dashboards repository. Documenting it in the intranet first would be a good thing.

teemow commented 11 months ago

@weatherhog @piontec what is left to do here?

piontec commented 11 months ago

> @weatherhog @piontec what is left to do here?

It depends.

If you mean "to install some apps that don't need cluster-admin permissions and that we probably have to help with", then nothing - we've done that already.

If you mean general functionality, where customers are finally cluster-admins (without major security problems), it's tracked by this ticket, which in turn is blocked by this one. As this solves a lot of headaches for us as well, we've already started working on it.

teemow commented 11 months ago

Alright. Thanks. I've added the issues to a tasklist in the description.