giantswarm / roadmap

Giant Swarm Product Roadmap
https://github.com/orgs/giantswarm/projects/273

Flux (and apps) in MC logs visibility to customers #948

Closed by gianfranco-l 1 year ago

gianfranco-l commented 2 years ago

From a customer request:

In the context of Management cluster auth via Dex against Okta, the customer requested:

Need logs visibility for Flux (as an admin)

gianfranco-l commented 2 years ago

@marians @pipo02mix if some context is missing, please add it :) cc @giantswarm/team-honeybadger @MarcelMue

kubasobon commented 2 years ago

I'm quite strongly against allowing (unfiltered) log access. It's a managed solution - customers should not have access to anything in flux-system on principle. I'd rather we expose the Flux-related Prometheus metrics, preferably through an exporter that lets us filter them, than give direct access to logs.
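
For illustration, a minimal sketch of that metrics route - assuming the Prometheus Operator CRDs are available on the MC and that the Flux controllers expose a metrics port named http-prom (both assumptions, to be verified):

# Sketch only: scrape Flux controller metrics instead of exposing logs.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: flux-system
  namespace: flux-system
spec:
  namespaceSelector:
    matchNames:
      - flux-system
  selector:
    matchExpressions:
      - key: app
        operator: In
        values:
          - kustomize-controller
          - source-controller
          - helm-controller
          - notification-controller
  podMetricsEndpoints:
    - port: http-prom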

marians commented 2 years ago

(1) and (2) are very much separate requests. I suggest splitting them. (2) is quite customer specific, so potentially worth discussing in a non-public repo.

pipo02mix commented 2 years ago

IMO it makes sense to give them view access to the Flux that manages the apps/clusters they install (not the one managing our components). If we put that in a dashboard, even better.

piontec commented 2 years ago

The problem is that even if we figure out the permissions problem reasonably, there's no way we can do org separation in logs. I really don't see much value here, as Flux's CR statuses are pretty verbose. Do you have a specific use case where logs were really needed to debug a problem?

marians commented 2 years ago

To my understanding, org separation is not an issue here. This is about access for customer admins, who are currently bound to the cluster-admin ClusterRole via the customer's default admin group (in this case customer:rg_okta_giantswarm-admin).
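
For context, a sketch of what such a group binding typically looks like. The real bindings are managed by rbac-operator, the object and namespace names below are made up, and the binding is presumably namespaced (per organization namespace) rather than cluster-wide - otherwise the flux-system logs would already be readable:

# Illustrative only; a RoleBinding may reference the cluster-admin ClusterRole
# to grant admin rights within a single namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: customer-admin        # illustrative name
  namespace: org-example      # illustrative organization namespace
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: "customer:rg_okta_giantswarm-admin"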

gianfranco-l commented 2 years ago

we will bring this to SIG Product to discuss whether we should provide customers with access to logs for apps on the MC

gianfranco-l commented 2 years ago

What do you think of this @JosephSalisbury @weatherhog @puja108 ?

JosephSalisbury commented 2 years ago

I reckon this can be done with https://github.com/giantswarm/roadmap/issues/311, sure

up to @weatherhog for priority, but iirc it's soon

puja108 commented 2 years ago

From reading the above it is still unclear to me what the customer requested (what logs do they want to see exactly? which Flux? what resources are they interested in?) and, more importantly, what their use case for seeing those logs is, i.e. what are they trying to achieve, how is it failing, and why do they need log access to debug that. Only when the use case is clear can we realistically evaluate what makes sense going forward and whether log access in any form would really solve it and/or what alternative we'd rather offer.

pipo02mix commented 2 years ago

the use case would be

As a cluster admin, I would like to debug why some (managed) apps are not installed in my workload cluster

I am not sure whether they need logs or whether the status is enough. But IMO giving them logs from Flux is not something that will hurt us.

marians commented 2 years ago

Regarding which logs the customer wanted to see, the request is quite specific: those of the kustomize-controller-* pod in the flux-system namespace.

JosephSalisbury commented 2 years ago

Discussion in SIG Product Sync:

teemow commented 2 years ago

From a product point of view:

Admins on the customer side should have access to all logs on the management cluster. We want to be transparent. Admins should be able to understand what is going on. They should also be able to debug their own actions or a misbehaving management cluster.

So an admin role should have the permission to access logs of all pods.
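
Taken literally, that would be a cluster-wide read-only role along these lines - a sketch of the idea only, not an existing Giant Swarm role, and whether we want pods/log everywhere is exactly what the rest of this thread debates:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: read-all-pod-logs     # illustrative name
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]
- apiGroups: [""]
  resources: ["pods/log"]     # the log subresource only needs "get"
  verbs: ["get"]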

ghost commented 2 years ago

@giantswarm/team-atlas, tangent question: does our Loki support SSO?

weatherhog commented 2 years ago

@PrashantR31 Loki itself does not provide any authentication mechanisms, so in the end this is a no to SSO support. If authorisation is needed for Loki, it mainly depends on the customer's preferred way. It could be a combination of Loki and oauth2-proxy, or it could be Loki SaaS.

ghost commented 2 years ago

Thanks @weatherhog, too bad Loki does not have SSO support.

But what I was thinking is:

  1. Loki could be installed on the MCs where Flux is installed, and all the logs would go there for admin access only.
  2. If Grafana is on the MC, we could provide just a few basic dashboard views for the customer, or even use Grafana Cloud (if this is possible).
  3. Extending that, if we add alerting on logs, that could give the customer admin users a complete observability solution.
  4. Grafana has SSO capabilities, so we could restrict certain dashboards to the admin group only. These are some thoughts; I don't know how much of this is feasible or in line with what we are providing.

teemow commented 2 years ago

Fyi @PrashantR31 Atlas already has a story about what you described #311

ghost commented 2 years ago

Action from refinement in Rainbow:

marians commented 2 years ago

As a spec, we would need the following Role in the flux-system namespace:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: read-pods-and-logs
  namespace: flux-system
rules:
- apiGroups: [""]
  resources:
  - pods
  - pods/log
  verbs:
  - get
  - list

To assign this role to the customer's cluster admin group, a RoleBinding like the following would be required:

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-flux-logs
  namespace: flux-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: read-pods-and-logs
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: "WRITE_ALL_GROUP_NAME"

The value for WRITE_ALL_GROUP_NAME can be found in https://github.com/giantswarm/config/installations/INSTALLATION/apps/rbac-operator/configmap-values.yaml.patch (example).

marians commented 2 years ago

Next step: manually deploy, get customer confirmation.

ghost commented 2 years ago

During planning, Gianfranco will check if HoneyBadger can now take over, as the spec was already provided by Marian.

gianfranco-l commented 2 years ago

added to current sprint, someone from the team @giantswarm/team-honeybadger will pick this up and implement it :)

puja108 commented 2 years ago

This also came up with another customer, and they'd like to get log access to more than just Flux. As Timo mentioned above, we should see if we can make access to logs a default for all pods/namespaces (or have just a few exceptions where we see privacy concerns).

As this becomes a general request around access on the API level, would Rainbow feel responsible for it, or at least drive it with the teams' help?

MarcelMue commented 2 years ago

I tried to look at the implementation in a "quick and dirty" way, but we have to do a little more thinking:

To me the best option seems to be to add a new operator resource to rbac-operator which handles these variable bindings. Does anyone have a better idea, @giantswarm/team-honeybadger?

puja108 commented 2 years ago

Alternatively, we could make it a default and would not need to react to variables, or not?

MarcelMue commented 2 years ago

Alternatively, we could make it a default and would not need to react to variables, or not?

I think the issue is that the WRITE_ALL_GROUP_NAME is somehow linked to the OIDC provider used and therefore variable :/

marians commented 2 years ago

Yes, every customer has their own idea of group naming in their identity provider. So the value for WRITE_ALL_GROUP_NAME is individual.

MarcelMue commented 2 years ago

Honeybadger discussed 2 options:

  1. We create a dedicated app which only installs this RBAC Role & RoleBinding.
  2. We add the permissions of this role to the automation role which can then be impersonated by the customer.

We will check which option is less effort to implement and go from there. If both options are high effort, we will re-evaluate.

pipo02mix commented 2 years ago

Any update on which option is the chosen one?

gianfranco-l commented 2 years ago

Both options require quite some effort: what started as a "quick workaround" actually means quite some effort for what is just a workaround. Given that the product solution is in the works by the relevant team, and team Honeybadger has other, higher priorities at the moment (and no bandwidth), this is currently being postponed to a later stage (it has been moved to the "future > 6 months" board of the roadmap).

paurosello commented 1 year ago

@gianfranco-l what is the main blocker for adding the relevant permissions to the current automation SA? This is a relevant request for users of Flux and would remove many support requests.

Would be happy to work with the customer to make this work.

gianfranco-l commented 1 year ago

ping @piontec

piontec commented 1 year ago

The main issues that we see with that are:

  1. We still don't get the "why": Flux presents reasonable messages in its CRs; we use them all the time and very rarely check the logs. What does the customer need that is in the logs only? We also never evaluated the logs in terms of how much is available there. For sure, we would expose some information about all the Organizations (but I guess that's OK for the automation SA anyway).
  2. This solution is just ugly. We need a highly targeted, custom access rule for a single deployment - sure, we can add it, but turning the automation SA into a bag of everything that doesn't fit anywhere else seems like an ugly idea to me.

To wrap it up: I think it really makes sense to wait for the MC-wide central logging solution (#311) that will solve this on a general level. It would also help to understand which log entries are that important.

paurosello commented 1 year ago

Yeah, we do want to go deep on understanding the why, but there are some cases where it's useful:

Also I would say that this is not only for Flux, as other components might be coming like Crossplane, Secret Operator...

I agree that the automation SA might not be the right place; we could make another one to list pods and read logs from the giantswarm namespace.

We are hindering advanced customers from debugging and forcing ourselves to do it instead, which I think is counterproductive. If in the future we can provide something nicer for them to check the logs, that would be amazing, and we may pick up some learnings from the way they use the logs.

Would you agree on having a new SA to read all logs from the giantswarm namespace as a solution until we have the full logging platform? I would organize among the SAs to make this work.

piontec commented 1 year ago

I think you mean a new (Cluster)Role that will be assigned to customers' and automation accounts, right? I like it as an intermediary solution until we get a real logging solution - it will be easy to revert and remove when the time comes. But I think we should make it "a curated list of pods in giantswarm", not "everything" - we need to think about what we could leak in the logs, especially anything that is shared across MCs. So I would make it an opt-in list, where for each pod to be included we do at least a short evaluation of 'what you can find in this log'.

paurosello commented 1 year ago

Yeah, that is it, a new ClusterRole. Not sure we can limit by pods inside a namespace though.

Not sure how we can evaluate what logs come out of every operator, but I would assume no credentials are leaked at all - maybe account IDs and other infra resource names, but I think those are also available in the CRs. In any case, I will try to find out if we can limit to specific pods, and then we add them as needed.
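
One caveat to note while checking that: RBAC can narrow a rule with resourceNames, but those must match exact object names, and operator pods carry generated ReplicaSet hashes, so a per-pod rule like the sketch below (pod name entirely hypothetical) would stop matching after every rollout - which is probably why limiting to specific pods inside a namespace is impractical:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: read-selected-pod-logs   # illustrative name
  namespace: giantswarm
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get"]
  resourceNames:
  - app-operator-6c9f7d4b58-x2kqp   # hypothetical generated pod name, changes on every redeploy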

piontec commented 1 year ago

If we leak no credentials, I think we're good - that was my only concern :+1:

teemow commented 1 year ago

Imo your last comment @piontec should be our only concern. The "why" isn't the problem. I think @paurosello described it well and imo customers should have access to all logs on the management cluster. There will also be different roles in the future. But next to central logging we need to have native logging (kubectl logs) as well.

At the moment I would see four use cases and we might already separate them

Going component by component is a first option. And then we need a way to become confident enough to create a role that opens up all logs.

@piontec @gianfranco-l depending on the amount of work that is necessary, can you check if you can enable flux logs so this story becomes unblocked?

@TheoBrigitte @weatherhog can you take the general story around access to MC logs into Atlas and think about it more holistically? Afaik you already have a story about letting customers monitor their own controllers on the management clusters too.

QuentinBisson commented 1 year ago

@teemow We will provide customers access to the logs once we have a logging solution up and running (one central Loki).

The main issues will arise from how we want to do multi-tenancy, but this is in refinement.

teemow commented 1 year ago

@QuentinBisson what is the reason to wait for central logging? As I mentioned I think customers should be able to use a simple kubectl logs for all components in the MC too.

QuentinBisson commented 1 year ago

I was answering about the log story regarding Atlas and, after rereading, I understand how it was misinterpreted 😅. I agree customers should be able to access the logs using kubectl logs on the MC anyway.

One interesting thing for us would be to ask customers how they are using MC logs, as it could be useful for us once we have Loki up and running.

teemow commented 1 year ago

I agree customers should be able to access the logs using kubectl logs on the MC anyway.

My question was if Atlas can take this too. Somebody needs to think about the general concept of how we structure access to logs on the management cluster.

piontec commented 1 year ago

Flux has a strict set of pods, so targeting them with RBAC shouldn't be that hard. @paurosello, have you already started? Do you have something you can share?

@gianfranco-l , as we need to prioritize this, I'm putting it back into our Inbox.

@teemow Generally speaking, we can "just" grant the (Cluster)Role access to pods/log everywhere, but I really don't feel comfortable doing that - personally, I have no idea what we might leak in the logs. Is there a reasonable (time, effort) way to check this? Or are we sure that operators we run (ours, external, upstream) never log anything like access tokens?

QuentinBisson commented 1 year ago

I agree customers should be able to access the logs using kubectl logs on the MC anyway.

My question was if Atlas can take this too. Somebody needs to think about the general concept of how we structure access to logs on the management cluster.

Yes, we can create the cluster role, with potential restrictions if we think they're needed, but the binding from customer to role should live elsewhere (i.e. be done by another team).

@giantswarm/team-honeybadger how do you currently create cluster roles at the MC level so we could do the same?

teemow commented 1 year ago

@QuentinBisson my point was that Atlas should think about whether and which roles are needed and takes full ownership of how we allow customers to access logs on management clusters :pray:

@piontec as mentioned above (https://github.com/giantswarm/roadmap/issues/948#issuecomment-1256886603) we need to assess and build up confidence. A simple role to access all pods/log is not what I want. But I would leave the assessment and the structure of the roles to Atlas. I only asked if Honey Badger can unblock this by adding a role for flux.

mproffitt commented 1 year ago

A new role has been introduced in the flux-system namespace, with a RoleBinding to default:automation and the customer cluster admin group.

This has been tested by Adam from THG and provides him with the log access he requires for Flux.
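
For reference, a likely reconstruction of that binding based on Marian's spec above; the object name is illustrative and the actual manifest may differ:

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-flux-logs            # illustrative name
  namespace: flux-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: read-pods-and-logs
subjects:
# the automation service account (default:automation)
- kind: ServiceAccount
  name: automation
  namespace: default
# the customer's cluster admin group (the WRITE_ALL_GROUP_NAME value)
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: "WRITE_ALL_GROUP_NAME"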

gianfranco-l commented 1 year ago

added the Atlas label and put this on the Atlas board so that they can work on their part of the scope as outlined here

teemow commented 1 year ago

@gianfranco-l @mproffitt we have another request to give access to app-operator logs to debug the installation of an app (ArgoCD) on the management cluster. Can @mproffitt enable this similarly to how he did this for flux?

@TheoBrigitte as you can see this is getting more and more important. Imo customers would also like to inspect logs of capi controllers soon. So a more general set of permissions that allow customers to read the logs on the management cluster would be good. And I don't mean yet via loki. Just via kubectl logs.

mproffitt commented 1 year ago

@teemow Exposing the logs from pods within the giantswarm namespace isn't a problem, and I already had an internal ticket for this at https://github.com/giantswarm/giantswarm/issues/24033. However, it is not possible to restrict access to just the affected pods ((app | chart)-operator-*): customers will gain access to all logs inside that namespace, and we have to question whether that is something we want to allow, or whether there are valid reasons we shouldn't be allowing access to all logs inside that namespace.

IMHO requests seem to be leaning towards granting more and more privileges to things we own on the management clusters, and this is something I consider potentially dangerous. We need to consider how we can best protect our resources whilst enabling the customer to achieve their own product vision.