giantswarm / roadmap

Giant Swarm Product Roadmap
https://github.com/orgs/giantswarm/projects/273
Apache License 2.0

Log-Receiver: Investigation #3567

Open Rotfuks opened 1 month ago

Rotfuks commented 1 month ago

Motivation

We need to make sure customers can receive logs from outside the installations. For this we first need to find out how exactly we can achieve this: do we need a new component, can we reuse an already existing one, or do we have to create our own thing?

Investigation

Outcome

QuentinBisson commented 1 month ago

Hey @giantswarm/team-atlas so I've been doing some investigation on this topic for a while as I was getting into open-telemetry and I think we could do it in multiple ways:

I definitely would be in favor of solution 2 because I think it's the most useful one future-wise, but it will most likely take longer.

Now, to my point about the api keys, that is something we could start thinking about today. Do we think it makes sense to move to some kind of PKI for this?

Rotfuks commented 1 month ago

I would also support the second solution because it's the most secure and future-safe approach. I don't want to make the platform less secure and keep a legacy agent we wanted to get rid of for such a niche feature.
So what do we need to find out to have a full concept? Do we want to create follow-up tasks or is it fine to do it here? Next step for me would be to have a rough concept of a target state that we want to achieve here.

QuentinBisson commented 1 month ago

I need to draw something, yes, so we can discuss it as a team tomorrow and find the end state we all want. I will try to do it later tonight to explain where I think our observability platform should go to be able to support more features and ideally OTel OOTB. I wanted to do it today but life got in the way.

Once we have this, we can agree on steps we want to do in the implementation phase :)

QuentinBisson commented 1 month ago

Here is the schema:

Image

I'm not adding anything related to secret management, but it should be here.

QuentinBisson commented 1 month ago

@giantswarm/team-atlas As a rough plan for this, what I am envisioning is to:

  1. Deploy an instance of Alloy on the MC acting as an OTLP receiver to be able to receive logs. I'm not sure how auth would work because the receivers (both OTLP and loki.source.api) don't support any kind of auth, so this would require a gateway in front that checks for auth. Could be the multi-tenant gateway :) The OTLP receiver with include_metadata ensures headers are forwarded in the pipeline context (useful for the tenant id).
  2. Define a new CRD (not sure what to name it, maybe source.observability.giantswarm.io, but this can have another name later like datasource, apikey, datacollector, whatever works) to define a source of data that is managed by the observability operator. The idea would be that when we create it, it creates an API key secret (linked in the CR status) for the source of data, so customers could use this CR to get a secret; we could use this for Teleport logs. Idea for improvement: the observability operator could create a source CR for each WC, and the logging operator would use the created secret as a source instead of also creating secrets.
  3. Configure the Alloy gateway ingress to check header secrets and have logs in Loki, which I would assume means we need the multi-tenant proxy to be deployed as a standalone component outside of the loki namespace. This also means the MTP should not set the tenant anymore; it should be coming as the X-Scope-OrgID header from promtail?
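
A minimal Alloy sketch of the receiver in step 1 might look like this (the component wiring and the in-cluster Loki URL are assumptions, not the final design):

```alloy
// Hypothetical wiring: OTLP receiver on the MC forwarding logs to Loki.
otelcol.receiver.otlp "external" {
  http {
    include_metadata = true // forward headers (e.g. tenant id) in the pipeline context
  }
  grpc {
    include_metadata = true
  }
  output {
    logs = [otelcol.exporter.loki.default.input]
  }
}

// Convert OTLP logs to Loki entries.
otelcol.exporter.loki "default" {
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://loki-gateway.loki.svc/loki/api/v1/push" // assumed in-cluster URL
  }
}
```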

Maybe you can think of an easier way to make this work for now? For instance, I assume we could leave the multi-tenant gateway where it is now, have the observability operator configure the API key, and have the OTLP receiver send logs to Loki by bypassing the gateway, so we can move forward and do the better solution later, but I'm not sure I like this.

The main idea here is to make sure @QuantumEnigmaa and @TheoBrigitte can work on the implementation phase if we think this is legit :)

Rotfuks commented 1 month ago

Can we shift the perspective slightly and look at it from a customer journey perspective as well? 
With that setup, and with our fully self-service platform mindset, how would the customer then configure new sources of data from outside? By extending the CRD or creating a CR for every new data source? And does that CR then live in the observability folder of their installation repo?

QuentinBisson commented 1 month ago

In that journey, they could create the CR with whatever name they want, and the operators would generate a secret they would need to get to configure their log shipper. It's the best we can do without any UI integration.

Rotfuks commented 1 month ago

So they need a logshipper that sends the data to alloy which only receives but doesn't scrape? :) 

For example: customer A wants to get the logs of a Cloud Service Database that is connected to their app in the cluster. So they set up fluentbit or whatever tool they like with access to the DB app, then they create the cloudDB-CR which generates a secret. Now where do they access that secret? 

Once they've accessed it and have the secret, they add it to fluentbit, with the target to send it to (where do they get that target?) and finally, babam, logs in Grafana?

QuentinBisson commented 1 month ago

So yes, they create a CR on the MC, and they check the status of that CR on the MC to get the name of the secret, and get that secret value on the MC as well. Maybe it's not the best user journey, but I'm not sure any other would be approved by security.

Once they have the secret, they should send data to our Alloy on the MC, which is one of the main reasons why a single ingress for observability would be helpful.
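
Taking the fluentbit example above, the shipper side of that journey might look roughly like this (the endpoint, tenant id, and key placement are all placeholders, not a confirmed interface):

```ini
# Hypothetical fluent-bit output sending logs to the Alloy OTLP gateway on the MC.
[OUTPUT]
    Name       opentelemetry
    Match      *
    Host       otlp.my-installation.example.com
    Port       443
    Tls        On
    Logs_uri   /v1/logs
    # API key from the generated secret, plus the tenant id header
    Header     Authorization Bearer ${OMEGA_API_KEY}
    Header     X-Scope-OrgID customer-a
```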

marieroque commented 1 month ago

I like the idea of the gateways: observability-gateway (Alloy for now IIUC) in the MC and o11y-data-gateway in the WC.

I'm fine with the Source CRD to allow the customer to add new data sources and get credentials to send their data to us.

I like the way you propose to configure the observability bundle.

The tenant/organization configuration is still not clear to me.

The topology type is a nice idea, but not sure it's the priority.

QuentinBisson commented 4 weeks ago

So coming back to use cases for @Rotfuks because I'm on my laptop today and I can explain better :D

I'll call the CRD Omega because I don't want to bias anyone's opinion, not even my own, on how we should name it.

How to send logs to our Managed Loki

1. Generate an API Key

Remarks

2. Configure the application:

Remarks

3. Go to grafana and see your logs :)

Remarks

It could be nice to have a view of that pipeline in some kind of blocks in Grafana, I guess like nameoftheomegatype -> alloy -> loki, to be able to debug where it's blocked if possible? But we can improve with customer feedback.

@Rotfuks maybe something for you: It would be nice to check with honeybadger if the generated secret could be sent via Flux to the gitops repo (encrypted by SOPS), but I think it's highly unlikely; and maybe check with shield if they think this is a good way to generate API keys?

@Rotfuks this is definitely something for you: @stone-z brought up that us ingesting customer logs will need to be discussed regarding ISO

QuentinBisson commented 4 weeks ago

Interesting idea that came from a discussion with honeybadger: we should probably use the external-secrets operator to push the secret back to customers (https://external-secrets.io/latest/api/pushsecret/) and most likely to create the API key as well, via https://external-secrets.io/latest/api/generator/password/ (less code for us).
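
A rough sketch of how those two external-secrets APIs could fit together (all names and the target SecretStore are made up for illustration):

```yaml
# Hypothetical: generate the API key with the Password generator...
apiVersion: generators.external-secrets.io/v1alpha1
kind: Password
metadata:
  name: omega-api-key-generator
spec:
  length: 42
  digits: 10
  symbols: 0
  noUpper: false
  allowRepeat: true
---
# ...and push the resulting secret out to a customer-accessible store.
apiVersion: external-secrets.io/v1beta1
kind: PushSecret
metadata:
  name: omega-api-key-push
spec:
  refreshInterval: 1h
  secretStoreRefs:
    - name: customer-store        # assumed SecretStore
      kind: SecretStore
  selector:
    secret:
      name: omega-api-key         # secret created for the Omega CR
  data:
    - match:
        secretKey: password
        remoteRef:
          remoteKey: omega-api-key
```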

QuentinBisson commented 4 weeks ago

Also, maybe a better idea: let's not use API keys but OIDC in front of the gateway so tokens are rotated?

QuentinBisson commented 4 weeks ago

Let's wait for feedback from @giantswarm/team-bigmac https://gigantic.slack.com/archives/C053JHJC99Q/p1722015045075429

Rotfuks commented 1 week ago

Alright, BigMac is sadly completely overloaded with topics already. Let's talk once you're back about what exactly we need from BigMac and how we can reduce the dependency on them. Maybe we can boil it down to a kickoff workshop so we can do the PoC on our own. I'll discuss it further.

TheoBrigitte commented 4 days ago

I like the idea of using Alloy as our OpenTelemetry gateway; it supports a wide variety of receivers (OpenTelemetry, Datadog, Jaeger, Kafka, etc.). But I would also like to know what the use cases are and which receivers we should support. I am also unsure how it would perform compared to exposing Loki or Mimir directly, but that would also limit our capabilities in terms of receivers and would expose critical services like Mimir to the outside.

I would also be interested in defining a high-level user journey with this new solution.

1. Generate an API Key

Remarks

  • I would think secret rotation would be quite easy to do: delete the secret and let the operator recreate it if the one in the status does not exist

I would rather have the user create a new Omega CR to get a new API key, rather than have them delete the secret attached to the current Omega CR; this makes things complicated IMO.

2. Configure the application:

  • Customers would need to configure the api_key, the OTLP gateway endpoint, and the tenant id header in their log shipper (as long as it supports OTLP). Teleport would fall under that category

We first need to figure out how we implement Authentication and how we add support for the different protocols (http, grpc, thrift_http, and the like)

  • Operators get the secret from the MC and configure the logging agents on the WCs

What do you mean by this?

3. Go to grafana and see your logs :)

I think it would be good to point users to where and how they can visualize their data in Grafana.

Remarks

It could be nice to have a view of that pipeline in some kind of blocks in Grafana, I guess like nameoftheomegatype -> alloy -> loki, to be able to debug where it's blocked if possible? But we can improve with customer feedback.

This would be a view for us for debugging purposes, right?

QuentinBisson commented 4 days ago

I like the idea of using Alloy as our OpenTelemetry gateway; it supports a wide variety of receivers (OpenTelemetry, Datadog, Jaeger, Kafka, etc.). But I would also like to know what the use cases are and which receivers we should support. I am also unsure how it would perform compared to exposing Loki or Mimir directly, but that would also limit our capabilities in terms of receivers and would expose critical services like Mimir to the outside.

I would also be interested in defining a high-level user journey with this new solution.

So originally, at least for receiving logs, I wanted to reduce the surface by only opening the default OTLP ports (HTTP and gRPC) as they are usually supported pretty well, and we can see with time if we need to open more things. The main advantage is that the gateway would act as an authentication proxy and reduce the attack surface by a lot because, well, we would have only one ingress.

1. Generate an API Key

Remarks

  • I would think secret rotation would be quite easy to do: delete the secret and let the operator recreate it if the one in the status does not exist

I would rather have the user create a new Omega CR to get a new API key, rather than have them delete the secret attached to the current Omega CR; this makes things complicated IMO.

We can discuss that at the end of the week for sure. We are currently having discussions to see if we can use OIDC instead of API keys to actually make it more secure, so I did not spend too much time investigating this. The recreation part is also because our operators use the secret, so we cannot really create a new secret so easily.

2. Configure the application:

  • Customers would need to configure the api_key, the OTLP gateway endpoint, and the tenant id header in their log shipper (as long as it supports OTLP). Teleport would fall under that category

We first need to figure out how we implement Authentication and how we add support for the different protocols (http, grpc, thrift_http, and the like)

I linked this page in the early discussions, https://grafana.com/docs/alloy/latest/reference/components/otelcol/otelcol.receiver.otlp/, which explains how to enable the OTLP receiver, and there are also extra components with auth in them, https://grafana.com/docs/alloy/latest/reference/components/otelcol/otelcol.auth.bearer/, but ongoing discussions are going towards OIDC and either Dex or SPIFFE/SPIRE; we need to see if that is achievable in a feasible timeline, which I highly doubt.
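
On the client side, the linked otelcol.auth.bearer component would sit on the shipper's exporter, roughly like this sketch (the endpoint and the env var name are placeholders):

```alloy
// Hypothetical shipper-side config: authenticate the OTLP exporter with the API key.
otelcol.auth.bearer "creds" {
  token = sys.env("OMEGA_API_KEY") // placeholder env var holding the generated key
}

otelcol.exporter.otlp "gateway" {
  client {
    endpoint = "otlp.my-installation.example.com:4317" // assumed gateway endpoint
    auth     = otelcol.auth.bearer.creds.handler
  }
}
```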

  • Operators get the secret from the MC and configure the logging agents on the WCs

What do you mean by this?

This was me explaining how our operators will also use the Omega CRD

3. Go to grafana and see your logs :)

I think it would be good to point users to where and how they can visualize their data in Grafana.

Yes, and I hope this gets built into Backstage.

Remarks

It could be nice to have a view of that pipeline in some kind of blocks in Grafana, I guess like nameoftheomegatype -> alloy -> loki, to be able to debug where it's blocked if possible? But we can improve with customer feedback.

This would be a view for us for debugging purposes, right?

For us and customers, yes, kinda like the Alloy UI but in Grafana :)

stone-z commented 4 days ago

Just to echo my comments in the internal threads, I think a homegrown API key mechanism is the wrong way to go here. The tools exist in the ecosystem to use an identity-based authn/z scheme. Aside from being more secure, we already have other use cases for it, it is a great platform feature, and it ends up being less work in the long run anyway.

QuentinBisson commented 4 days ago

I totally agree here @stone-z, but you know I'm a bit skeptical when it comes to a possible timeline for, say, SPIRE :D