Self Service Log Ingestion

Rotfuks commented 2 months ago

Motivation

We want customers to be able to ingest whatever data is relevant for them in a self service way, this also includes logs. So we need to make sure we have a way how they can add their own data sources for logs.

Todo

[ ] investigate how exactly we could empower customers to add their own log sources - for example PodLogs https://github.com/giantswarm/giantswarm/issues/29072
[ ] Implement any needed changes to make it happen
[ ] Create a documentation draft how customers can add their own sources of events/logs to be monitored
[ ] Get some feedback from AE about the documentation
[ ] Make sure the documentation is published in the new observability platform docs

Outcome

Customers can now add their own sources of events to be monitored by the observability platform
there is docs and educational content out there showing them how it's done

QuentinBisson commented 1 month ago

We see that we can use pod logs but do we want to force customers to create pod logs for log ingestion? Can we allow them to collect logs at the namespace level (with annotations and so on)?

Rotfuks commented 1 month ago

How much effort is it to create podlogs for customers? I would love to have some label based stuff where we can just say "add this label and it's automatically ingested" because that makes it quite flexible and intuitive. It will also help us with mutli-tenancy I believe.

QuentinBisson commented 1 month ago

The issue I have is not that pod logs don't make sense but I would think they should be used on really rare occasions. Ideally, an annotation/label on the pod or namespace should be enough to get most the tenant for most log and that would make profiles and traces collection easier. I would only use pod logs if the pod needs a custom pipeline imo

What i'm not sure is if we can get alls logs for a namespace if it's annotated unless the pod has it's own label and unless it's equipped with a pod log?

I would think we could do something with drops but I'm not sure. Maybe @TheoBrigitte knows if log sources can exclude data taken from other sources?

TheoBrigitte commented 1 week ago

When using Alloy as logging agent installed within a workload cluster, we configure it in a way which would allow to retrieves logs from specific namespaces and/or pods.

This solution makes use of 2 differente PodLogs (with mutual exclusion):

One PodLog to select all pods from namespaces with a specific label podlogs_ns.yaml.txt
One PodLog to select all pods with a specific label podlogs_pod.yaml.txt

Those PodLogs would be configured by us and customers would only deal with labels on their resources.

With this solution we might face a problem with resources usage on the Kubeletes, as all the log traffic would happen via the Kubernetes API the network and CPU usage on Kubelet might be problematic especially in cases where many/all pods would be monitored. Alloy does not currently provide another way to select targets based on their namespace. The usual loki.source.file does not suffer from Kubelet resources usage problem as logs are directly retrieved from the node where Alloy is running, but it does not allow to select pods by namespace.

I opened an upstream issue requesting to add the namespace metadata within the discovery.kubernetes component, this would allow us to avoid using PodLogs and suffering from their overhead.

QuentinBisson commented 1 week ago

Did you take a look at this?https://grafana.com/docs/alloy/latest/reference/components/loki/loki.source.kubernetes/

TheoBrigitte commented 1 week ago

Did you take a look at this?https://grafana.com/docs/alloy/latest/reference/components/loki/loki.source.kubernetes/

Looking it, this would be simpler than the currently used local.file_match in our solution, but I also do not see the benefit over loki.source.podlogs, you get rid of the need for PodLogs resources but also loose the capability to filter on namespaces labels and you still have the network and CPU overhead on the Kubernetes API server.

QuentinBisson commented 1 week ago

I quite like that we do not have to run it as a daemonset though :D

But why do you not have the namespace ? I thought those should give you __meta_kubernetes_namespace in the loki.process or relabel phase?

QuentinBisson commented 1 week ago

Oh you meant namespace labels,nevermind

TheoBrigitte commented 1 week ago

Using a combination of loki.relabel and loki.source.podlogs components it is possible to set the tenant id based on a given label from the pod or its namespace.

In the following example the tenant id is taken from the pod label foo.

Here is the config and the PodLog resource I used

Alloy config


loki.source.podlogs "default" {
forward_to = [loki.relabel.default.receiver]
}

loki.relabel "default" { forward_to = [loki.write.default.receiver]

rule { action = "replace" source_labels = ["foo"] target_label = "__tenant_id__" replacement = "$1" regex = "(.*)" }

rule { action = "labeldrop" regex = "^foo$" } }

loki.write "default" { endpoint { url = "https://loki.svc/loki/api/v1/push" } }

* PodLog (note: this will select all pods from all namespaces, change the selectors to fit your need)
```yaml
apiVersion: monitoring.grafana.com/v1alpha2
kind: PodLogs
metadata:
  name: pod-tenant-id-from-label
spec:
  selector: {}
  namespaceSelector: {}
  relabelings:
  - action: replace
    sourceLabels: ["__meta_kubernetes_pod_label_foo"]
    targetLabel: "foo"
    replacement: "$1"
    regex: "(.*)"

It is also possible to set the tenant id using the loki.process component which has a tenant stage which allow for exactly this; setting the tenant id, but from there only log entry content are accessible. More info at https://grafana.com/docs/alloy/latest/reference/components/loki/loki.process/#stagetenant-block

giantswarm / roadmap