grafana / grafana

The open and composable observability and data visualization platform. Visualize metrics, logs, and traces from multiple sources like Prometheus, Loki, Elasticsearch, InfluxDB, Postgres and many more.
https://grafana.com
GNU Affero General Public License v3.0

Explore Metrics - Explore Logs Integration #94727

Open itsmylife opened 2 weeks ago

itsmylife commented 2 weeks ago

We should be able to see/reach relevant logs from the metrics app. Design Doc: https://docs.google.com/document/d/1vFqk-Cs_zw5vR-TkuhLI85Fa0709YV07tc3hzIGZtlA/edit

Phase 1: PoC to show logs only for Loki recording rules https://github.com/grafana/grafana/pull/94656

Tasks to ship PoC:

itsmylife commented 1 week ago

@catherineymgui Should we make the logs view fill all the space underneath the selected metric view?

itsmylife commented 1 week ago

@zhehao-grafana Could you please help us with the instrumentation? What metrics/events do we need to track?

zhehao-grafana commented 1 week ago

The four of us need to sync on what the initial flow should look like, since I don't see a consensus in the design doc. I want to make sure we are aligned before adding things like event tracking.

itsmylife commented 2 days ago

When there are multiple recording rules with the same name but different labels, the user has to pick the distinguishing label to see the logs related to a specific recording rule. I know this sounds confusing, so please check the following recording rule data.

{
  "rules": [
    {
      "name": "loki_tenant:query_count:lookback_period",
      "query": "sum by (cluster,namespace,org_id)(count_over_time({container=\"query-frontend\", namespace=~\"loki.*\"} |= \"caller=metrics.go\" |= \"start_delta=\" | logfmt | start_delta<=3h0m0s[2m]))",
      "labels": {
        "period": "less than 3h"
      },
      "health": "ok",
      "type": "recording",
      "lastEvaluation": "2024-10-30T16:47:48.398968109Z",
      "evaluationTime": 2.846429586
    },
    {
      "name": "loki_tenant:query_count:lookback_period",
      "query": "sum by (cluster,namespace,org_id)(count_over_time({container=\"query-frontend\", namespace=~\"loki.*\"} |= \"caller=metrics.go\" |= \"start_delta=\" | logfmt | ( start_delta>3h0m0s , start_delta<=12h0m0s )[2m]))",
      "labels": {
        "period": "between 3h and 12h"
      },
      "health": "ok",
      "type": "recording",
      "lastEvaluation": "2024-10-30T16:47:51.245410108Z",
      "evaluationTime": 2.218408009
    },
    {
      "name": "loki_tenant:query_count:lookback_period",
      "query": "sum by (cluster,namespace,org_id)(count_over_time({container=\"query-frontend\", namespace=~\"loki.*\"} |= \"caller=metrics.go\" |= \"start_delta=\" | logfmt | ( start_delta>12h0m0s , start_delta<=24h0m0s )[2m]))",
      "labels": {
        "period": "between 12h and 1d"
      },
      "health": "ok",
      "type": "recording",
      "lastEvaluation": "2024-10-30T16:47:53.463833499Z",
      "evaluationTime": 2.130151779
    }
  ]
}

In the JSON above, the same rule name (loki_tenant:query_count:lookback_period) appears multiple times with different queries and labels. We will show the logs, but display a warning in such cases to guide the user to apply a period filter so that we can show more specific logs. The period filter is specific to this recording rule; other rules might be distinguished by other labels.

But for the sake of the PoC, we will only show the logs of the first matching recording rule. The logic I explained above will be implemented later on.

cc: @zhehao-grafana @graph-andrew
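
To make the selection logic described above concrete, here is a minimal sketch in Python. It is not the code from the PoC PR; the function names (`distinguishing_labels`, `pick_rule`) and the warning text are hypothetical, and the rule data is a trimmed copy of the payload quoted above. It finds the label keys whose values differ between rules that share a name, and falls back to the first matching rule when the user has not narrowed the selection.

```python
from typing import Optional

# Trimmed copy of the rule data quoted above (only the fields this sketch needs).
RULES = [
    {"name": "loki_tenant:query_count:lookback_period",
     "labels": {"period": "less than 3h"}},
    {"name": "loki_tenant:query_count:lookback_period",
     "labels": {"period": "between 3h and 12h"}},
    {"name": "loki_tenant:query_count:lookback_period",
     "labels": {"period": "between 12h and 1d"}},
]


def rules_named(rules: list, name: str) -> list:
    """Return every rule with the given name, in their original order."""
    return [rule for rule in rules if rule["name"] == name]


def distinguishing_labels(rules: list) -> list:
    """Label keys whose values differ across rules that share one name."""
    keys = set()
    for rule in rules:
        keys.update(rule.get("labels", {}))
    return sorted(
        key for key in keys
        if len({rule.get("labels", {}).get(key) for rule in rules}) > 1
    )


def pick_rule(rules: list, selected: Optional[dict] = None):
    """Return (rule, warning).

    `selected` holds the label filters the user has applied (hypothetical
    shape: {label: value}). Only distinguishing labels are compared, so an
    unrelated filter such as `cluster` does not exclude any rule.
    """
    selected = selected or {}
    keys = distinguishing_labels(rules)
    matches = [
        rule for rule in rules
        if all(rule.get("labels", {}).get(k) == selected[k]
               for k in keys if k in selected)
    ]
    if not matches:
        return None, None
    warning = None
    if len(matches) > 1:
        # PoC behaviour: still show logs for the first match, but tell the user.
        warning = (f"{len(matches)} recording rules share this name; "
                   f"filter by {', '.join(keys)} to narrow the logs")
    return matches[0], warning


same_name = rules_named(RULES, "loki_tenant:query_count:lookback_period")
rule, warning = pick_rule(same_name)
narrowed, no_warning = pick_rule(same_name, {"period": "between 3h and 12h"})
```

With no filter, `rule` is the "less than 3h" rule and `warning` is set; once a `period` filter is applied, exactly one rule matches and no warning is needed.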

zhehao-grafana commented 2 days ago

> But for the sake of PoC, we will only show the logs of the first matching recording rule.

What if we show all relevant logs of all the rules created? Are there any obvious limitations regarding this approach?

itsmylife commented 2 days ago

> But for the sake of PoC, we will only show the logs of the first matching recording rule.

> What if we show all relevant logs of all the rules created? Are there any obvious limitations regarding this approach?

We use the underlying query of the Loki recording rule to fetch the logs. To show all logs, we would need to send multiple queries in one request. I think Loki can handle that, but it needs to be checked. For the first phase, we think showing logs for the first matching rule is good enough. What do you think?
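
As an illustration of what "using the underlying query" could mean, here is a rough Python sketch that pulls the log query out of a recording rule's metric expression. This is an assumption about the approach, not the PoC's implementation: `extract_log_query` is a hypothetical helper, and it relies on naive text scanning (first `{` up to the first range such as `[2m]`) rather than a real LogQL parser, so it would break on queries whose string literals contain braces or bracketed durations.

```python
import re
from typing import Optional

# Naive match for a LogQL range such as [2m] or [1h30m].
RANGE_RE = re.compile(r"\[\s*\d+[a-zµ]+(?:\d+[a-zµ]+)*\s*\]")


def extract_log_query(rule_query: str) -> Optional[str]:
    """Return the stream selector plus pipeline of a metric query, or None.

    Takes everything from the first '{' up to the first range selector.
    """
    start = rule_query.find("{")
    if start == -1:
        return None
    match = RANGE_RE.search(rule_query, start)
    if match is None:
        return None
    return rule_query[start:match.start()].strip()


# Query of the first rule in the payload quoted earlier in this thread.
FIRST_RULE_QUERY = (
    'sum by (cluster,namespace,org_id)(count_over_time('
    '{container="query-frontend", namespace=~"loki.*"} '
    '|= "caller=metrics.go" |= "start_delta=" | logfmt '
    '| start_delta<=3h0m0s[2m]))'
)

log_query = extract_log_query(FIRST_RULE_QUERY)
```

For the first rule this yields the selector and pipeline without the `sum by (...)`, `count_over_time(...)` wrapper, or the `[2m]` range, which is the shape a logs query needs.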

NWRichmond commented 2 days ago

> But for the sake of PoC, we will only show the logs of the first matching recording rule.

> What if we show all relevant logs of all the rules created? Are there any obvious limitations regarding this approach?

Another consideration is that we currently limit log queries to the first 100 log lines, so even if we combine the queries, only the first 100 lines will be shown (and the first query alone could easily produce more than 100). We can increase this value, but 100 seemed like a reasonable place to start, as per @svennergr's suggestion here.
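
For reference, a small Python sketch of how a 100-line cap maps onto Loki's HTTP API. Grafana goes through its own datasource layer, so this only illustrates the `limit` parameter of `/loki/api/v1/query_range`; the helper name, base URL, and timestamps below are made up for the example.

```python
from urllib.parse import urlencode

LOG_LINE_LIMIT = 100  # the cap discussed in the comment above


def build_query_range_url(base_url: str, log_query: str,
                          start_ns: int, end_ns: int,
                          limit: int = LOG_LINE_LIMIT) -> str:
    """Build a Loki query_range URL that returns at most `limit` log lines."""
    params = {
        "query": log_query,
        "limit": limit,
        "start": start_ns,        # Unix epoch in nanoseconds
        "end": end_ns,
        "direction": "backward",  # newest lines first
    }
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


# Hypothetical host and time range, for illustration only.
url = build_query_range_url(
    "http://loki.example:3100",
    '{container="query-frontend"} |= "caller=metrics.go"',
    start_ns=1730306000000000000,
    end_ns=1730306868000000000,
)
```

Because the limit applies per query, combining several rules' queries would need either a larger shared cap or one capped request per rule.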

zhehao-grafana commented 2 days ago

> We use the underlying query that Loki recording rule has to fetch the logs.

Let's do this then, but we can consider giving users more information about which rule we end up selecting, such as certain filter details on the Loki side.