grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki

logql: Metric aggregations don't work on labels extracted by logfmt parser if using parameters #11334

Open wbh1 opened 10 months ago

wbh1 commented 10 months ago

Describe the bug
When using the logfmt parser with parameters (e.g. | logfmt status, method="request_method") in metric queries, the extracted fields show up correctly in the logs sample when aggregating by a different field, but they cannot be used for aggregations, and filtering on them is inconsistent.

For example:

sum by (method) (
  rate({component="example"} | logfmt method="request_method", status | __error__ = `` | status != `` [1m])
)

That query returns no data because it chokes on the status != `` filter.

If I remove the status != `` filter (or change it to something like status!="404"), the query runs but returns incorrect results with only one series ({method=""}):

sum by (method) (
  rate({component="example"} | logfmt method="request_method", status | __error__ = `` [1m])
)

However, if I remove the parameters and instead use label_format, the query will succeed with the correct results (even with the status filter still in place):

sum by (method) (
  rate({component="example"} | logfmt | label_format method=request_method | __error__ = `` | status != `` [1m])
)

To Reproduce
Steps to reproduce the behavior:

  1. Run a metrics query using logfmt to extract specific fields
  2. Attempt to sum by an extracted field
  3. Query returns no data

Expected behavior
I expect to be able to aggregate by labels extracted by the logfmt parser, even when the labels to extract are specified via parameters.
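
Concretely, I expect both query forms shown above to return the same per-method series; today only the label_format version does:

sum by (method) (
  rate({component="example"} | logfmt method="request_method", status | __error__ = `` | status != `` [1m])
)

sum by (method) (
  rate({component="example"} | logfmt | label_format method=request_method | __error__ = `` | status != `` [1m])
)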

Environment:

Screenshots, Promtail config, or terminal output: N/A

wbh1 commented 4 days ago

We are on Loki 2.9.10 now and this is still an issue. Not sure if it's solved in Loki v3.

Unfortunately, after moving to the TSDB index, we've observed significant performance penalties when using label_format in metric queries (we're still investigating how that's related). In many cases, queries using label_format take >20x longer to complete over log volumes of ~15-20GB.
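
For context, the label_format form we've been comparing against looks roughly like this (a sketch based on the alternative queries below, not the exact query we benchmarked):

sum by (instance, datacenter, cluster, logicalcluster, environment, method, status, rgw_status)(
  rate(
    {component="ceph", instance=~"myserver.+", path="/var/log/nginx/access.log"}
      | logfmt
      | label_format method=request_method, rgw_status=us_statuses
      | __error__="" [1m]
  )
)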


For anyone else coming to this issue: I've found two alternatives to using label_format that are roughly equivalent to each other in performance. Both complete in ~9-11s over ~15GB of logs with a [1m] range selector, although the line_format approach seems to be slightly faster on average (and is more readable).

Using line_format and another logfmt stage

This feels dirty, but you can use line_format to effectively rewrite the log line (still in logfmt format) with only the K/V pairs you want, then run that rewritten line through the logfmt parser again.

sum by (instance, datacenter, cluster, logicalcluster, environment, method, status, rgw_status)(
  rate(
    {component="ceph", instance=~"myserver.+", path="/var/log/nginx/access.log"}
      | logfmt status, request_method, us_statuses
      | __error__=""
      | line_format `status="{{ .status }}" method="{{ .request_method }}" rgw_status="{{ .us_statuses }}"`
      | logfmt [1m]
  )
)

Using label_replace

This feels less hacky, but is far less readable.

sum by (instance, datacenter, cluster, logicalcluster, environment, method, status, rgw_status)(
  label_replace(
    label_replace(
      rate(
        {component="ceph", instance=~"myserver.+", path="/var/log/nginx/access.log"}
          | logfmt status, request_method, us_statuses
          | __error__="" [1m]
      ),
      "method",
      "$1",
      "request_method",
      "(.*)"
    ),
    "rgw_status",
    "$1",
    "us_statuses",
    "(.*)"
  )
)