lxc / incus

Powerful system container and virtual machine manager
https://linuxcontainers.org/incus
Apache License 2.0

Grafana dashboard not displaying any logs in loki panels #761

Closed by candlerb 4 months ago

candlerb commented 4 months ago

Issue description

This is an issue with the grafana dashboard, https://grafana.com/grafana/dashboards/19727-incus/

Metrics display is working fine, but the loki panels at the bottom are empty. This is because the LogQL queries are wrong.

They have {app="incus",type="lifecycle",instance="$job"} for the first panel, and {app="incus",type="logging",instance="$job"} for the second.

However, the "instance" label in the logs doesn't contain the job name or the container name. Also, the $job variable in Grafana is the Prometheus scrape job name and has nothing to do with Loki logs.
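A toy illustration of why the shipped query selects nothing. The stream labels come from the logcli output below; "incus" stands in for a typical Prometheus $job value, and the matcher function is just a sketch of LogQL equality matching:

```python
# Toy equality matcher: why {app="incus", type="lifecycle", instance="$job"}
# selects nothing. Stream labels are taken from the logcli output below;
# "incus" is a stand-in for a typical Prometheus $job value.
stream = {"app": "incus", "type": "lifecycle", "instance": "none", "location": "none"}

def matches(stream, selector):
    # A missing label behaves like an empty string, as in Loki.
    return all(stream.get(k, "") == v for k, v in selector.items())

broken = {"app": "incus", "type": "lifecycle", "instance": "incus"}  # instance="$job"
fixed = {"app": "incus", "type": "lifecycle"}

print(matches(stream, broken))  # False: the instance label is "none", not the job name
print(matches(stream, fixed))   # True
```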

Here are some example logs:

# logcli query '{app="incus"}' --since=1h --limit=2000
2024/04/17 08:18:06 http://localhost:3100/loki/api/v1/query_range?direction=BACKWARD&end=1713341886833710643&limit=1000&query=%7Bapp%3D%22incus%22%7D&start=1713338286833710643
2024/04/17 08:18:06 Common labels: {app="incus", instance="none", location="none"}
2024-04-17T07:57:25Z {name="netbox4", project="default", type="lifecycle"} requester-username="root" action="instance-started" source="/1.0/instances/netbox4" requester-address="@" requester-protocol="unix" instance-started
2024-04-17T07:57:25Z {type="logging"}                                      context-action="start" context-created="2024-04-03 13:50:03.830202559 +0000 UTC" context-ephemeral="false" context-instance="netbox4" context-instanceType="container" context-project="default" context-stateful="false" context-used="2024-04-03 13:50:09.473557693 +0000 UTC" level="info" Started instance
2024-04-17T07:57:24Z {type="logging"}                                      context-action="start" context-created="2024-04-03 13:50:03.830202559 +0000 UTC" context-ephemeral="false" context-instance="netbox4" context-instanceType="container" context-project="default" context-stateful="false" context-used="2024-04-03 13:50:09.473557693 +0000 UTC" level="info" Starting instance
2024-04-17T07:57:23Z {name="netbox4", project="default", type="lifecycle"} requester-protocol="unix" requester-username="root" action="instance-shutdown" source="/1.0/instances/netbox4" requester-address="@" instance-shutdown
2024-04-17T07:57:20Z {type="logging"}                                      context-action="shutdown" context-created="2024-04-03 13:50:03.830202559 +0000 UTC" context-ephemeral="false" context-instance="netbox4" context-instanceType="container" context-project="default" context-timeout="10m0s" context-used="2024-04-03 13:50:09.473557693 +0000 UTC" level="info" Shutting down instance
2024-04-17T07:55:50Z {name="nfsen", project="default", type="lifecycle"}   source="/1.0/instances/nfsen" context-command="[su -l]" action="instance-exec" instance-exec
2024-04-17T07:41:45Z {type="logging"}                                      level="info" Done updating images
2024-04-17T07:41:45Z {type="logging"}                                      context-err="Failed getting remote image info: Failed getting image: The requested image couldn't be found" context-fingerprint="d9802562f513aa78eb45bfb68a47194b144da9a08639e6fcf8c54f9b261e3c65" level="error" Failed to update the image
2024-04-17T07:41:44Z {type="logging"}                                      context-err="Failed getting remote image info: Failed getting image: The requested image couldn't be found" context-fingerprint="c533845b5db1747674ee915cbb20df6eb47c953bb7caf1fec5b35ae9ccf98c18" level="error" Failed to update the image
2024-04-17T07:41:41Z {type="logging"}                                      level="info" Done pruning expired backups
2024-04-17T07:41:41Z {type="logging"}                                      level="info" Pruning expired backups
2024-04-17T07:41:41Z {type="logging"}                                      level="info" Updating images
2024-04-17T07:40:43Z {type="logging"}                                      level="info" Done updating images
2024-04-17T07:40:43Z {type="logging"}                                      context-err="Failed getting remote image info: Failed getting image: The requested image couldn't be found" context-fingerprint="d9802562f513aa78eb45bfb68a47194b144da9a08639e6fcf8c54f9b261e3c65" level="error" Failed to update the image
2024-04-17T07:40:42Z {type="logging"}                                      level="info" Done pruning expired backups
2024-04-17T07:40:42Z {type="logging"}                                      level="info" Updating images
2024-04-17T07:40:42Z {type="logging"}                                      level="info" Pruning expired backups
2024-04-17T07:35:46Z {type="logging"}                                      level="info" Done pruning expired backups
2024-04-17T07:35:46Z {type="logging"}                                      level="info" Done updating images
2024-04-17T07:35:46Z {type="logging"}                                      level="info" Updating images
2024-04-17T07:35:46Z {type="logging"}                                      level="info" Pruning expired backups
2024/04/17 08:18:06 http://localhost:3100/loki/api/v1/query_range?direction=BACKWARD&end=1713339346929598024&limit=1000&query=%7Bapp%3D%22incus%22%7D&start=1713338286833710643
2024/04/17 08:18:06 Common labels: {app="incus", instance="none", location="none", type="logging"}
#
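The debug URLs in logcli's output are just the /loki/api/v1/query_range endpoint with the selector percent-encoded; a minimal sketch of building one yourself (host and nanosecond timestamps are placeholders taken from the output above):

```python
from urllib.parse import urlencode

# Build a Loki query_range URL the way logcli's debug output shows it.
# localhost:3100 and the nanosecond timestamps are placeholders.
selector = '{app="incus"}'
params = urlencode({
    "direction": "BACKWARD",
    "end": 1713341886833710643,
    "limit": 1000,
    "query": selector,   # urlencode percent-encodes the LogQL selector
    "start": 1713338286833710643,
})
url = f"http://localhost:3100/loki/api/v1/query_range?{params}"
print(url)
```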

Notice how some logs have the name of the container as a label (name="netbox4"), but some other logs relating to this container don't. They may have it buried in the logfmt data though, e.g. context-instance="nfsen" or source="/1.0/instances/nfsen". If you filter logs by container, you still want to see those logs.

At its simplest, I propose that the queries change to:

# option 1 (simple)
{app="incus", type="lifecycle", name=~"|$name", project=~"|$project"}
{app="incus", type="logging", name=~"|$name", project=~"|$project"}

The vertical bar inside the regexp is because the "name" and "project" labels may be missing (even for logs specific to one container), so we must allow through lines where this label is missing.
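The empty alternative can be checked with an ordinary regex engine, since Loki anchors label matchers as full matches and treats a missing label as the empty string (netbox4 here is just a sample name):

```python
import re

# Loki evaluates name=~"|netbox4" as a full (anchored) match, and a missing
# label as the empty string — so the leading "|" lets label-less streams through.
pattern = re.compile(r"|netbox4")

assert pattern.fullmatch("") is not None         # label missing -> matches
assert pattern.fullmatch("netbox4") is not None  # selected container -> matches
assert pattern.fullmatch("nfsen") is None        # other container -> filtered out
```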

However, that will also show logs for other containers, when those logs have no name or project label. A bit of additional filtering can ensure that the container name appears somewhere in the log line:

# option 2 (more selective)
{app="incus", type="lifecycle", name=~"|$name", project=~"|$project"} |~ "$name"
{app="incus", type="logging", name=~"|$name", project=~"|$project"} |~ "$name"

This now works as expected:

[screenshot]

[screenshot]

However, if one container name is a prefix of another container name, or two projects have containers with the same name, it may show some logs for another container. A more sophisticated filter is possible:

# option 3 (complex)
{app="incus", type="lifecycle", name=~"|$name", project=~"|$project"}  | logfmt | context_instance=~"|$name" | context_project=~"|$project"
{app="incus", type="logging", name=~"|$name", project=~"|$project"}  | logfmt | context_instance=~"|$name"  | context_project=~"|$project"

This assumes that every log relating to container X either has label name="X" or the log message contains context_instance="X". (Note that hyphens in logfmt attributes are converted to underscores, so that they become valid LogQL label names)

This is true for the logs shown above. In fact, in these examples the lifecycle logs all have name="X",project="Y" and the event logs have context_instance="X",context_project="Y", so the queries can simplify to:

# option 4 (final)
{app="incus", type="lifecycle", name=~"|$name", project=~"|$project"}
{app="incus", type="logging"}  | logfmt | context_instance=~"|$name" | context_project=~"|$project"

I've tried this and it works for me. However, I'm not sure whether those conditions hold for all possible incus logs, and it could be argued that this hard-codes too much knowledge of the log attributes.
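The hyphen-to-underscore rewriting noted above can be sketched like this; it is a rough imitation of the logfmt parser's label sanitisation, not Loki's actual code:

```python
import re

# Minimal logfmt-style key extraction; characters that aren't valid in a
# LogQL label name (like "-") are replaced with "_", which is how
# context-instance becomes queryable as context_instance.
def label_name(key: str) -> str:
    return re.sub(r"[^a-zA-Z0-9_]", "_", key)

line = 'context-instance="netbox4" context-project="default" level="info"'
labels = {label_name(k): v for k, v in re.findall(r'([\w-]+)="([^"]*)"', line)}
print(labels["context_instance"])  # netbox4
```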

Steps to reproduce

  1. Install loki
  2. On incus host, turn on Loki logging: incus config set loki.api.url=http://loki.example.net:3100
  3. Start and stop a container, check it creates some logs in Loki
  4. Install grafana, add Loki as data source. Use "Explore" to browse Loki logs and check you can see the incus logs.
  5. Install the incus grafana dashboard, 19727
  6. Open the incus dashboard, find that no Loki logs are visible
  7. Edit panels and adjust the queries to those given above; logs appear
  8. Try changing the selected project and/or container, and check that logs are filtered appropriately.

Cross-reference

Issue appears to be inherited from lxd dashboard, raised previously: https://github.com/canonical/lxd/issues/13165

stgraber commented 4 months ago

You need to set loki.instance to match the name of your prometheus job as that's what we get through the dropdown at the top of the dashboard.

candlerb commented 4 months ago

But if you do that, you lose visibility of which host each log message came from. If your Prometheus job for scraping metrics is called "incus", then you have to set loki.instance="incus" on all nodes, and all logs will just say instance="incus". That isn't very useful.

Whereas for metrics, instance="XXX" tells you which host was scraped, and therefore which host the container is running on. According to the documentation, this is true even for clusters:

In a cluster environment, Incus returns only the values for instances running on the server that is being accessed. Therefore, you must scrape each cluster member separately.

It would be strange if metrics had instance="nuc1", instance="nuc2", instance="nuc3" but logs all had instance="incus".

stgraber commented 4 months ago

You don't; the location field still tells you where things are.

stgraber commented 4 months ago

[screenshot]

candlerb commented 4 months ago

I get location="none" in my loki logs (you can see it in the examples above), and I don't see a way to override it. Unless you mean I should use loki.labels to set it?

I also get instance="none" when loki.instance is unset (reported separately at #762). Documentation does say it should default to the hostname, which implies it should work like a prometheus instance label.

stgraber commented 4 months ago

Yeah, those two fields seem wrong in the non-clustered case: both should default to your hostname, and instance should be overridable through loki.instance.

stgraber commented 4 months ago

From your suggested changes above, I'll be taking the project=~"|$project" part, as that makes sense and works well.

The $name part doesn't work because when you select All, it passes a regexp of all individual instance names through the filter rather than an empty string. In my case that means passing several thousand instance names, which seriously impacts Loki, and it also means that any log or lifecycle event which doesn't have a name field set won't match.
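The failure mode with All can be sketched with an ordinary regex engine, assuming Grafana expands the variable to a pipe-joined list of every instance name (the names here are the examples from above):

```python
import re

# If Grafana expands "All" to "name1|name2|...", streams with no name
# label (empty string) no longer match — unlike the explicit leading "|".
all_value = "netbox4|nfsen"   # hypothetical expansion of $name for "All"

assert re.fullmatch(all_value, "netbox4") is not None  # named streams still match
assert re.fullmatch(all_value, "") is None             # label-less streams are dropped
assert re.fullmatch("|" + all_value, "") is not None   # an explicit "|" would allow them
```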

Also, name in a Loki entry doesn't necessarily mean an instance name; if the event is network- or storage-related, it may refer to a network or a storage pool. So I think we need to stay away from filtering based on instances for now.

candlerb commented 4 months ago

The $name part doesn't work because when you select All, it passes a regexp of all individual instance names through the filter rather than an empty string. In my case that means passing several thousand instance names, which seriously impacts Loki, and it also means that any log or lifecycle event which doesn't have a name field set won't match.

Oops, I had hoped it would do something sensible like empty string or ".*".

However, apparently there is a "custom all value", which I found via here and here.
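For reference, a sketch of the relevant fragment of a dashboard variable definition (field names assume Grafana's templating JSON schema); setting allValue makes All expand to .* instead of a pipe-joined list:

```json
{
  "name": "name",
  "type": "query",
  "includeAll": true,
  "allValue": ".*"
}
```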

Also, name in a Loki entry doesn't necessarily mean an instance name; if the event is network- or storage-related, it may refer to a network or a storage pool. So I think we need to stay away from filtering based on instances for now.

OK, fair enough - although if "All" did match .* then it would be OK.