Closed by amseager 1 year ago.
@thampiotr is this something you'd be able to look into?
Sure, will take a look!
Hi @amseager - I've tried to reproduce this issue, but so far no luck. I can still try a few options a bit later.
In the meantime, could you share the configuration that you are using for this log file (whether it's .yaml in static mode or .river in flow mode)?
Also, could you share the metrics that the agent reports (e.g. curl localhost:12345/metrics) if that's possible? I'm particularly interested in metrics in the promtail_ or loki_source_file_ namespaces.
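In case it helps, here is a minimal sketch of one way to pull out just those namespaces; it assumes the agent's HTTP server is on the default localhost:12345 (a plain curl piped through grep works just as well):

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// Assumes the agent's HTTP server is listening on the default localhost:12345.
	resp, err := http.Get("http://localhost:12345/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Print only samples from the promtail_ / loki_source_file_ namespaces
	// (comment lines such as "# HELP ..." are skipped by the prefix check).
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "promtail_") || strings.HasPrefix(line, "loki_source_file_") {
			fmt.Println(line)
		}
	}
	if err := sc.Err(); err != nil {
		panic(err)
	}
}
```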
@thampiotr Hi, sure.
Loki part in grafana-agent config:
```yaml
logs:
  configs:
    - name: default
      clients:
        - url: ${LOKI_URL}/loki/api/v1/push
      positions:
        filename: /tmp/agent/positions.yaml
      scrape_configs:
        - job_name: default
          static_configs:
            - targets: [localhost]
              labels:
                job: default
                __path__: /tmp/logs/*/*
          pipeline_stages:
            - regex:
                expression: '^\d+ \[.+?\] \[(?P<application_name>.+?),.+?\] .*? \[.+?\] (?P<log_level>\w+?) .*$'
            - labels:
                application_name:
                log_level:
```
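As a side note, the regex stage can be sanity-checked outside the agent. The sketch below runs the same expression in Go against a made-up log line; the real log format isn't shown in this issue, so the sample line is only an assumption shaped to match the expression:

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Same expression as in the pipeline_stages above.
	re := regexp.MustCompile(`^\d+ \[.+?\] \[(?P<application_name>.+?),.+?\] .*? \[.+?\] (?P<log_level>\w+?) .*$`)

	// Hypothetical log line; the actual format may differ.
	line := "1684507782 [main] [A-service,abc123] 2023-05-19 12:00:01 [http-nio-8080-exec-1] INFO c.example.Foo - request handled"

	match := re.FindStringSubmatch(line)
	if match == nil {
		fmt.Println("no match")
		return
	}
	// Print the named capture groups that become labels.
	for i, name := range re.SubexpNames() {
		if name != "" {
			fmt.Printf("%s=%s\n", name, match[i])
		}
	}
	// Output: application_name=A-service and log_level=INFO.
}
```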
Agent metrics:
I masked the names of the services (sorry for that, it's from a live environment) but left the actual metric values.
The problem I described mostly affects what's called "A-service" there, which may matter: this service generates far more logs than the others and rotates its log files every 10-15 min. However, other services can hit the same problem sometimes.
Log rotation is done by the Logback library's RollingFileAppender (with SizeAndTimeBasedRollingPolicy, which rotates on either file size or time). The rotated logs are moved to a separate folder in tar.gz format.
Please let me know if you need anything else.
Thanks @amseager, this was very helpful. A few follow-up questions. Could you also share the Agent's own logs (the lines with component=logs)? Specifically for the period just before and right after the "skipping update of position..." messages; this should give me some information about what filesystem operations the Agent sees (if any). @amseager, one more request if possible: can you include a goroutine dump of the agent? It can be obtained via http://localhost:12345/debug/pprof/goroutine?debug=1.
@thampiotr please check the logs covering roughly 10 minutes: https://gist.github.com/amseager/621ec009f4d239d205ae5e19e148ad7d
Goroutine dump: https://gist.github.com/amseager/e37f31984facd9cbc51ce0f2153d6864
1 - They can be for the same path and for different ones at the same time, e.g. 10 lines for A-service, 1 for B-service, and so on within a single "10-second iteration".
2 - Apart from unrelated things, I sometimes see errors like the one below; maybe this is related:
ts=2023-05-19T14:49:42.756314281Z caller=tailer.go:159 level=info component=logs logs_config=default component=tailer msg="tail routine: tail channel closed, stopping tailer" path=/tmp/logs/A-service/A-service-1 reason="Error reading /tmp/logs/A-service/A-service-1: read /tmp/logs/A-service/A-service-1: stale NFS file handle"
3 - Correct: if I restart the agent, it works well for some time, then a small number of these errors starts to appear (maybe after the first rotations, it's hard to tell), and then there are more and more of them. The logs above are taken from an agent that has been running for ~3 days. Although there are tons of these messages within a short period of time, I still don't see any issues like lost logs in Loki or a lack of resources.
Thanks @amseager for sharing all the info. I found a goroutine leak and it is likely related to the "stale NFS file handle" error you saw. I still need to figure out exactly where the fix is needed, but we're getting closer.
While at it, I'll be making sure we port https://github.com/grafana/tail/pull/16 too.
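For readers following along, this is roughly the shape of the problem; the snippet below is a deliberately simplified sketch of the leak pattern, not the agent's or grafana/tail's actual code. A reader goroutine dies on an I/O error (such as the stale NFS file handle) without signalling its consumer, so the consumer blocks forever and every rotation that hits the error leaks another goroutine:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// startTail illustrates the leak pattern: the read loop exits on an error
// without closing the lines channel, so the consumer goroutine blocks on the
// receive forever and is never cleaned up.
func startTail(path string) {
	lines := make(chan string)

	// Reader: pretend we hit a read error shortly after starting.
	go func() {
		time.Sleep(10 * time.Millisecond)
		// Reading path fails here with something like "stale NFS file handle".
		// BUG: we return without close(lines), leaving the consumer stuck.
	}()

	// Consumer: ranges over the channel; without a close it never exits.
	go func() {
		for line := range lines {
			fmt.Println(path, line)
		}
	}()
}

func main() {
	// Every "rotation" that ends in an unpropagated error leaks one goroutine.
	for i := 0; i < 100; i++ {
		startTail(fmt.Sprintf("/tmp/logs/app/app-%d.log", i))
	}
	time.Sleep(time.Second)
	// The leaked consumers show up in the goroutine dump and in go_goroutines.
	fmt.Println("goroutines still running:", runtime.NumGoroutine())
}
```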
Reopening, since there are still affected code paths in the agent and PRs to update the dependencies are needed.
This is currently blocked by #3660.
Some more context: in order to fix this in agent static mode, we'll need to update some promtail dependencies from loki. When doing so, we get dependency conflicts with the OTel Collector, so the update of the OTel Collector should happen first.
Thanks for the update. Do you know whether it's safe to ignore the problem, or whether any actions/workarounds are needed before the fix is released? For now, I have reduced the number of logs for some apps, and I also try to restart the agent every few days.
This is caused by a goroutine leak, which can result in excessive, unnecessary logging (as you have noticed), growing memory usage of the agent over time (though it shouldn't be too fast) and some additional context switching. I'd say this is safe to ignore until the symptoms get too bad, at which point a restart is a good remedy.
Apologies for the hassle caused by this and thanks for reporting the problem!
This is now fixed in main with https://github.com/grafana/agent/pull/3942, and https://github.com/grafana/agent/pull/4565 adds the release note. It should be released in v0.36.0; it was blocked by upgrading our Prometheus & OTel dependencies, hence the delay.
Hello, I am facing the same issue with grafana agent v0.36.2 as well. The Grafana Agent Operator version is v0.34.1, and I am running the agent on Kubernetes (AKS).
I'm getting the logs below:
ts=2024-01-11T02:39:05.911958356Z caller=filetargetmanager.go:181 level=info component=logs logs_config=random-shared/kubernetes-pod-logs msg="received file watcher event" name=/var/log/pods/linkerd_linkerd-destination-5d948754fc-tc966_88b95f24-dc55-4fce-8284-54152c112051/destination/6736.log op=CREATE
ts=2024-01-11T02:39:05.912455361Z caller=tailer.go:145 level=info component=logs logs_config=random-shared/kubernetes-pod-logs component=tailer msg="tail routine: started" path=/var/log/pods/linkerd_linkerd-destination-5d948754fc-tc966_88b95f24-dc55-4fce-8284-54152c112051/destination/6736.log
ts=2024-01-11T02:39:05.912477961Z caller=log.go:168 component=logs logs_config=random-shared/kubernetes-pod-logs level=info msg="Seeked /var/log/pods/linkerd_linkerd-destination-5d948754fc-tc966_88b95f24-dc55-4fce-8284-54152c112051/destination/6736.log - &{Offset:0 Whence:0}"
ts=2024-01-11T02:39:06.93727296Z caller=log.go:168 component=logs logs_config=random-shared/kubernetes-pod-logs level=info msg="Re-opening moved/deleted file /var/log/pods/linkerd_linkerd-destination-5d948754fc-tc966_88b95f24-dc55-4fce-8284-54152c112051/sp-validator/6730.log ..."
ts=2024-01-11T02:39:06.937438161Z caller=log.go:168 component=logs logs_config=random-shared/kubernetes-pod-logs level=info msg="Waiting for /var/log/pods/linkerd_linkerd-destination-5d948754fc-tc966_88b95f24-dc55-4fce-8284-54152c112051/sp-validator/6730.log to appear..."
ts=2024-01-11T02:39:08.979770282Z caller=tailer.go:206 level=info component=logs logs_config=random-shared/kubernetes-pod-logs component=tailer msg="skipping update of position for a file which does not currently exist" path=/var/log/pods/linkerd_linkerd-destination-5d948754fc-tc966_88b95f24-dc55-4fce-8284-54152c112051/sp-validator/6730.log
ts=2024-01-11T02:39:10.135688578Z caller=tailer.go:206 level=info component=logs logs_config=random-shared/kubernetes-pod-logs component=tailer msg="skipping update of position for a file which does not currently exist" path=/var/log/pods/linkerd_linkerd-destination-5d948754fc-tc966_88b95f24-dc55-4fce-8284-54152c112051/sp-validator/6730.log
ts=2024-01-11T02:39:10.135755578Z caller=tailer.go:163 level=info component=logs logs_config=random-shared/kubernetes-pod-logs component=tailer msg="tail routine: tail channel closed, stopping tailer" path=/var/log/pods/linkerd_linkerd-destination-5d948754fc-tc966_88b95f24-dc55-4fce-8284-54152c112051/sp-validator/6730.log reason=null
ts=2024-01-11T02:39:10.135789979Z caller=tailer.go:154 level=info component=logs logs_config=random-shared/kubernetes-pod-logs component=tailer msg="tail routine: exited" path=/var/log/pods/linkerd_linkerd-destination-5d948754fc-tc966_88b95f24-dc55-4fce-8284-54152c112051/sp-validator/6730.log
ts=2024-01-11T02:39:10.135809179Z caller=tailer.go:118 level=info component=logs logs_config=random-shared/kubernetes-pod-logs component=tailer msg="position timer: exited" path=/var/log/pods/linkerd_linkerd-destination-5d948754fc-tc966_88b95f24-dc55-4fce-8284-54152c112051/sp-validator/6730.log
ts=2024-01-11T02:39:10.135822179Z caller=tailer.go:242 level=info component=logs logs_config=random-shared/kubernetes-pod-logs component=tailer msg="stopped tailing file" path=/var/log/pods/linkerd_linkerd-destination-5d948754fc-tc966_88b95f24-dc55-4fce-8284-54152c112051/sp-validator/6730.log
Should I update the operator as well?
Facing the same issue in grafana/agent v0.39.0 as well after upgrading. @thampiotr any help?
Same issue here on grafana/agent:v0.39.1.
Hey @tpaschalis any help?
Hey folks, we don't monitor closed issues or PRs. I happened to notice this in my notifications list, but it's not a reliable way of getting attention.
If you need help for something which is closed and want to make sure it gets seen, please open a new issue or PR.
What's wrong?
I have an agent+loki+grafana setup, and the agent is configured to tail the log files of my apps and push them to Loki. If you start all of this, after some time this message starts appearing in the agent logs (roughly every 10 seconds):
Eventually you get more and more of these errors at once, so it becomes almost impossible to read the agent logs.
I found some issues that seem to be related: https://github.com/grafana/loki/issues/3108 https://github.com/grafana/loki/issues/3985
My services do indeed roll their log files after reaching the max file size, although I don't see any problems with "losing" logs or anything like that (Loki stores everything in its storage). I'm also not sure about performance problems; maybe I'll face them later.
Is it critical? I see that there was this PR, maybe it needs to be merged here as well?
Steps to reproduce
I'm not 100% sure that having log retention is necessary to reproduce it though.
System information
No response
Software version
0.31.3
Configuration
No response
Logs