Open · julien-sugg opened 2 years ago
Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.
We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.
Stalebots are also emotionless and cruel and can close issues which are still very relevant.
If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.
We regularly sort for closed issues which have a stale label, sorted by thumbs-up.

We may also:

- Mark issues as revivable if we think it's a valid issue but isn't something we are likely to prioritize in the future (the issue will still remain closed).
- Add a keepalive label to silence the stalebot if the issue is very common/popular/important.

We are doing our best to respond, organize, and prioritize all issues, but it can be a challenging task; our sincere apologies if you find yourself at the mercy of the stalebot.
up
Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.
We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.
Stalebots are also emotionless and cruel and can close issues which are still very relevant.
If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.
We regularly sort for closed issues which have a stale label, sorted by thumbs-up.

We may also:

- Mark issues as revivable if we think it's a valid issue but isn't something we are likely to prioritize in the future (the issue will still remain closed).
- Add a keepalive label to silence the stalebot if the issue is very common/popular/important.

We are doing our best to respond, organize, and prioritize all issues, but it can be a challenging task; our sincere apologies if you find yourself at the mercy of the stalebot.
@julien-sugg is this still an issue for you?
I'm experiencing the same issue. Any progress on it?
```
promtail_1 | level=error ts=2023-11-14T01:14:26.420779961Z caller=target.go:111 msg="failed to receive pubsub messages" error="rpc error: code = InvalidArgument desc = Some acknowledgement ids in the request were invalid. This could be because the acknowledgement ids have expired or the acknowledgement ids were malformed.\nerror details: name = ErrorInfo reason = EXACTLY_ONCE_ACKID_FAILURE domain = pubsub.googleapis.com metadata = map[xxxxxxxx:PERMANENT_FAILURE_INVALID_ACK_ID xxxxxxxxxxxxxxx:PERMANENT_FAILURE_INVALID_ACK_ID xxxxxxxxxxxx:PERMANENT_FAILURE_INVALID_ACK_ID xxxxxxxxxxx:PERMANENT_FAILURE_INVALID_ACK_ID xxxxxxxxxx:PERMANENT_FAILURE_INVALID_ACK_ID xxxxxxxxxxxxxxx:PERMANENT_FAILURE_INVALID_ACK_ID]"
```
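The `EXACTLY_ONCE_ACKID_FAILURE` reason suggests the Pub/Sub subscription has exactly-once delivery enabled, in which case expired ack IDs are rejected permanently. One thing worth checking (an assumption on my side, not something confirmed in this thread) is whether turning exactly-once delivery off makes the target recover; both commands below are standard gcloud:

```bash
# Inspect the subscription; look for enableExactlyOnceDelivery: true
gcloud pubsub subscriptions describe grafana-subscription

# Hypothetical workaround: disable exactly-once delivery so expired ack IDs
# are tolerated again (assumes at-least-once delivery is acceptable here).
gcloud pubsub subscriptions update grafana-subscription \
  --no-enable-exactly-once-delivery
```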
I noticed that after the first GCP error, Promtail stops sending any logs. I have to restart it each day.
It seems that Promtail stops sending all logs after hitting the first GCP error. Here is my config:
```yaml
server:
  disable: true

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://main:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          host: xx.statuspage
          promtail: xx.statuspage
          __path__: /var/log/*log
  - job_name: gcplog
    gcplog:
      project_id: xxxxxxxx
      subscription: grafana-subscription
      use_incoming_timestamp: false # default; rewrite timestamps at ingestion
      labels:
        job: gcplog
    relabel_configs:
      - source_labels: ['__gcp_resource_type']
        target_label: 'resource_type'
      - source_labels: ['__gcp_resource_labels_project_id']
        target_label: 'project'
```
Initially I thought that only GCP logs were affected, but after the first GCP error I see no logs from this node at all.
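Since the failure is silent (Promtail keeps running, just stops shipping), an external probe against Loki can at least detect it. A minimal sketch, assuming Loki is reachable at the push URL from the config above, `jq` is installed, and the `job="gcplog"` label is attached as configured; the 10-minute window is arbitrary:

```bash
#!/bin/bash
# Exit non-zero when no gcplog entries reached Loki in the last 10 minutes,
# so a cron job or alerting hook can react.
set -o nounset -o errexit -o pipefail

LOKI_URL="${LOKI_URL:-http://main:3100}"

# Instant LogQL query: count gcplog entries over the last 10 minutes.
count=$(curl -s -G "$LOKI_URL/loki/api/v1/query" \
  --data-urlencode 'query=count_over_time({job="gcplog"}[10m])' |
  jq '[.data.result[].value[1] | tonumber] | add // 0')

if [ "$count" -eq 0 ]; then
  echo "no gcplog entries in the last 10 minutes" >&2
  exit 1
fi
```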
@julien-sugg is this still an issue for you?
Hi, my apologies, I totally forgot about this pending topic.
I am not working on GCP anymore, so I cannot provide you any update.
My current workaround is to add a crontab job: * * * * * /.../bin/restart-if-gcp-failed
```bash
#!/bin/bash
# http://www.gnu.org/software/bash/manual/bash.html#The-Set-Builtin
# http://redsymbol.net/articles/unofficial-bash-strict-mode/
set -o nounset -o errexit -o pipefail

# Resolve this script's directory and move to its parent,
# where the docker-compose project lives.
script=$(realpath "$0")
scriptdir=$(dirname "$script")
cd "$scriptdir/.."

# If any error appears in the compose logs, restart the stack
# and record the restart time.
if docker-compose logs | grep -q error; then
  docker-compose restart
  date >> restart
fi
```
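One caveat: grepping for the bare string "error" restarts the whole stack on any error from any container. A more targeted variant might look like the sketch below (my assumptions: the compose service is named promtail, per the `promtail_1 |` log prefix above, and a 1000-line tail is enough to catch the failure):

```bash
# Only restart Promtail, and only when the Pub/Sub ack failure
# shows up in its recent logs.
if docker-compose logs --tail=1000 promtail | grep -q 'EXACTLY_ONCE_ACKID_FAILURE'; then
  docker-compose restart promtail
  date >> restart
fi
```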
Greetings,
Describe the bug
GCP logs forwarding appears to randomly hang and doesn't perform any retry.

To Reproduce
I don't have a deterministic way to reproduce the issue yet; however, I'll try to provide as many details as possible.

Promtail is running in a 3-node test cluster using the Grafana Promtail Helm chart, and it suddenly stopped working 7 days ago (only the GCP logs stopped flowing; the file-based logs for containers running within the nodes were still successfully retrieved). I was able to "reproduce" it when I restarted all the Pods within the DaemonSet, and it occurred again for one of them; however, it may also be related to rate limitations (not sure).
Before the DaemonSet pods restart (screenshot): starting from this date, the number of nacked messages continuously increased.

After the DaemonSet pods restart (screenshot):
The whole logs pipeline (Sink, Pub/Sub Topic + Subscription) was still operational, and ~8M messages stacked up in the meantime. I can provide any additional info/logs if required.
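As a side note (my own suggestion, not something discussed in this thread): if such a backlog is no longer worth replaying once the consumer is healthy again, the subscription can be seeked to the current time, which acknowledges everything already published. This is irreversible; the skipped messages are lost:

```bash
# Drain the backlog by seeking to "now"; all queued messages are
# acknowledged and will never be delivered.
gcloud pubsub subscriptions seek grafana-subscription \
  --time="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
```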
Expected behavior
The Promtail gcplog log forwarding should operate continuously and consume messages whenever any are available.

Environment:

Screenshots, Promtail config, or terminal output
If applicable, add any output to help explain your problem.
Promtail extra scrape config:
Thanks for your help