grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

[promtail/bug?] gcplog subscription seems to hang/close without restart #4935

Open julien-sugg opened 2 years ago

julien-sugg commented 2 years ago

Greetings,

Describe the bug: GCP log forwarding appears to randomly hang and does not perform any retry.

To Reproduce: I don't have a deterministic way to reproduce the issue yet; however, I'll try to provide as many details as possible.

Promtail is running in a 3-node test cluster using the Grafana Promtail Helm chart, and it suddenly stopped working 7 days ago (only the GCP logs stopped flowing; the file-based logs for containers running on the nodes were still retrieved successfully). I was able to "reproduce" it when I restarted all the pods in the DaemonSet, and it occurred again for one of them; however, it may also be related to the rate limits (not sure).
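(For reference, restarting the DaemonSet pods as described above can be done with kubectl rollout restart; the DaemonSet name "promtail" and the namespace "monitoring" below are assumptions, adjust them to your Helm release.)

# Restart all Promtail pods managed by the DaemonSet (name and namespace are assumptions)
kubectl -n monitoring rollout restart daemonset/promtail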

Before the DaemonSet pods restart:

[screenshot]

level=error ts=2021-12-09T13:12:03.158782097Z caller=target.go:112 error="rpc error: code = Unknown desc = closing transport due to: connection error: desc = \"error reading from server: EOF\", received prior goaway: code: NO_ERROR, debug data: max_age"
level=error ts=2021-12-09T16:23:48.869152898Z caller=filetarget.go:265 msg=" failed to remove directory from watcher" error="can't remove non-existent inotify watch for: /var/log/pods/redacted"
level=error ts=2021-12-09T16:23:48.870090385Z caller=filetarget.go:265 msg=" failed to remove directory from watcher" error="can't remove non-existent inotify watch for: /var/log/pods/redacted"
level=error ts=2021-12-09T16:23:49.154940857Z caller=filetarget.go:265 msg=" failed to remove directory from watcher" error="can't remove non-existent inotify watch for: /var/log/pods/redacted"
level=error ts=2021-12-09T16:33:36.171492947Z caller=target.go:112 error="rpc error: code = Unknown desc = closing transport due to: connection error: desc = \"error reading from server: EOF\", received prior goaway: code: NO_ERROR, debug data: server_shutting_down"

Note that I trimmed the logs a bit to exclude entries that aren't relevant here.

Starting from this date, the number of nacked messages continuously increased:

[screenshot]

After DaemonSet pods restart:

[screenshot]

level=warn ts=2021-12-15T14:24:42.153553384Z caller=client.go:344 component=client host=loki.monitoring.svc:3100 msg="error sending batch, will retry" status=429 error="server returned HTTP status 429 Too Many Requests (429): Ingestion rate limit exceeded (limit: 4194304 bytes/sec) while attempting to ingest '437' lines totaling '997485' bytes, reduce log volume or contact your Loki administrator to see if the limit can be increased"
level=error ts=2021-12-15T14:36:19.171612121Z caller=target.go:112 error="rpc error: code = Unknown desc = closing transport due to: connection error: desc = \"error reading from server: read tcp 192.168.4.218:48210->173.194.76.95:443: read: connection reset by peer\", received prior goaway: code: NO_ERROR, debug data: max_age"
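(The 429s above come from Loki's per-tenant ingestion rate limit rather than from gcplog itself; 4194304 bytes/sec corresponds to the default ingestion_rate_mb: 4 in limits_config. As a rough sketch, the limit actually in effect can be checked against Loki's running configuration, assuming the /config endpoint is reachable from where you run this:)

# Dump Loki's running configuration and look at the per-tenant ingestion limits
curl -s http://loki.monitoring.svc:3100/config | grep -E 'ingestion_rate_mb|ingestion_burst_size_mb'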

Note that it had been running non-stop since September 2021.

The whole log pipeline (sink, Pub/Sub topic + subscription) was still operational, and ~8M messages piled up in the meantime. I can provide additional info/logs if required.
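(One way to confirm that the backlog is sitting in the subscription and is still deliverable is to pull a few messages by hand without acking them; a minimal sketch, assuming the gcloud CLI is configured for the project, with a placeholder subscription name:)

# Pull a few messages without acking them, just to confirm the subscription still delivers
gcloud pubsub subscriptions pull <subscription-name> --limit=3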

Expected behavior: The Promtail gcplog forwarding should operate continuously and consume messages whenever any are available.

Environment:

Screenshots, Promtail config, or terminal output:

Promtail extra scrape config:

...
  extraScrapeConfigs: |
      - job_name: gcp-logs
        gcplog:
          project_id: {{ .Values.promtail_google_project_id | quote }}
          subscription: {{ .Values.promtail_google_pubsub_subscription | quote }}
          # @see https://github.com/grafana/loki/issues/3492
          use_incoming_timestamp: false
          labels:
            job: "gcplog"
            # @see https://github.com/grafana/loki/issues/3492
            promtail_instance: '${POD_NAME}'
        relabel_configs:
          - source_labels: ['__project_id']
            target_label: 'project'
...

Thanks for your help

stale[bot] commented 2 years ago

Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly sort for closed issues which have a stale label sorted by thumbs up.

We may also:

We are doing our best to respond, organize, and prioritize all issues but it can be a challenging task, our sincere apologies if you find yourself at the mercy of the stalebot.

julien-sugg commented 2 years ago

up

stale[bot] commented 2 years ago

Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly sort for closed issues which have a stale label sorted by thumbs up.

We may also:

We are doing our best to respond, organize, and prioritize all issues but it can be a challenging task, our sincere apologies if you find yourself at the mercy of the stalebot.

cstyan commented 10 months ago

@julien-sugg is this still an issue for you?

vbarbarosh commented 10 months ago

I'm experiencing the same issue. Any progress on it?

promtail_1  | level=error ts=2023-11-14T01:14:26.420779961Z caller=target.go:111 msg="failed to receive pubsub messages" error="rpc error: code = InvalidArgument desc = Some acknowledgement ids in the request were invalid. This could be because the acknowledgement ids have expired or the acknowledgement ids were malformed.\nerror details: name = ErrorInfo reason = EXACTLY_ONCE_ACKID_FAILURE domain = pubsub.googleapis.com metadata = map[xxxxxxxx:PERMANENT_FAILURE_INVALID_ACK_ID xxxxxxxxxxxxxxx:PERMANENT_FAILURE_INVALID_ACK_ID xxxxxxxxxxxx:PERMANENT_FAILURE_INVALID_ACK_ID xxxxxxxxxxx:PERMANENT_FAILURE_INVALID_ACK_ID xxxxxxxxxx:PERMANENT_FAILURE_INVALID_ACK_ID xxxxxxxxxxxxxxx:PERMANENT_FAILURE_INVALID_ACK_ID]"

I noticed that after the first GCP error, Promtail stops sending any logs. I have to restart it each day.
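(The EXACTLY_ONCE_ACKID_FAILURE reason in the error above suggests exactly-once delivery is enabled on the subscription, which turns expired ack IDs into hard failures. A sketch, assuming the gcloud CLI and that exactly-once delivery isn't required for this pipeline; <subscription-name> is a placeholder:)

# Check whether exactly-once delivery is enabled on the subscription
gcloud pubsub subscriptions describe <subscription-name> --format='value(enableExactlyOnceDelivery)'
# Optionally turn it off, so expired or duplicate acks are tolerated instead of rejected
gcloud pubsub subscriptions update <subscription-name> --no-enable-exactly-once-delivery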

vbarbarosh commented 10 months ago

It seems that Promtail stops sending all logs after hitting the first GCP error. Here is my config:

server:
  disable: true

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://main:3100/loki/api/v1/push

scrape_configs:
- job_name: system
  static_configs:
  - targets:
      - localhost
    labels:
      job: varlogs
      host: xx.statuspage
      promtail: xx.statuspage
      __path__: /var/log/*log
- job_name: gcplog
  gcplog:
    project_id: xxxxxxxx
    subscription: grafana-subscription
    use_incoming_timestamp: false # default rewrite timestamps.
    labels:
      job: gcplog
  relabel_configs:
    - source_labels: ['__gcp_resource_type']
      target_label: 'resource_type'
    - source_labels: ['__gcp_resource_labels_project_id']
      target_label: 'project'

Initially I thought that only the GCP logs were affected, but after the first GCP error I see no logs from this node at all.
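(A quick way to check from the Loki side whether anything from this node is still arriving, using the client URL and host label from the config above; a sketch assuming that endpoint is reachable from wherever you run it:)

# Count log lines received from this host over the last 5 minutes (URL and label taken from the config above)
curl -G -s 'http://main:3100/loki/api/v1/query' --data-urlencode 'query=count_over_time({host="xx.statuspage"}[5m])'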

julien-sugg commented 10 months ago

> @julien-sugg is this still an issue for you?

Hi, my apologies, I totally forgot about this pending topic.

I am no longer working with GCP, so I can't provide any update.

vbarbarosh commented 10 months ago

My current workaround for the problem is to add a crontab job: * * * * * /.../bin/restart-if-gcp-failed

#!/bin/bash
# http://www.gnu.org/software/bash/manual/bash.html#The-Set-Builtin
# http://redsymbol.net/articles/unofficial-bash-strict-mode/

set -o nounset -o errexit -o pipefail

script=$(realpath "$0")
scriptdir=$(dirname "$script")
scriptname=$(basename "$script")

cd "$scriptdir/.."

# Restart the whole compose stack (and record when) if any error appears in the logs
if docker-compose logs | grep error; then
    docker-compose restart
    date >> restart
fi
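(A slightly narrower variant of the same idea: only inspect recent log lines so a single old error doesn't keep triggering restarts, and only bounce the Promtail container; the compose service name "promtail" is an assumption, adjust it to your compose file.)

#!/bin/bash
# Watchdog variant (sketch): check only recent log lines and restart only the
# promtail service instead of the whole stack.
set -o nounset -o errexit -o pipefail

cd "$(dirname "$(realpath "$0")")/.."

# Match the specific gcplog failure seen above, in the last 200 lines only
if docker-compose logs --tail=200 promtail | grep -q 'failed to receive pubsub messages'; then
    docker-compose restart promtail
    date >> restart
fi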