vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

GCP PubSub Source: messages stop being picked up and do not resume until restart #19418

Open siyegen opened 11 months ago

siyegen commented 11 months ago


Problem

We are using Vector to send events from our control plane through a queue (SQS for AWS and Pub/Sub for GCP), where they go through a few transforms before landing in a ClickHouse sink. On startup, and for some time after, messages are picked up and sent as expected. However, after some amount of time Vector stops processing new messages. It stays in this state until it is restarted, at which point it goes through the whole cycle again.

Vector is running in Kubernetes and is deployed via the Helm chart.

Error:

2023-12-15T15:38:23.837273Z DEBUG hyper::proto::h2::client: client request body error: error writing a body to connection: send stream capacity unexpectedly closed
2023-12-15T15:38:23.837348Z ERROR vector::internal_events::gcp_pubsub: Failed to fetch events. error=status: Unavailable, message: "The service was unable to fulfill your request. Please try again. [code=8a75]", details: [], metadata: MetadataMap { headers: {"content-disposition": "attachment"} } error_code="failed_fetching_events" error_type="request_failed" stage="receiving" internal_log_rate_limit=true
2023-12-15T15:38:23.837429Z  INFO vector::sources::gcp_pubsub: Retrying after timeout. timeout_secs=1.0
2023-12-15T15:38:23.837472Z DEBUG Connection{peer=Client}: h2::codec::framed_write: send frame=GoAway { error_code: NO_ERROR, last_stream_id: StreamId(0) }
2023-12-15T15:38:23.837492Z DEBUG Connection{peer=Client}: h2::proto::connection: Connection::poll; connection error error=GoAway(b"", NO_ERROR, Library)
2023-12-15T15:38:23.837557Z DEBUG Connection{peer=Client}: rustls::common_state: Sending warning alert CloseNotify    
2023-12-15T15:38:24.839014Z DEBUG hyper::client::connect::dns: resolving host="pubsub.googleapis.com"
2023-12-15T15:38:24.855878Z DEBUG hyper::client::connect::http: connecting to 209.85.200.95:443
2023-12-15T15:38:24.857479Z DEBUG hyper::client::connect::http: connected to 209.85.200.95:443
2023-12-15T15:38:24.857518Z DEBUG rustls::client::hs: No cached session for DnsName("pubsub.googleapis.com")    
2023-12-15T15:38:24.857651Z DEBUG rustls::client::hs: Not resuming any session    
2023-12-15T15:38:24.858796Z DEBUG rustls::client::hs: Using ciphersuite TLS13_AES_256_GCM_SHA384    
2023-12-15T15:38:24.858820Z DEBUG rustls::client::tls13: Not resuming    
2023-12-15T15:38:24.858942Z DEBUG rustls::client::tls13: TLS1.3 encrypted extensions: [Protocols([ProtocolName(6832)])]    
2023-12-15T15:38:24.858971Z DEBUG rustls::client::hs: ALPN protocol is Some(b"h2")    
2023-12-15T15:38:24.859305Z DEBUG h2::client: binding client connection
2023-12-15T15:38:24.859329Z DEBUG h2::client: client connection bound
2023-12-15T15:38:24.859333Z DEBUG h2::codec::framed_write: send frame=Settings { flags: (0x0), enable_push: 0, initial_window_size: 2097152, max_frame_size: 16384 }
2023-12-15T15:38:24.859383Z DEBUG vector::sources::gcp_pubsub: Starting streaming pull.

Nothing is processed again after this, but I've included the portion of the log following the error, which shows that it appears to start pulling again at the very end. Despite this, no further messages are read from Pub/Sub, and there are also no further errors. In fact, there are debug log lines showing token generation and the stream pull restarting (an example from earlier in the same run is below), but after the The service was unable to fulfill your request. Please try again. [code=8a75] error above there are no further occurrences of the token / restarting-stream messages in the logs until Vector is restarted (which was 8 hours later in this particular case).

2023-12-15T14:39:41.493245Z DEBUG vector::sources::gcp_pubsub: New authentication token generated, restarting stream.
2023-12-15T14:39:41.493270Z DEBUG vector::sources::gcp_pubsub: Retrying immediately.
2023-12-15T14:39:41.493320Z DEBUG hyper::client::connect::dns: resolving host="pubsub.googleapis.com"
2023-12-15T14:39:41.493423Z DEBUG Connection{peer=Client}: h2::codec::framed_write: send frame=Data { stream_id: StreamId(1), flags: (0x1: END_STREAM) }
2023-12-15T14:39:41.493444Z DEBUG Connection{peer=Client}: h2::codec::framed_write: send frame=Reset { stream_id: StreamId(1), error_code: CANCEL }
2023-12-15T14:39:41.493534Z DEBUG Connection{peer=Client}: h2::codec::framed_write: send frame=GoAway { error_code: NO_ERROR, last_stream_id: StreamId(0) }
2023-12-15T14:39:41.493558Z DEBUG Connection{peer=Client}: h2::proto::connection: Connection::poll; connection error error=GoAway(b"", NO_ERROR, Library)
2023-12-15T14:39:41.493603Z DEBUG Connection{peer=Client}: rustls::common_state: Sending warning alert CloseNotify    
2023-12-15T14:39:41.509014Z DEBUG hyper::client::connect::http: connecting to 142.250.1.95:443
2023-12-15T14:39:41.510065Z DEBUG hyper::client::connect::http: connected to 142.250.1.95:443
2023-12-15T14:39:41.510090Z DEBUG rustls::client::hs: No cached session for DnsName("pubsub.googleapis.com")    
2023-12-15T14:39:41.510197Z DEBUG rustls::client::hs: Not resuming any session    
2023-12-15T14:39:41.511469Z DEBUG rustls::client::hs: Using ciphersuite TLS13_AES_256_GCM_SHA384    
2023-12-15T14:39:41.511508Z DEBUG rustls::client::tls13: Not resuming    
2023-12-15T14:39:41.511689Z DEBUG rustls::client::tls13: TLS1.3 encrypted extensions: [Protocols([ProtocolName(6832)])]    
2023-12-15T14:39:41.511707Z DEBUG rustls::client::hs: ALPN protocol is Some(b"h2")    
2023-12-15T14:39:41.512041Z DEBUG h2::client: binding client connection
2023-12-15T14:39:41.512067Z DEBUG h2::client: client connection bound
2023-12-15T14:39:41.512071Z DEBUG h2::codec::framed_write: send frame=Settings { flags: (0x0), enable_push: 0, initial_window_size: 2097152, max_frame_size: 16384 }
2023-12-15T14:39:41.512142Z DEBUG vector::sources::gcp_pubsub: Starting streaming pull.

Are there any additional ways to get more debug information out, or some other metric, that could help determine whether this is an issue inside Vector or something on our side?
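One way to narrow this down, given that the config below already feeds internal_metrics into a prometheus_exporter sink, is to graph per-component throughput and see whether the gcp_pubsub source alone goes flat. A minimal PromQL sketch, assuming the metric name and component_id tag that internal_metrics typically exports (check your exporter's output for the exact names):

# Events received per second by the Pub/Sub source; a sustained 0 after the
# Unavailable error above would point at a source-side stall.
rate(vector_component_received_events_total{component_id="metrics_events_queue"}[5m])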

Configuration

customConfig:
  api:
    enabled: true
    address: 0.0.0.0:8686
    playground: true
  sources:
    vector_metrics:
      type: internal_metrics
    metrics_events_queue:
      type: gcp_pubsub
      project: <removed>
      subscription: <removed> 
      decoding:
        codec: json
  transforms:
    parse_timestamp:
      type: remap
      inputs: [metrics_events_queue]
      source: |
        .timestamp = to_unix_timestamp(from_unix_timestamp!(.timestamp, unit: "milliseconds"))
    only_backups:
      type: filter
      inputs:
        - parse_timestamp
      condition: |
        starts_with!(.kind, "Backup")
    flatten:
      type: remap
      inputs:
        - only_backups
      source: |
        .state = .event.state
        .size = .event.size
    only_maintenance:
      type: filter
      inputs:
        - parse_timestamp
      condition: |
        ends_with!(.kind, "Maintenance")
  sinks:
    prometheus:
      type: prometheus_exporter
      inputs: [vector_metrics]
      address: "0.0.0.0:9090"
    clickhouse:
      type: clickhouse
      inputs: [flatten]
      endpoint: <removed>
      database: default
      table: backup_events
      auth:
        ...
      acknowledgements:
        enabled: true
    clickhouse_maintenance:
      type: clickhouse
      inputs: [only_maintenance]
      endpoint: <removed>
      database: default
      table: maintenance_events
      auth:
        ...
      acknowledgements:
        enabled: true

Version

0.34.1

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

jszwedko commented 11 months ago

https://github.com/vectordotdev/vector/issues/12608 is a related issue, but that one was ostensibly fixed.

clong-msec commented 10 months ago

I don't think this is fixed; I'm still seeing the issue on my end. There are messages sitting in the topic, but I'm getting:

Jan 16 00:00:36 example-host vector[1308027]: 2024-01-16T00:00:36.494380Z ERROR vector::internal_events::gcp_pubsub: Failed to fetch events. error=status: Unavailable, message: "The service was unable to fulfill your request. Please try again. [code=8a75]", details: [], metadata: MetadataMap { headers: {} } error_code="failed_fetching_events" error_type="request_failed" stage="receiving" internal_log_rate_limit=true
Jan 16 00:00:36 example-host vector[1308027]: 2024-01-16T00:00:36.494457Z  INFO vector::sources::gcp_pubsub: Retrying after timeout. timeout_secs=1.0
Jan 16 00:02:13 example-host vector[1308027]: 2024-01-16T00:02:13.854800Z ERROR vector::internal_events::gcp_pubsub: Failed to fetch events. error=status: Unavailable, message: "The service was unable to fulfill your request. Please try again. [code=8a75]", details: [], metadata: MetadataMap { headers: {} } error_code="failed_fetching_events" error_type="request_failed" stage="receiving" internal_log_rate_limit=true
Jan 16 00:02:13 example-host vector[1308027]: 2024-01-16T00:02:13.854857Z  INFO vector::sources::gcp_pubsub: Retrying after timeout. timeout_secs=1.0
Jan 16 00:04:00 example-host vector[1308027]: 2024-01-16T00:04:00.914161Z ERROR vector::internal_events::gcp_pubsub: Failed to fetch events. error=status: Unavailable, message: "The service was unable to fulfill your request. Please try again. [code=8a75]", details: [], metadata: MetadataMap { headers: {} } error_code="failed_fetching_events" error_type="request_failed" stage="receiving" internal_log_rate_limit=true
Jan 16 00:04:00 example-host vector[1308027]: 2024-01-16T00:04:00.914213Z  INFO vector::sources::gcp_pubsub: Retrying after timeout. timeout_secs=1.0
Jan 16 00:05:55 example-host vector[1308027]: 2024-01-16T00:05:55.840448Z ERROR vector::internal_events::gcp_pubsub: Failed to fetch events. error=status: Unavailable, message: "The service was unable to fulfill your request. Please try again. [code=8a75]", details: [], metadata: MetadataMap { headers: {} } error_code="failed_fetching_events" error_type="request_failed" stage="receiving" internal_log_rate_limit=true
Jan 16 00:05:55 example-host vector[1308027]: 2024-01-16T00:05:55.840507Z  INFO vector::sources::gcp_pubsub: Retrying after timeout. timeout_secs=1.0
Jan 16 00:07:39 example-host vector[1308027]: 2024-01-16T00:07:39.554190Z ERROR vector::internal_events::gcp_pubsub: Failed to fetch events. error=status: Unavailable, message: "The service was unable to fulfill your request. Please try again. [code=8a75]", details: [], metadata: MetadataMap { headers: {} } error_code="failed_fetching_events" error_type="request_failed" stage="receiving" internal_log_rate_limit=true
Jan 16 00:07:39 example-host vector[1308027]: 2024-01-16T00:07:39.554249Z  INFO vector::sources::gcp_pubsub: Retrying after timeout. timeout_secs=1.0
alexandrst88 commented 8 months ago

Hi! I'm running Vector 0.36.0 and still have the same issue with the Pub/Sub source. @jszwedko, do you need more info to reproduce?

alexandrst88 commented 8 months ago

@jszwedko I think the issue is that this error comes from https://github.com/vectordotdev/vector/blob/master/src/sources/gcp_pubsub.rs#L719, which shouldn't be raised as an error, as is done at https://github.com/vectordotdev/vector/blob/master/src/sources/gcp_pubsub.rs#L717. For me, it should have some configurable backoff before being raised as an actual error.

jszwedko commented 8 months ago

I'm not sure I see what you are saying, @alexandrst88. The code you are pointing at will result in a retry in either case, but stream errors are retried immediately to reduce interruption.

Unfortunately, we haven't been able to dig into this one further yet.

alexandrst88 commented 8 months ago

@jszwedko My point is that those errors are flooding the Vector logs. From my point of view, I'd implement logic along these lines:

retry_errors_amount: 20. If after 20 retries there is still an issue with GCP, raise a warning, in case messages have been fetched successfully.
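A minimal Rust sketch of that idea, assuming a hypothetical retry_errors_amount option (this is not the existing code path in src/sources/gcp_pubsub.rs, just an illustration of the proposed behaviour):

// Hypothetical tracker: keep fetch failures at debug level until a configurable
// number of consecutive retries have failed, then escalate to an error.
struct RetryTracker {
    retry_errors_amount: u32, // proposed setting, e.g. 20
    consecutive_failures: u32,
}

impl RetryTracker {
    fn on_pull_result(&mut self, result: Result<usize, tonic::Status>) {
        match result {
            Ok(_messages) => {
                // Any successful fetch resets the counter, so transient
                // Unavailable responses never surface as errors.
                self.consecutive_failures = 0;
            }
            Err(status) => {
                self.consecutive_failures += 1;
                if self.consecutive_failures >= self.retry_errors_amount {
                    tracing::error!(message = "Failed to fetch events.", %status);
                } else {
                    tracing::debug!(message = "Fetch failed, retrying.", %status);
                }
            }
        }
    }
}

fn main() {
    let mut tracker = RetryTracker { retry_errors_amount: 20, consecutive_failures: 0 };
    // Simulated retry loop: early failures stay quiet, a success resets the count.
    tracker.on_pull_result(Err(tonic::Status::unavailable(
        "The service was unable to fulfill your request. Please try again.",
    )));
    tracker.on_pull_result(Ok(5));
}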

jszwedko commented 8 months ago

> @jszwedko My point is that those errors are flooding the Vector logs. From my point of view, I'd implement logic along these lines:
>
> retry_errors_amount: 20. If after 20 retries there is still an issue with GCP, raise a warning, in case messages have been fetched successfully.

Ah I see, so the issue is just the warning logs when retries happen?

alexandrst88 commented 8 months ago

> @jszwedko My point is that those errors are flooding the Vector logs. From my point of view, I'd implement logic along these lines: retry_errors_amount: 20. If after 20 retries there is still an issue with GCP, raise a warning, in case messages have been fetched successfully.

> Ah I see, so the issue is just the warning logs when retries happen?

For me, yes.

jszwedko commented 8 months ago

> @jszwedko My point is that those errors are flooding the Vector logs. From my point of view, I'd implement logic along these lines: retry_errors_amount: 20. If after 20 retries there is still an issue with GCP, raise a warning, in case messages have been fetched successfully.

> Ah I see, so the issue is just the warning logs when retries happen?

> For me, yes.

Makes sense. We have had complaints before about retries being logged at the warn level. I'd be open to seeing them bumped down to debug.

clong-msec commented 8 months ago

I went down a huge rabbit hole here, but it turns out this error gets thrown when the subscription has no events left to pull. It's extremely confusing to see an error message say "The service was unable to fulfill your request" when, in reality, the request was fulfilled and there's just no data to pull.

I don't know whether updated Pub/Sub libraries have addressed this. This seems like a relevant issue: https://github.com/googleapis/google-cloud-dotnet/issues/1505

Happy to provide any logs that help debug or troubleshoot this. At the very least, these messages should be moved to debug, if for no other reason than that they're incredibly misleading.

jszwedko commented 8 months ago

> I went down a huge rabbit hole here, but it turns out this error gets thrown when the subscription has no events left to pull. It's extremely confusing to see an error message say "The service was unable to fulfill your request" when, in reality, the request was fulfilled and there's just no data to pull.
>
> I don't know whether updated Pub/Sub libraries have addressed this. This seems like a relevant issue: googleapis/google-cloud-dotnet#1505
>
> Happy to provide any logs that help debug or troubleshoot this. At the very least, these messages should be moved to debug, if for no other reason than that they're incredibly misleading.

Aha, interesting. Nice find. Agreed then, these log messages could be moved to debug to avoid confusion. Happy to see a PR for that if anyone is so motivated 🙏

clong-msec commented 8 months ago

I suspect something is breaking in Vector when Pub/Sub topics send a very low volume of logs to a subscription. I don't really know how to prove it, though.

I currently have Vector configured to pull from two separate Pub/Sub subscriptions with an identical config, and it regularly stops pulling logs from the one that receives a low volume of logs.

You can see in this first screenshot that Vector is running just fine with the topic that sends a higher, more regular volume of events.

[Screenshot: the higher-volume subscription, where events are being pulled and acknowledged normally.]

However, in this second subscription you can see that un-acked events are piling up and Vector is no longer acking them. Restarting the service is usually enough to get it going again, but something is definitely not right here.

[Screenshot: the low-volume subscription, where unacknowledged events are piling up.]

I have debug logging enabled and there's absolutely no indication that anything is wrong.

If anyone has ideas on steps I could take to troubleshoot this further, I'm totally open to ideas.

Config:

sources:
  vector_logs:
    type: internal_logs

  high-volume-sub:
    type: gcp_pubsub
    project: [redacted]
    subscription: [redacted]
    credentials_path: [redacted]
    retry_delay_secs: 300
    poll_time_seconds: 60
    keepalive_secs: 30
    ack_deadline_secs: 10

  low-volume-sub:
    type: gcp_pubsub
    project: [redacted]
    subscription: [redacted]
    credentials_path: [redacted]
    retry_delay_secs: 300
    poll_time_seconds: 60
    keepalive_secs: 30
    ack_deadline_secs: 10

atibdialpad commented 2 weeks ago

We are seeing something very similar with one of our Pub/Sub sources, which pulls from a subscription that receives messages in bursts and then nothing for long periods. The error is slightly different, though:

2024-11-11T00:22:09.976361Z ERROR vector::internal_events::gcp_pubsub: Failed to fetch events. error=status: NotFound, message: "Resource not found (resource=log-ingest).", details: [], metadata: MetadataMap { headers: {} } error_code="failed_fetching_events" error_type="request_failed" stage="receiving" internal_log_rate_limit=true

As stated in @clong-msec's comment, things start to work again when the service is restarted. ++ @bruceg

clong-msec commented 2 weeks ago

Yeah, this is super frustrating. We ended up modifying the Vector service file to just restart it every hour as a workaround, but something is definitely wrong with the Pub/Sub source when pulling from bursty topics.
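For anyone needing the same stopgap, a minimal sketch of that kind of periodic restart on a systemd-managed install (the exact change wasn't shared, so this is an assumption; applied e.g. with systemctl edit vector):

[Service]
# End the process after an hour of runtime; systemd then starts a fresh
# Vector instance, which re-establishes the Pub/Sub streaming pull.
RuntimeMaxSec=1h
Restart=always
RestartSec=5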