Open siyegen opened 11 months ago
https://github.com/vectordotdev/vector/issues/12608 is a related issue but that one was ostensibly fixed.
I don't think this is fixed; I'm still having the issue on my end. There are messages sitting in the topic, but I'm getting:
Jan 16 00:00:36 example-host vector[1308027]: 2024-01-16T00:00:36.494380Z ERROR vector::internal_events::gcp_pubsub: Failed to fetch events. error=status: Unavailable, message: "The service was unable to fulfill your request. Please try again. [code=8a75]", details: [], metadata: MetadataMap { headers: {} } error_code="failed_fetching_events" error_type="request_failed" stage="receiving" internal_log_rate_limit=true
Jan 16 00:00:36 example-host vector[1308027]: 2024-01-16T00:00:36.494457Z INFO vector::sources::gcp_pubsub: Retrying after timeout. timeout_secs=1.0
Jan 16 00:02:13 example-host vector[1308027]: 2024-01-16T00:02:13.854800Z ERROR vector::internal_events::gcp_pubsub: Failed to fetch events. error=status: Unavailable, message: "The service was unable to fulfill your request. Please try again. [code=8a75]", details: [], metadata: MetadataMap { headers: {} } error_code="failed_fetching_events" error_type="request_failed" stage="receiving" internal_log_rate_limit=true
Jan 16 00:02:13 example-host vector[1308027]: 2024-01-16T00:02:13.854857Z INFO vector::sources::gcp_pubsub: Retrying after timeout. timeout_secs=1.0
Jan 16 00:04:00 example-host vector[1308027]: 2024-01-16T00:04:00.914161Z ERROR vector::internal_events::gcp_pubsub: Failed to fetch events. error=status: Unavailable, message: "The service was unable to fulfill your request. Please try again. [code=8a75]", details: [], metadata: MetadataMap { headers: {} } error_code="failed_fetching_events" error_type="request_failed" stage="receiving" internal_log_rate_limit=true
Jan 16 00:04:00 example-host vector[1308027]: 2024-01-16T00:04:00.914213Z INFO vector::sources::gcp_pubsub: Retrying after timeout. timeout_secs=1.0
Jan 16 00:05:55 example-host vector[1308027]: 2024-01-16T00:05:55.840448Z ERROR vector::internal_events::gcp_pubsub: Failed to fetch events. error=status: Unavailable, message: "The service was unable to fulfill your request. Please try again. [code=8a75]", details: [], metadata: MetadataMap { headers: {} } error_code="failed_fetching_events" error_type="request_failed" stage="receiving" internal_log_rate_limit=true
Jan 16 00:05:55 example-host vector[1308027]: 2024-01-16T00:05:55.840507Z INFO vector::sources::gcp_pubsub: Retrying after timeout. timeout_secs=1.0
Jan 16 00:07:39 example-host vector[1308027]: 2024-01-16T00:07:39.554190Z ERROR vector::internal_events::gcp_pubsub: Failed to fetch events. error=status: Unavailable, message: "The service was unable to fulfill your request. Please try again. [code=8a75]", details: [], metadata: MetadataMap { headers: {} } error_code="failed_fetching_events" error_type="request_failed" stage="receiving" internal_log_rate_limit=true
Jan 16 00:07:39 example-host vector[1308027]: 2024-01-16T00:07:39.554249Z INFO vector::sources::gcp_pubsub: Retrying after timeout. timeout_secs=1.0
Hi! I'm running Vector 0.36.0 and still see the same issue with the gcp_pubsub source. @jszwedko do you need more info to reproduce?
@jszwedko I think this error comes from https://github.com/vectordotdev/vector/blob/master/src/sources/gcp_pubsub.rs#L719, and it shouldn't be raised as an error the way it is here: https://github.com/vectordotdev/vector/blob/master/src/sources/gcp_pubsub.rs#L717. In my view it should go through some configurable backoff before being raised as an actual error.
I'm not sure I see what you are saying, @alexandrst88. The code you are pointing at will result in a retry in either case, but stream errors are retried immediately to reduce interruption.
Unfortunately we haven't been able to dig into this one further yet.
@jszwedko my point is that those errors are flooding the Vector logs. From my point of view, I'd implement logic like this: with retry_errors_amount: 20, only raise a warning if, after 20 retries, there is still an issue with GCP and no messages have been successfully fetched.
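The logic proposed above could be sketched like this (a hypothetical illustration, not Vector's actual code; `RetryTracker`, the `error_threshold` field standing in for the proposed `retry_errors_amount`, and the level names are all made up for this example):

```rust
// Hypothetical sketch: escalate a retried fetch failure to ERROR level
// only after a configurable number of consecutive failures, logging
// earlier attempts at a quieter level; a success resets the counter.
#[derive(Debug, PartialEq)]
enum LogLevel {
    Debug,
    Error,
}

struct RetryTracker {
    consecutive_failures: u32,
    // Stand-in for the proposed `retry_errors_amount` setting.
    error_threshold: u32,
}

impl RetryTracker {
    fn new(error_threshold: u32) -> Self {
        Self { consecutive_failures: 0, error_threshold }
    }

    // Called on each failed fetch; returns the level to log at.
    fn on_failure(&mut self) -> LogLevel {
        self.consecutive_failures += 1;
        if self.consecutive_failures >= self.error_threshold {
            LogLevel::Error
        } else {
            LogLevel::Debug
        }
    }

    // Called on a successful fetch; clears the failure streak.
    fn on_success(&mut self) {
        self.consecutive_failures = 0;
    }
}
```

With a threshold of 20 as suggested, the first 19 consecutive failures would log quietly and only the 20th would surface as an error, while any successful fetch in between resets the streak.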
Ah I see, so the issue is just the warning logs when retries happen?
For me, yes.
Makes sense. We have had complaints about retries being logged at the warn level before. I'd be open to seeing them bumped down to debug.
I went down a huge rabbit hole here, but it turns out this error gets thrown if the subscription has no events left to pull. It's extremely confusing to see an error message say message: "The service was unable to fulfill your request" when in reality the request was fulfilled; there's just no data to pull.
I don't know if updated Pub/Sub libraries have addressed this. This seems like a relevant issue: https://github.com/googleapis/google-cloud-dotnet/issues/1505
Happy to provide any logs that help debug or troubleshoot this. At the very least, these should be moved to debug, if for no other reason than that they're incredibly misleading.
Aha, interesting. Nice find. Agreed then, these log messages could be moved to debug to avoid confusion. Happy to see a PR for that if anyone is so motivated 🙏
I suspect something is breaking with Vector when pubsub topics send a very low volume of logs to a subscription. I don't really know how to prove it, though.
I currently have vector configured to pull from two separate pubsub subscriptions with an identical config, and it regularly stops pulling logs from the one that gets a low volume of logs.
You can see in the first screenshot that Vector is running just fine with the topic that sends a higher, more regular volume of events.
However, in this subscription you can see that the un-ack'ed events are piling up and Vector is no longer ack'ing them. Restarting the service is usually enough to get it going again, but something is definitely not right here.
I have debug logging enabled and there's absolutely no indication that anything is wrong.
If anyone has ideas on steps I could take to troubleshoot this further, I'm totally open to ideas.
Config:
sources:
  vector_logs:
    type: internal_logs
  high-volume-sub:
    type: gcp_pubsub
    project: [redacted]
    subscription: [redacted]
    credentials_path: [redacted]
    retry_delay_secs: 300
    poll_time_seconds: 60
    keepalive_secs: 30
    ack_deadline_secs: 10
  low-volume-sub:
    type: gcp_pubsub
    project: [redacted]
    subscription: [redacted]
    credentials_path: [redacted]
    retry_delay_secs: 300
    poll_time_seconds: 60
    keepalive_secs: 30
    ack_deadline_secs: 10
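When debug logs show nothing, one way to get more visibility is to export Vector's internal metrics and watch per-source counters such as component_errors_total and component_received_events_total externally. A sketch using the stock internal_metrics source and prometheus_exporter sink (the component names and address here are illustrative, not from the config above):

```yaml
# Illustrative addition: expose Vector's internal metrics so that error
# and received-event counters for the gcp_pubsub sources can be graphed
# or alerted on, making a stalled source visible without debug logs.
sources:
  internal:
    type: internal_metrics
sinks:
  prometheus:
    type: prometheus_exporter
    inputs: [internal]
    address: 0.0.0.0:9598
```

A flatlining component_received_events_total for the low-volume source, with no matching rise in component_errors_total, would support the theory that the source stalls silently.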
We are seeing something very similar in one of our pub/sub sources, which pulls from a subscription that receives messages in bursts and then nothing for long periods. The error is slightly different though:
2024-11-11T00:22:09.976361Z ERROR vector::internal_events::gcp_pubsub: Failed to fetch events. error=status: NotFound, message: "Resource not found (resource=log-ingest).", details: [], metadata: MetadataMap { headers: {} } error_code="failed_fetching_events" error_type="request_failed" stage="receiving" internal_log_rate_limit=true
As stated in @clong-msec's comment, things start to work again when the service is restarted. ++ @bruceg
Yeah this is super frustrating. We ended up modifying the Vector service file to just restart every hour as a workaround, but something is definitely wrong with the PubSub source when pulling from bursty topics
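For reference, an hourly-restart workaround like the one described above can be expressed as a systemd drop-in (an illustrative fragment, assuming the packaged vector.service unit; RuntimeMaxSec stops the unit after the given runtime, and Restart=always brings it back up):

```ini
# /etc/systemd/system/vector.service.d/override.conf
# Create with `systemctl edit vector`, then run `systemctl daemon-reload`.
[Service]
Restart=always
RuntimeMaxSec=1h
```

This is only a stopgap; it papers over the stalled source rather than fixing it, and it briefly interrupts all pipelines in that Vector instance on each restart.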
A note for the community
Problem
We are using Vector to send events from our control plane, through a queue (SQS for AWS and Pub/Sub for GCP), where they go through a few transforms before going to a clickhouse sink. On startup, and for some time after, messages are picked up and sent as expected. However, after some amount of time, Vector stops processing new messages. It stays in this state until it's restarted, after which it goes through the whole cycle again.
Vector is running in Kubernetes and is deployed via the Helm chart.
I've been running

vector tap --inputs-of "clickhouse" --outputs-of "metrics_events_queue" --interval 1 --limit 1500

to help debug; when it's working I can see events come through as expected (though due to how tap works I might miss one or two). In internal_metrics there is a single error, component_errors: {error_code: failed_fetching_events, error_type: request_failed}, that shows up, and from that point Vector does not seem to process anything from Pub/Sub until it's restarted. There is a corresponding error in the logs, which I've included below.
Error:
While nothing is processed again after this, I've included the bit of the log after the error showing that it appears to have started pulling again at the very end. Despite this there are no further messages read from Pub/Sub, but also no further errors. In fact there are debug log lines showing token generation / stream pull restarting, but after the

The service was unable to fulfill your request. Please try again. [code=8a75]

error above there are no further occurrences of the token / restarting-stream messages in the logs until Vector is restarted (which was 8 hours later in this particular case). Are there any additional ways to get more debug information out, or some other metric that can help explain whether this is an issue inside Vector or something on our side?
Configuration
Version
0.34.1
Debug Output
No response
Example Data
No response
Additional Context
No response
References
No response