googleapis / nodejs-pubsub

Node.js client for Google Cloud Pub/Sub: Ingest event streams from anywhere, at any scale, for simple, reliable, real-time stream analytics.
https://cloud.google.com/pubsub/
Apache License 2.0
518 stars 230 forks source link

Publishing fails after seemingly random interval, requiring application restart #1810

Closed nrheckman closed 11 months ago

nrheckman commented 12 months ago

After a period of successful message publishing, all requests will begin timing out until the application is restarted. Interval has been observed to be anywhere from 1 to 7 days.

App is running within a container in GKE. It is a nodejs application built from a Typescript project using the following versions:

# nodejs --version
v14.21.3

# npm --version
6.14.18

# head -n4 node_modules/\@google-cloud/pubsub/package.json 
{
  "name": "@google-cloud/pubsub",
  "description": "Cloud Pub/Sub Client Library for Node.js",
  "version": "4.0.1",

There are 300 instances of our application running at all times, and each individual instance generates approximately 35 to 45 messages per second. The messages are split between three topics, the most heavily published topic receives up to approximately 35 messages per second per instance. Total message rate for all instances combined is between 11,000 and 13,000 messages per second. Observed publication failures appear to impact only a single instance at random, and can occur half a dozen times or more throughout the day. It does not appear to be uptime-related, as some instances can run for a week before being impacted. Once an application instance begins failing to publish messages, it fails all message publish requests until it is restarted. Restart gets triggered by k8s due to memory consumption exceeding its maximum value. This can take 10 minutes or more as message publishing attemps back up and become queued.

The impact of a single application instance being in this state is that as many as 2,500 of our customers are unable to interact with their products until the instance they are communicating with is restarted. This is a significant degradation of service for these customers and is generating sizable support issues for us.

The errors reported by the Pub/Sub client all appear to indicate that the request has timed out. "Error: Total timeout of API google.pubsub.v1.Publisher exceeded 60000 milliseconds before any response was received."

Other than this being observed on our production applications, I don't have a specific set of circumstances which can be used to reproduce it.

I have included the typescript source file, which is responsible for message publishing and error handling.

import {EventStreamPubSubConfig} from "../config";
import {eventStreamMetrics} from "../observability";
import {EventStream} from "./ievent_stream";
import {EventMessage} from "./event_message";

import * as bunyan from "bunyan";
import {PublishOptions, PubSub, Topic} from "@google-cloud/pubsub";

export interface EventStreamPubSubOptions {
    config: EventStreamPubSubConfig;
    log: bunyan;
    messageOrdering: boolean;
    topic: string;
    pubSub: PubSub;
}

export class EventStreamPubSub implements EventStream {
    private options: EventStreamPubSubOptions;
    private log: bunyan;
    private topic: Topic;

    constructor(options: EventStreamPubSubOptions) {
        this.options = options;
        this.log = options.log;
        this.topic = this.attachToTopic();
    }

    private attachToTopic(): Topic {
        const publishOptions: PublishOptions = {
            batching: {
                // We are intentionally processing 1 message at a time in queue::enqueue,
                // so we should tell the PubSub library to publish right away.
                maxMessages: 1
            },
            gaxOpts: {
                timeout: this.options.config.timeoutMillis
            },
            messageOrdering: this.options.messageOrdering,
        }

        if(this.options.config.retry.enabled) {
            publishOptions.gaxOpts!.retry = {
                backoffSettings: this.options.config.retry.backoffSettings
            }
        }

        this.log.info(`Creating publisher for topic ${this.options.topic} with options: ${JSON.stringify(publishOptions)}`)

        return this.options.pubSub.topic(this.options.topic, publishOptions);
    }

    async create(message: EventMessage): Promise<void> {
        const timer = eventStreamMetrics?.pubSubWriter.startTimer();
        let m = {
            data: Buffer.from(JSON.stringify(message.message)),
            orderingKey: message.orderingKey
        };
        return this.topic.publishMessage(m).then(
            (resp) => {
                this.log.trace("Publish successful: " + resp);
                timer?.({
                    result: "success",
                    topic: this.options.topic
                });
            },
            (reason) => {
                this.log.error("Publish failed: " + reason);
                timer?.({
                    result: "failure",
                    topic: this.options.topic
                });

                // Drop the current message and move on
                this.topic.resumePublishing(m.orderingKey);
            }
        );
    }
}
feywind commented 11 months ago

@nrheckman Can you try the latest version or try implementing the defaults from this PR in your PubSub constructor call?

https://github.com/googleapis/nodejs-pubsub/pull/1814

We've been seeing a bunch of these and have a belief that it may be fixed by enabling the grpc keepalives.

nrheckman commented 11 months ago

Looks like enabling GRPC keepalives fixed our problem! Thank you!