@callmehiphop is there anything in particular?
I'm running a Node app (some Node 11 version) on a basic GCE instance (f1-micro). It's been running fine for a long time, and at 16:35:28 Oct 09 (UTC) I started getting a variety of pubsub errors (listed above). The impact is that most of my subscribers aren't receiving any messages. Sending seems to be OK; I can see a build-up of messages via the console. I've had a few messages consumed after a restart, but it fails pretty quickly after that (within a few seconds).
@npomfret's experience would line up with my attempt to replicate the issue in GAE. However, since GAE instances restart or new instances come online when the service is unhealthy, this deployment method is impacted slightly less.
@npomfret could you tell me a little more about your subscribers, like
Note: I have a playground environment that I can tune using Locust to simulate load.
How many messages would you approximate are in the queue? This doesn't seem to matter; errors are present at both ~50 messages per second and ~500 messages per second.
What kind of flow control settings do you have configured? I am loading GAE Standard with Locust to have an appropriate number of instances to pull messages at a rate that matches the publisher.
What is your ack deadline set to? Defaults
How many subscription instances are you creating? Depends on the load generated on the publisher service (for 500 mps to PubSub, I have 5 GAE Standard F4 instances in the puller service)
How long on average does it take to ack a message? < 1 second, until it doesn't
Here are my observations with the latest libraries, in case they help anybody reproduce. The rest of the stack is the same as I described in other messages in this thread.
We have a regional k8s cluster with 3 pods (if you're not familiar with k8s, these are basically three Node.js processes running the same code). Each pod receives messages from the same subscription; about 10 messages arrive almost simultaneously at the beginning of each minute. Before the latest changes and this issue, i.e. when things were working fine, arriving messages were load-balanced among the three pods perpetually, with no errors observed; each message was randomly processed by one pod/instance.
Here is the reconstruction of events from the log, similar to what we have observed the past few days:
```
    at MessageStream._waitForClientReady (/mgmt/node_modules/@google-cloud/pubsub/build/src/message-stream.js:318:19)
  code: 4,
  details: 'Failed to connect before the deadline',
  metadata: Metadata { internalRepr: Map {}, options: {} } }
```
Around the same time, all other subscriptions report errors in the same pod. No message is ever received again by this pod.
```
    at MessageStream._waitForClientReady (/mgmt/node_modules/@google-cloud/pubsub/build/src/message-stream.js:318:19)
  code: 4,
  details: 'Failed to connect before the deadline',
  metadata: Metadata { internalRepr: Map {}, options: {} } }
```
Around the same time, many other subscriptions report errors in the same pod.
This whole story is consistent with what we have seen over the past few days.
@gae123 there's a chance we have a reproduction; we're working towards testing on a few @grpc/grpc-js versions, so that we can better isolate where the breakage occurs. Will keep you updated.
I've hit this issue too - for now, I'm using this workaround:

```js
const initSubscriber = () => {
  const pubsub = new PubSub();
  const subscription = pubsub.subscription(topic, options);
  subscription.on('message', handler.handleMessage);
  subscription.on('error', e => {
    console.error('Subscriber error handler caught exception. Restarting subscription.', e);
    initSubscriber();
  });
};

initSubscriber();
```
@Redgwell what environment are you running PubSub in (Kubernetes Engine? App Engine?).
@bcoe Running in GKE.
"@google-cloud/pubsub": "^1.1.0",
@Redgwell a couple more follow up questions:
@bcoe
I did a quick Google search and landed here. Locally, I was able to replicate the same error message and behaviour (that is, no longer receiving messages after the error) by disconnecting my internet connection, waiting for the deadline exceeded error, and then reconnecting.
To make this a little faster to replicate, I set the streamingOptions timeout value in the pubsub options object to have a shorter streaming connection timeout:

```js
streamingOptions: {
  timeout: 5000
}
```
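For anyone trying to reproduce the same way, this is roughly where I believe that snippet sits: in the subscriber options passed to pubsub.subscription(). This is my own sketch, not the commenter's code; topic and handler are the same placeholders used in the workaround above, and 5000 ms is only meant to force the failure quickly while testing:

```js
const {PubSub} = require('@google-cloud/pubsub');

const pubsub = new PubSub();

// `topic` and `handler` are assumed to be defined elsewhere, as in the workaround above.
const options = {
  streamingOptions: {
    // Artificially short streaming-pull timeout (ms) so the
    // "Failed to connect before the deadline" path is hit within seconds.
    timeout: 5000,
  },
};

const subscription = pubsub.subscription(topic, options);
subscription.on('message', handler.handleMessage);
```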
One thing that might be useful for fixing this issue - if you look at my workaround, inside the initSubscriber function, I create a new PubSub() object. I found that if I didn't do this inside the initSubscriber function, and instead did it just once when the app starts, the application just kept getting deadline exceeded messages over and over again, even once I'd reconnected to the internet.
That makes me think there is some state being set in the PubSub object that puts it in a faulted state until it's recreated. Sorry, I don't have the time right now to dig into that assertion any further.
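For anyone copying that workaround, here is a variant sketch of my own (not part of the original comment) that also tears down the previous subscription before recreating the client. It assumes Subscription#close() and the standard EventEmitter removeAllListeners() are available, and may help avoid piling up listeners across restarts:

```js
const {PubSub} = require('@google-cloud/pubsub');

// `topic`, `options` and `handler` are assumed to be defined elsewhere,
// exactly as in the workaround above.
let subscription;

const initSubscriber = () => {
  // Recreate the client as well as the subscription: as noted above, reusing
  // the old PubSub instance appears to leave it in a faulted state.
  const pubsub = new PubSub();

  if (subscription) {
    // Best-effort teardown of the previous stream before replacing it.
    subscription.removeAllListeners();
    subscription.close().catch(() => { /* stream already broken; ignore */ });
  }

  subscription = pubsub.subscription(topic, options);
  subscription.on('message', handler.handleMessage);
  subscription.on('error', e => {
    console.error('Subscriber error handler caught exception. Restarting subscription.', e);
    initSubscriber();
  });
};

initSubscriber();
```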
@Redgwell thank you this is very thorough, and gives us a good starting point for debugging.
@Redgwell, could I bother you to test with @grpc/grpc-js@0.6.8? If you delete your node_modules and package-lock.json and run npm install @google-cloud/pubsub, you should get it now (you can run npm ls post-install to confirm).
We believe we've potentially addressed the bug requiring you to resubscribe.
Okay, I grabbed the latest as you said - here's the pubsub dependency tree from npm ls:
├─┬ @google-cloud/pubsub@1.1.2
│ ├─┬ @google-cloud/paginator@2.0.1
│ │ ├── arrify@2.0.1 deduped
│ │ └── extend@3.0.2 deduped
│ ├── @google-cloud/precise-date@1.0.1
│ ├── @google-cloud/projectify@1.0.1
│ ├── @google-cloud/promisify@1.0.2
│ ├── @sindresorhus/is@1.2.0
│ ├─┬ @types/duplexify@3.6.0
│ │ └── @types/node@12.7.12
│ ├── @types/long@4.0.0
│ ├── arrify@2.0.1
│ ├── async-each@1.0.3
│ ├── extend@3.0.2
│ ├─┬ google-auth-library@5.4.1
│ │ ├── arrify@2.0.1 deduped
│ │ ├── base64-js@1.3.1
│ │ ├── fast-text-encoding@1.0.0
│ │ ├─┬ gaxios@2.0.1
│ │ │ ├── abort-controller@3.0.0 deduped
│ │ │ ├── extend@3.0.2 deduped
│ │ │ ├─┬ https-proxy-agent@2.2.2
│ │ │ │ ├─┬ agent-base@4.3.0
│ │ │ │ │ └─┬ es6-promisify@5.0.0
│ │ │ │ │ └── es6-promise@4.2.8
│ │ │ │ └── debug@3.2.6 deduped
│ │ │ └── node-fetch@2.6.0 deduped
│ │ ├─┬ gcp-metadata@3.2.0
│ │ │ ├── gaxios@2.0.1 deduped
│ │ │ └─┬ json-bigint@0.3.0
│ │ │ └── bignumber.js@7.2.1
│ │ ├─┬ gtoken@4.1.0
│ │ │ ├── gaxios@2.0.1 deduped
│ │ │ ├─┬ google-p12-pem@2.0.2
│ │ │ │ └── node-forge@0.9.1
│ │ │ ├── jws@3.2.2 deduped
│ │ │ └── mime@2.4.4
│ │ ├─┬ jws@3.2.2
│ │ │ ├─┬ jwa@1.4.1
│ │ │ │ ├── buffer-equal-constant-time@1.0.1
│ │ │ │ ├─┬ ecdsa-sig-formatter@1.0.11
│ │ │ │ │ └── safe-buffer@5.2.0 deduped
│ │ │ │ └── safe-buffer@5.2.0 deduped
│ │ │ └── safe-buffer@5.2.0 deduped
│ │ └─┬ lru-cache@5.1.1
│ │ └── yallist@3.1.1
│ ├─┬ google-gax@1.7.4
│ │ ├─┬ @grpc/grpc-js@0.6.8
│ │ │ └── semver@6.3.0 deduped
│ │ ├─┬ @grpc/proto-loader@0.5.2
│ │ │ ├── lodash.camelcase@4.3.0
│ │ │ └── protobufjs@6.8.8 deduped
│ │ ├─┬ abort-controller@3.0.0
│ │ │ └── event-target-shim@5.0.1
│ │ ├─┬ duplexify@3.7.1
│ │ │ ├─┬ end-of-stream@1.4.4
│ │ │ │ └── once@1.4.0 deduped
│ │ │ ├── inherits@2.0.4 deduped
│ │ │ ├── readable-stream@2.3.6 deduped
│ │ │ └── stream-shift@1.0.0
│ │ ├── google-auth-library@5.4.1 deduped
│ │ ├── is-stream-ended@0.1.4 deduped
│ │ ├── lodash.at@4.6.0
│ │ ├── lodash.has@4.5.2
│ │ ├── node-fetch@2.6.0
│ │ ├── protobufjs@6.8.8 deduped
│ │ ├─┬ retry-request@4.1.1
│ │ │ ├─┬ debug@4.1.1
│ │ │ │ └── ms@2.1.2 deduped
│ │ │ └─┬ through2@3.0.1
│ │ │ └── readable-stream@2.3.6 deduped
│ │ ├── semver@6.3.0 deduped
│ │ └── walkdir@0.4.1
│ ├── is-stream-ended@0.1.4
│ ├── lodash.snakecase@4.1.1
│ ├── p-defer@3.0.0
│ └─┬ protobufjs@6.8.8
│ ├── @protobufjs/aspromise@1.1.2
│ ├── @protobufjs/base64@1.1.2
│ ├── @protobufjs/codegen@2.0.4
│ ├── @protobufjs/eventemitter@1.1.0
│ ├─┬ @protobufjs/fetch@1.1.0
│ │ ├── @protobufjs/aspromise@1.1.2 deduped
│ │ └── @protobufjs/inquire@1.1.0 deduped
│ ├── @protobufjs/float@1.0.2
│ ├── @protobufjs/inquire@1.1.0
│ ├── @protobufjs/path@1.1.2
│ ├── @protobufjs/pool@1.1.0
│ ├── @protobufjs/utf8@1.1.0
│ ├── @types/long@4.0.0 deduped
│ ├── @types/node@10.14.21
│ └── long@4.0.0
I tried my test loop again - with this code:
```js
const initSubscriber = () => {
  const pubsub = new PubSub();
  const subscription = pubsub.subscription(topic, options);
  subscription.on('message', handler.handleMessage);
  subscription.on('error', e => {
    console.error('Subscriber error handler caught exception. Restarting subscription.', e);
    // initSubscriber(); <------ commented this out, because I'm expecting that the fix means we don't need to do this anymore, right?
  });
};

initSubscriber();
```
So with that running, I pushed some messages, saw that the app was handling them, then disconnected wifi and waited for the deadline exceeded error. I reconnected wifi, pushed some more messages, and waited for about 2 minutes. The app didn't process the messages. After restarting the app, it immediately processed them.
So unfortunately, it doesn't seem that this has fixed things, at least for my test loop of disconnecting the internet connection.
We found one service that happily didn't observe these failures and we are now downgrading our affected services to this version until a confirmed working solution is provided:
@google-cloud/pubsub: 0.30.1
google-gax: 1.1.4
@grpc/grpc-js: ^0.4.3
You have to pin the google-gax version, as it would otherwise pick up the latest 1.x.x (check your lock file):

```
"resolutions": {
  "google-gax": "1.1.4"
}
```
We are also adding the initSubscriber workaround suggested by @Redgwell - it does seem to help. We're also adding an initPublisher to recreate the PubSub client in case publish fails with code 4.
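For reference, a rough sketch of what such an initPublisher could look like. This is my interpretation of the description above, not the commenter's actual code; topicName is a placeholder, and the retry is limited to gRPC code 4 (DEADLINE_EXCEEDED):

```js
const {PubSub} = require('@google-cloud/pubsub');

// `topicName` is a placeholder for whichever topic you publish to.
let pubsub = new PubSub();

const initPublisher = () => {
  // Recreate the client so subsequent publishes use a fresh gRPC channel.
  pubsub = new PubSub();
};

const publish = async (data) => {
  const payload = Buffer.from(JSON.stringify(data));
  try {
    return await pubsub.topic(topicName).publish(payload);
  } catch (err) {
    if (err.code === 4) { // DEADLINE_EXCEEDED
      initPublisher();
      return pubsub.topic(topicName).publish(payload);
    }
    throw err;
  }
};
```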
```
SMP Debian 4.9.144-3.1 (2019-02-19) x86_64
~$ nvm --version
0.34.0
~$ npm --version
6.7.0
~$ yarn --version
1.19.1
~$ node --version
v11.15.0
```
How many messages would you approximate are in the queue?
The queues are normally empty, some messages take a few hundred seconds to process but most happen within a second or so. Sometimes the subscribers have maybe 10 messages backed up. These are not busy queues.
What kind of flow control settings do you have configured?
I have two types of subscriber. The first has flowControl: { maxMessages: 1, allowExcessMessages: false }; the second doesn't have any flow control settings. They both suffer from the problems described.
What is your ack deadline set to?
5 minutes for both.
How many subscription instances are you creating?
Each topic has one subscription instance, and each subscription has one nodejs consumer. I have a hundred or so in total.
How long on average does it take to ack a message?
It varies greatly. The first type (with flow control settings) can take a long time to process and then ack a message - over 100 seconds sometimes - but usually it's around 1 second. Messages for the second type (with no flow control) are ack'd immediately, as soon as they arrive.
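For concreteness, the two subscriber configurations described here might look roughly like the sketch below. This is my own illustration with hypothetical subscription names; ackDeadline is expressed in seconds in the Node client, so 5 minutes becomes 300:

```js
const {PubSub} = require('@google-cloud/pubsub');
const pubsub = new PubSub();

// Type 1: slow workers - one message at a time, 5 minute ack deadline.
const slowSub = pubsub.subscription('slow-worker-sub', {
  ackDeadline: 300, // seconds
  flowControl: {maxMessages: 1, allowExcessMessages: false},
});
slowSub.on('message', msg => {
  // ...potentially 100+ seconds of work, then:
  msg.ack();
});

// Type 2: fast workers - no flow control overrides, ack on arrival.
const fastSub = pubsub.subscription('fast-worker-sub', {ackDeadline: 300});
fastSub.on('message', msg => msg.ack());
```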
@pworkpop pinning the version of google-gax to 1.1.4 didn't work in my case. The key management api blew up:
```
nodeapp    at new KeyManagementServiceClient (node_modules/@google-cloud/kms/src/v1/key_management_service_client.js:130:28)
nodeapp    at GrpcClient.loadProto (node_modules/google-gax/build/src/grpc.js:145:21)
nodeapp    at GrpcClient.loadFromProto (node_modules/google-gax/build/src/grpc.js:118:44)
nodeapp    at Object.loadSync (node_modules/@grpc/proto-loader/build/src/index.js:230:27)
nodeapp    at Root.loadSync (node_modules/protobufjs/src/root.js:235:17)
nodeapp    at Root.load (node_modules/protobufjs/src/root.js:192:44)
nodeapp  TypeError: Cannot read property 'length' of undefined
```
Same error with google-logging and a bunch of others. I see this in the logs:
nodeapp warning Resolution field "google-gax@1.1.4" is incompatible with requested version "google-gax@^1.6.3"
... however, I think I have a working app now by downgrading a whole bunch of stuff. Don't know what the magic change was in the end but I'm running the following:
"@google-cloud/kms": "1.3.2",
"@google-cloud/logging": "5.2.2",
"@google-cloud/logging-bunyan": "^1.2.3",
"@google-cloud/pubsub": "0.29.1",
"@google-cloud/storage": "3.3.0",
@npomfret I think "@google-cloud/pubsub": "0.29.1" has done the trick. I have had the same issue, and 0.29.1 is the only version that works for me.
Rolling back to 0.29.1 on its own was definitely not enough in my case. I had to roll back a lot of the other Google dependencies before it started working. I will gradually start to roll forward to see if I can find the culprit.
@pwrkpop @npomfret @xoraingroup, rather than rolling back to 0.29.1 of @google-cloud/pubsub, I recommend one of the following workarounds:

1. Using grpc, rather than @grpc/grpc-js: inside your project which uses @google-cloud/pubsub, add the dependency grpc (this is the old gRPC transport layer) and use the following code when creating your library:

```js
const {PubSub} = require('@google-cloud/pubsub');
const grpc = require('grpc');
const pubsub = new PubSub({grpc});
```

:point_up: this same approach can be used for other libraries that use gRPC, e.g.,

```js
const grpc = require('grpc');
const {Logging} = require('@google-cloud/logging');
const logger = new Logging({grpc});
```

2. Using the workaround recommended by @Redgwell:

```js
const initSubscriber = () => {
  const pubsub = new PubSub();
  const subscription = pubsub.subscription(topic, options);
  subscription.on('message', handler.handleMessage);
  subscription.on('error', e => {
    initSubscriber();
  });
};

initSubscriber();
```

We potentially have a reproduction of the issue described in this thread (thanks @Redgwell for pointing us in the right direction), and will hopefully have a fix out soon that makes either of these workarounds unnecessary.
I will try the first workaround (using the old gRPC transport layer) and will report results soon.
I have not seen any errors using this method for more than 6 hours:

```js
const {PubSub} = require('@google-cloud/pubsub');
const grpc = require('grpc');
const pubsub = new PubSub({grpc});
```
@bcoe It seems your trick worked. I have been running some tests for the last 24 hours and no such issue has appeared. When can we expect a permanent fix rather than the above workaround?
google-gax: 1.1.4 and @grpc/grpc-js: 0.4.3 seem to be stable for us - we have had those in production since May.
@npomfret sorry about that, we only use pubsub (no kms & logging libs). "@google-cloud/pubsub": "0.29.1" uses built-in grpc, as @grpc/grpc-js was only added in "@google-cloud/pubsub": "0.30.0", so bcoe's workaround is effectively the same - use the built-in grpc. I sure wish the @grpc/grpc-js devs would write more tests before pushing changes to prod.
> @pwrkpop @npomfret @xoraingroup, rather than rolling back to 0.29.1 of @google-cloud/pubsub, I recommend one of the following workarounds: [...] We potentially have a reproduction of the issue described in this thread [...]
Can't find anything on creating PubSub with options including grpc? I use projectId, and tried to add grpc, but TypeScript rejects this.
A peculiar regularity shows up here: every hour it seems that pubsub stops for a period of approximately 5 minutes. I hadn't seen regularity like this until today. The Node app runs as normal; other types of events show that. Running the latest version of the Node pubsub library.
@bcoe I wanted to confirm the workaround, so I did a quick scan of the source. Neither the PubSub constructor nor the google_auth_library_1.GoogleAuth constructor inside of it seems to consume the grpc object that's passed into them, so I'm not sure how this would work. This applies to version 0.30.3 (the version our project is currently using) and the latest, 1.1.2.
My bad, I see it now. It gets passed to GrpcClient in google-gax eventually!
@gberth (as you noticed, @RaptDept) the ClientConfig accepted by new PubSub extends gax.GrpcClientOptions, which is where the grpc parameter is defined:

```ts
export interface ClientConfig extends gax.GrpcClientOptions {
  apiEndpoint?: string;
  servicePath?: string;
  port?: string | number;
  sslCreds?: ChannelCredentials;
}
```

```ts
export interface GrpcClientOptions extends GoogleAuthOptions {
  auth?: GoogleAuth;
  promise?: PromiseConstructor;
  grpc?: GrpcModule;
}
```
I've tested setting grpc and can confirm it toggles the transport layer.
I've downgraded this to p2, not because it's not top of mind, but because the grpc workaround seems to be working for most folks. We're continuing to dig into things on our end, and are attempting to figure out the root causes of the issues we've seen with @grpc/grpc-js.
Have issues been logged in all other client libraries that depend on grpc-js that are affected, for tracking purposes, and to ensure all users can find this bug if they're searching online for help?
@bcoe Why not release a new version using the grpc workaround? Every developer upgrading @google-cloud/pubsub will encounter this issue. This issue is hard to catch in dev environments, as you have to wait one hour for the error to be triggered, so we can assume that at least some of them will end up pushing broken code to production.
@MatthieuLemoine @MichaelMarkieta I have been running a PubSub consumer for 4d15h now, on Kubernetes Engine, without a single issue:

```
NAME                          READY   STATUS    RESTARTS   AGE
good-reader-8f5fbb755-jbf28   1/1     Running   0          4d15h
```

This is an issue hitting a percentage of our users, but it is not hitting 100% of library users, and we are continuing to attempt to find the actual root cause. This is why we haven't opted to switch the grpc dependency.
> This is an issue hitting a percentage of our users, but is not hitting 100% of library users

This comment makes no sense from an optics standpoint, and I totally agree with @MatthieuLemoine: release a proper fix for this, or one with this 'workaround' built in. Asking customers to change their production code with some speculative 'fix' is irresponsible.
What happens when this is actually fixed and this ends up causing more problems later?
The updated documentation does not even mention what kind of workloads would be better off using the native transport or not, making the mere suggestion of using it even more confusing and potentially bug-prone.
This has been a workaround for us too, no issues in the last 24 hours. I'll keep monitoring...

```js
const {PubSub} = require('@google-cloud/pubsub');
const grpc = require('grpc');
const pubsub = new PubSub({grpc});
```
> Asking customers to change their production code with some speculative 'fix' is irresponsible
We're very much trying to avoid this; I realize how frustrating the string of patches to @grpc/grpc-js was last week. Which is why we've taken a step back and asked folks to opt for the grpc library instead, since we've seen this consistently address the issues that people are experiencing.

Our libraries were designed to allow grpc as an alternative to @grpc/grpc-js, specifically in case a situation like this arose (where we saw inconsistencies between the two libraries impacting users).

Now, even though we are advising that folks running into immediate issues switch to grpc, we are continuing to try to ascertain a consistent reproduction of the issue affecting users. What has made this difficult is that we see the default configuration (with @grpc/grpc-js) running effectively for many people -- I've had a cluster running for 5 days now without the behavior reported, and we haven't been able to recreate the issue consistently with various attempts at stress testing in various environments.

Rather than continuing to float patches to @grpc/grpc-js speculatively (which I agree is irresponsible), we are taking the following approach: folks who can reproduce the failure are running with GRPC_TRACE=all and GRPC_VERBOSITY=DEBUG, which is allowing us to deliver to the gRPC team the info they need to figure out what the heck is happening to some people; in the meantime, we are advising folks who are blocked to use grpc.
Have run with grpc since 10:30 AM today - approx. 400' msgs read. No errors, no hangups. For information, messages like the one below (we had approx. 200 of them during 46 hours) have also disappeared:

```
(node:16) Error: Failed to add metadata entry A...: Mon, 14 Oct 2019 11:40:19 GMT. Metadata key "a..." contains illegal characters
```
@gberth Thanks for letting us know! Just to give you some details: the metadata warnings are caused not by @grpc/grpc-js but by some (unknown) bug in the Node.js http2 implementation. The bug is tracked at https://github.com/nodejs/node/issues/28632, but no useful debug info is there yet. grpc uses its own http2 stack, so the Node.js http2 module bug does not affect it - that's why those messages disappeared.
Just adding some observations in case it helps. We have two services, both running pubsub 1.0.0. One is older and has no issues; one is newer and has the issue. I have not tried to roll back the new service; we're using the workaround for now.

Here are the differences when running "npm ls @grpc/grpc-js":

old:
```
├─┬ @google-cloud/logging@4.0.1
│ └─┬ google-gax@0.20.0
│   └── @grpc/grpc-js@0.2.0
├─┬ @google-cloud/pubsub@1.0.0
│ ├── @grpc/grpc-js@0.5.4 deduped
│ └─┬ google-gax@1.6.0
│   └── @grpc/grpc-js@0.5.4 deduped
└── @grpc/grpc-js@0.5.4
```

new:
```
└─┬ @google-cloud/pubsub@1.0.0
  ├── @grpc/grpc-js@0.5.4
  └─┬ google-gax@1.6.4
    └── @grpc/grpc-js@0.6.6
```
@bcoe
We run on GKE using credentials passed to PubSub and GoogleAuth (not the GKE service account). Most services run 0.30.1 with google-gax: 1.1.4 & @grpc/grpc-js: 0.4.3 and have had no issues since 0.30.1 was released in June. One service was updated to google-gax: 1.6.4 & @grpc/grpc-js: 0.6.6 and started observing the issues last week.

We also have this solution for token auto-refresh:

```js
this._client = new PubSub({ ...queueConfig, auth: new GoogleAuth(queueConfig) });
```

The issue doesn't start exactly one hour after startup for us, so I suspect it is related to connectivity to the pubsub servers and how those are rolled in and out of service for updates. Re-creating the PubSub client (and thus the underlying channel) definitely solves this, so I suggest adding this fix to pubsub itself. We occasionally see deadline exceeded messages in the old services running @grpc/grpc-js: 0.4.3, but those seem to recover all right.

For people who see this happen every hour, it may be related to token expiry - being unable to refresh the token once it expires - we used to have unauthorised errors without passing that auth parameter.
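For completeness, the auth option used above presumably comes from google-auth-library's GoogleAuth, passed through the GrpcClientOptions interface quoted earlier in the thread; a minimal sketch, where queueConfig is a stand-in for whatever projectId/credentials object is already in use:

```js
const {PubSub} = require('@google-cloud/pubsub');
const {GoogleAuth} = require('google-auth-library');

// Stand-in for the existing configuration object (projectId, credentials, ...).
const queueConfig = {projectId: 'my-project'};

// Pass an explicit GoogleAuth instance so token refresh is handled by the
// auth library rather than implicitly by the PubSub client.
const client = new PubSub({...queueConfig, auth: new GoogleAuth(queueConfig)});
```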
I'd like to :+1: this going back to a P1. Providing a workaround isn't an acceptable response. Indeed, we have implemented the workaround and rolled it out to prod, only to find that while it successfully mitigates the lost pub/sub connection, it also introduces a memory leak that requires periodic restarts of our k8s pods regardless.
As mentioned on https://github.com/grpc/grpc-node/issues/1064, this is also an issue that surfaces when using scheduled Firebase Functions, where you do not have access to the grpc config value. When the connection bridge drops, the entire suite of hosted Firebase functions also begins to fail. The only workaround is to not have the scheduled function deployed, which is not an acceptable solution.
```
Error: No connection established
    at Http2CallStream.call.on (/srv/node_modules/@grpc/grpc-js/build/src/call.js:68:41)
    at emitOne (events.js:121:20)
    at Http2CallStream.emit (events.js:211:7)
    at process.nextTick (/srv/node_modules/@grpc/grpc-js/build/src/call-stream.js:75:22)
    at _combinedTickCallback (internal/process/next_tick.js:132:7)
    at process._tickDomainCallback (internal/process/next_tick.js:219:9)
```
Has there been any progress? If not, and the workaround is the official way forward, are all the docs updated?
Every time there has been a new version of @google-cloud/pubsub we've updated, but since grpc-js we've had increased latency in sending and/or receiving (dependent on the version used) or memory leaks. In the past weeks we've been pinning versions to work around various bugs.
We've made the decision to go back to "@google-cloud/pubsub": "0.29.1", as this fixes the problem for us with no code changes.
Could we have a newer pubsub release which enables grpc-js only as an option, as it seems to be the most unstable part of this ecosystem?
I just wanted to give an update before the weekend: we do have a version of @grpc/grpc-js (0.6.9) that all signs indicate is stable.
The reason I was holding off on this update was that we were doing more stress testing on the system on which we had managed to reproduce this issue.
If anyone is still bumping into issues on 0.6.9, please:

- open a new issue on PubSub, so that we can debug your issue in isolation (just in case there's more than one thing being debugged in this thread);
- run your environment with the following environment variables set: GRPC_TRACE=all and GRPC_VERBOSITY=DEBUG.

So far, with debug information, @murgatroid99 has been able to address issues almost immediately.
If you do not want to share your logs publicly (understandably), you can open an issue through our issue tracker and also email me (bencoe [at] google.com) so that I can make sure it's escalated immediately: https://issuetracker.google.com/savedsearches/559741
Now, if folks start using @grpc/grpc-js@0.6.9 and it becomes apparent that it is not in fact stable, I will take steps to move us back to grpc immediately in @google-cloud/pubsub (until such time that we are confident).
@bcoe I don't fully understand the fix.
In order to get my system working, a rollback of pubsub to 29.1 wasn't enough. I had to roll back a bunch of other google projects I'm using (logging, storage, kms etc) also. I don't know what combination of rollbacks fixed the problem and so I'm worried about rolling forward now.
What exactly is the proposed fix please?
Which projects do I need to apply the fix to?
I've done a fresh install of 1.1.2 (which is now pulling @grpc/grpc-js@0.6.9) in a project and can confirm it looks stable. I will keep it under observation for the next 24 hours and upgrade the remaining projects based on the results.
After a few more hours of testing, the behaviour has improved but I'm still seeing the odd message being lost and nacked back into the queue, which results in latency spikes. This does not happen with 0.29.1.
This means 0.29.1 is still the latest version that is giving us consistently predictable latency, and 1.1.2 + @grpc/grpc-js@0.6.9 is still not performing as well as 0.29.1.
👋 as mentioned Friday, we're testing early this week with @grpc/grpc-js@0.6.9 as a stable release candidate for @grpc/grpc-js.

We ask that folks upgrade to @grpc/grpc-js@0.6.9, and let us know if you have any trouble doing so. If you continue to run into issues with this new version of the dependency, I ask that we:

- open a new issue for the problem (@pmcnr-hx, I have already done so for the memory issue you've raised);
- run with GRPC_TRACE=all and GRPC_VERBOSITY=DEBUG.

You can open an issue on the issue tracker here to deliver the logs, if there's anything you wish to keep private: https://issuetracker.google.com/savedsearches/559741

You can also send an email to bencoe [at] google.com, so that he can make sure things get escalated appropriately.

If it becomes obvious that things are stable, we will start working on a more significant rollback early this week.
Environment details

- Node.js version: v12.7.0
- npm version: 6.10.0
- @google-cloud/pubsub version: "^1.0.0"

Error:

```
insertId: "gnr3q1fz7eerd"
jsonPayload: {
  level: "error"
  message: "unhandledRejection"
  originalError: {
    ackIds: [1]
    code: 4
    details: "Deadline exceeded"
  }
}
```
After receiving this error, the app does not receive messages anymore and we have to exit the application to recreate the kubernetes pod.
Any help would be appreciated!
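Not a fix, but a minimal sketch of the stop-gap described in the report above (crash on the fatal error so Kubernetes recreates the pod); the code-4 check is an assumption based on the logged error:

```js
// `subscription` is the object returned by pubsub.subscription(...), as in the
// earlier examples in this thread.
subscription.on('error', err => {
  console.error('PubSub subscriber error', err);
  if (err.code === 4) process.exit(1); // let Kubernetes restart the pod
});

// The report above shows the failure surfacing as an unhandledRejection
// (ack deadline exceeded), so catch that path as well.
process.on('unhandledRejection', err => {
  console.error('Unhandled rejection', err);
  if (err && err.code === 4) process.exit(1);
});
```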