@callmehiphop is there anything in particular?
I'm running a Node app (some Node 11 version) on a basic GCE instance (f1-micro). It's been running fine for a long time, and at 16:35:28 Oct 09 (UTC) I started getting a variety of pubsub errors (listed above). The impact is that most of my subscribers aren't receiving any messages. Sending seems to be OK; I can see a build-up of messages via the console. I've had a few messages consumed after a restart, but it fails pretty quickly after that (within a few seconds).
@npomfret's experience would line up with my attempt to replicate the issue in GAE. However, since GAE instances restart or new instances come online when the service is unhealthy, this deployment method is impacted slightly less.
@npomfret could you tell me a little more about your subscribers, like
Note: I have a playground environment that I can tune using Locust to simulate load.
How many messages would you approximate are in the queue? This doesn't seem to matter; errors are present at both ~50 messages per second and ~500 messages per second.
What kind of flow control settings do you have configured? I am loading GAE Standard with Locust to have an appropriate number of instances to pull messages at a rate that matches the publisher.
What is your ack deadline set to? Defaults
How many subscription instances are you creating? Depends on the load generated on the publisher service (for 500 mps to PubSub, I have 5 GAE Standard F4 instances in the puller service)
How long on average does it take to ack a message? < 1 second, until it doesn't
Here are my observations with the latest libraries, in case they help anybody reproduce. The rest of the stack is the same as I described in other messages in this thread.
We have a regional k8s cluster with 3 pods (if you're not familiar with k8s, these are basically three Node.js processes running the same code). Each pod receives messages from the same subscription; about 10 messages arrive almost simultaneously at the beginning of each minute. Before the latest changes and this issue, i.e. when things were working fine, arriving messages were load-balanced among the three pods perpetually, with no errors observed; each message was randomly processed by one pod/instance.
Here is the reconstruction of events from the log, similar to what we have observed the past few days:
```
    at MessageStream._waitForClientReady (/mgmt/node_modules/@google-cloud/pubsub/build/src/message-stream.js:318:19)
  code: 4,
  details: 'Failed to connect before the deadline',
  metadata: Metadata { internalRepr: Map {}, options: {} } }
```
Around the same time, all other subscriptions report errors in the same pod. No message is ever received again by this pod.
```
    at MessageStream._waitForClientReady (/mgmt/node_modules/@google-cloud/pubsub/build/src/message-stream.js:318:19)
  code: 4,
  details: 'Failed to connect before the deadline',
  metadata: Metadata { internalRepr: Map {}, options: {} } }
```
Around the same time, many other subscriptions report errors in the same pod.
This whole story is consistent with what we have seen over the past few days.
@gae123 there's a chance we have a reproduction; we're working towards testing on a few @grpc/grpc-js versions, so that we can better isolate where the breakage occurs. Will keep you updated.
I've hit this issue too - for now, I'm using this workaround:

```js
const initSubscriber = () => {
  const pubsub = new PubSub();
  const subscription = pubsub.subscription(topic, options);
  subscription.on('message', handler.handleMessage);
  subscription.on('error', e => {
    console.error('Subscriber error handler caught exception. Restarting subscription.', e);
    initSubscriber();
  });
};

initSubscriber();
```
@Redgwell what environment are you running PubSub in (Kubernetes Engine? App Engine?).
@bcoe Running in GKE.
"@google-cloud/pubsub": "^1.1.0",
@Redgwell a couple more follow up questions:
@bcoe
I did a quick Google search and landed here. Locally, I was able to replicate the same error message and behaviour (that is, no longer receiving messages after the error) by disconnecting my internet connection, waiting for the deadline exceeded error, and then reconnecting.
To make this a little faster to replicate, I set the streamingOptions timeout value in the pubsub options object to have a shorter streaming connection timeout:

```js
streamingOptions: {
  timeout: 5000
}
```
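For anyone trying to reproduce the same way, this is roughly where I believe that snippet sits: in the subscriber options passed to pubsub.subscription(). This is my own sketch, not the commenter's code; topic and handler are the same placeholders used in the workaround above, and 5000 ms is only meant to force the failure quickly while testing:

```js
const {PubSub} = require('@google-cloud/pubsub');

const pubsub = new PubSub();

// `topic` and `handler` are assumed to be defined elsewhere, as in the workaround above.
const options = {
  streamingOptions: {
    // Artificially short streaming-pull timeout (ms) so the
    // "Failed to connect before the deadline" path is hit within seconds.
    timeout: 5000,
  },
};

const subscription = pubsub.subscription(topic, options);
subscription.on('message', handler.handleMessage);
```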
One thing that might be useful for fixing this issue - if you look at my workaround, inside the initSubscriber function, I create a new PubSub() object. I found that if I didn't do this inside the initSubscriber function, and instead did it just once when the app starts, the application just kept getting deadline exceeded messages over and over again, even once I'd reconnected to the internet.
That makes me think there is some state being set in the PubSub object that puts it in a faulted state until it's recreated. Sorry, I don't have the time right now to dig into that assertion any further.
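For anyone copying that workaround, here is a variant sketch of my own (not part of the original comment) that also tears down the previous subscription before recreating the client. It assumes Subscription#close() and the standard EventEmitter removeAllListeners() are available, and may help avoid piling up listeners across restarts:

```js
const {PubSub} = require('@google-cloud/pubsub');

// `topic`, `options` and `handler` are assumed to be defined elsewhere,
// exactly as in the workaround above.
let subscription;

const initSubscriber = () => {
  // Recreate the client as well as the subscription: as noted above, reusing
  // the old PubSub instance appears to leave it in a faulted state.
  const pubsub = new PubSub();

  if (subscription) {
    // Best-effort teardown of the previous stream before replacing it.
    subscription.removeAllListeners();
    subscription.close().catch(() => { /* stream already broken; ignore */ });
  }

  subscription = pubsub.subscription(topic, options);
  subscription.on('message', handler.handleMessage);
  subscription.on('error', e => {
    console.error('Subscriber error handler caught exception. Restarting subscription.', e);
    initSubscriber();
  });
};

initSubscriber();
```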
@Redgwell thank you this is very thorough, and gives us a good starting point for debugging.
@Redgwell, could I bother you to test with @grpc/grpc-js@0.6.8? If you delete your node_modules and package-lock.json and run npm install @google-cloud/pubsub, you should get it now (you can run npm ls post-install to confirm).
We believe we've potentially addressed the bug requiring you to resubscribe.
Okay, I grabbed the latest as you said - here's the pubsub dependency tree from npm ls:
├─┬ @google-cloud/pubsub@1.1.2
│ ├─┬ @google-cloud/paginator@2.0.1
│ │ ├── arrify@2.0.1 deduped
│ │ └── extend@3.0.2 deduped
│ ├── @google-cloud/precise-date@1.0.1
│ ├── @google-cloud/projectify@1.0.1
│ ├── @google-cloud/promisify@1.0.2
│ ├── @sindresorhus/is@1.2.0
│ ├─┬ @types/duplexify@3.6.0
│ │ └── @types/node@12.7.12
│ ├── @types/long@4.0.0
│ ├── arrify@2.0.1
│ ├── async-each@1.0.3
│ ├── extend@3.0.2
│ ├─┬ google-auth-library@5.4.1
│ │ ├── arrify@2.0.1 deduped
│ │ ├── base64-js@1.3.1
│ │ ├── fast-text-encoding@1.0.0
│ │ ├─┬ gaxios@2.0.1
│ │ │ ├── abort-controller@3.0.0 deduped
│ │ │ ├── extend@3.0.2 deduped
│ │ │ ├─┬ https-proxy-agent@2.2.2
│ │ │ │ ├─┬ agent-base@4.3.0
│ │ │ │ │ └─┬ es6-promisify@5.0.0
│ │ │ │ │ └── es6-promise@4.2.8
│ │ │ │ └── debug@3.2.6 deduped
│ │ │ └── node-fetch@2.6.0 deduped
│ │ ├─┬ gcp-metadata@3.2.0
│ │ │ ├── gaxios@2.0.1 deduped
│ │ │ └─┬ json-bigint@0.3.0
│ │ │ └── bignumber.js@7.2.1
│ │ ├─┬ gtoken@4.1.0
│ │ │ ├── gaxios@2.0.1 deduped
│ │ │ ├─┬ google-p12-pem@2.0.2
│ │ │ │ └── node-forge@0.9.1
│ │ │ ├── jws@3.2.2 deduped
│ │ │ └── mime@2.4.4
│ │ ├─┬ jws@3.2.2
│ │ │ ├─┬ jwa@1.4.1
│ │ │ │ ├── buffer-equal-constant-time@1.0.1
│ │ │ │ ├─┬ ecdsa-sig-formatter@1.0.11
│ │ │ │ │ └── safe-buffer@5.2.0 deduped
│ │ │ │ └── safe-buffer@5.2.0 deduped
│ │ │ └── safe-buffer@5.2.0 deduped
│ │ └─┬ lru-cache@5.1.1
│ │ └── yallist@3.1.1
│ ├─┬ google-gax@1.7.4
│ │ ├─┬ @grpc/grpc-js@0.6.8
│ │ │ └── semver@6.3.0 deduped
│ │ ├─┬ @grpc/proto-loader@0.5.2
│ │ │ ├── lodash.camelcase@4.3.0
│ │ │ └── protobufjs@6.8.8 deduped
│ │ ├─┬ abort-controller@3.0.0
│ │ │ └── event-target-shim@5.0.1
│ │ ├─┬ duplexify@3.7.1
│ │ │ ├─┬ end-of-stream@1.4.4
│ │ │ │ └── once@1.4.0 deduped
│ │ │ ├── inherits@2.0.4 deduped
│ │ │ ├── readable-stream@2.3.6 deduped
│ │ │ └── stream-shift@1.0.0
│ │ ├── google-auth-library@5.4.1 deduped
│ │ ├── is-stream-ended@0.1.4 deduped
│ │ ├── lodash.at@4.6.0
│ │ ├── lodash.has@4.5.2
│ │ ├── node-fetch@2.6.0
│ │ ├── protobufjs@6.8.8 deduped
│ │ ├─┬ retry-request@4.1.1
│ │ │ ├─┬ debug@4.1.1
│ │ │ │ └── ms@2.1.2 deduped
│ │ │ └─┬ through2@3.0.1
│ │ │ └── readable-stream@2.3.6 deduped
│ │ ├── semver@6.3.0 deduped
│ │ └── walkdir@0.4.1
│ ├── is-stream-ended@0.1.4
│ ├── lodash.snakecase@4.1.1
│ ├── p-defer@3.0.0
│ └─┬ protobufjs@6.8.8
│ ├── @protobufjs/aspromise@1.1.2
│ ├── @protobufjs/base64@1.1.2
│ ├── @protobufjs/codegen@2.0.4
│ ├── @protobufjs/eventemitter@1.1.0
│ ├─┬ @protobufjs/fetch@1.1.0
│ │ ├── @protobufjs/aspromise@1.1.2 deduped
│ │ └── @protobufjs/inquire@1.1.0 deduped
│ ├── @protobufjs/float@1.0.2
│ ├── @protobufjs/inquire@1.1.0
│ ├── @protobufjs/path@1.1.2
│ ├── @protobufjs/pool@1.1.0
│ ├── @protobufjs/utf8@1.1.0
│ ├── @types/long@4.0.0 deduped
│ ├── @types/node@10.14.21
│ └── long@4.0.0
I tried my test loop again - with this code:
```js
const initSubscriber = () => {
  const pubsub = new PubSub();
  const subscription = pubsub.subscription(topic, options);
  subscription.on('message', handler.handleMessage);
  subscription.on('error', e => {
    console.error('Subscriber error handler caught exception. Restarting subscription.', e);
    // initSubscriber(); <------ commented this out, because I'm expecting that the fix means we don't need to do this anymore, right?
  });
};

initSubscriber();
```
So with that running, I pushed some messages, saw that the app was handling them, then disconnected wifi and waited for the deadline exceeded error. I reconnected wifi, pushed some more messages, and waited for about 2 minutes. The app didn't process the messages. After restarting the app, it immediately processed them.
So unfortunately, it doesn't seem that this has fixed things, at least for my test loop of disconnecting the internet connection.
We found one service that happily didn't observe these failures and we are now downgrading our affected services to this version until a confirmed working solution is provided:
@google-cloud/pubsub: 0.30.1
google-gax: 1.1.4
@grpc/grpc-js: ^0.4.3
You have to pin the google-gax version, as it would otherwise pick up the latest 1.x.x (check your lock file):

```
"resolutions": {
  "google-gax": "1.1.4"
}
```
We are also adding the initSubscriber workaround suggested by @Redgwell - it does seem to help. We're also adding an initPublisher to recreate the PubSub client in case publish fails with code 4.
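For reference, a rough sketch of what such an initPublisher could look like. This is my interpretation of the description above, not the commenter's actual code; topicName is a placeholder, and the retry is limited to gRPC code 4 (DEADLINE_EXCEEDED):

```js
const {PubSub} = require('@google-cloud/pubsub');

// `topicName` is a placeholder for whichever topic you publish to.
let pubsub = new PubSub();

const initPublisher = () => {
  // Recreate the client so subsequent publishes use a fresh gRPC channel.
  pubsub = new PubSub();
};

const publish = async (data) => {
  const payload = Buffer.from(JSON.stringify(data));
  try {
    return await pubsub.topic(topicName).publish(payload);
  } catch (err) {
    if (err.code === 4) { // DEADLINE_EXCEEDED
      initPublisher();
      return pubsub.topic(topicName).publish(payload);
    }
    throw err;
  }
};
```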
```
SMP Debian 4.9.144-3.1 (2019-02-19) x86_64
~$ nvm --version
0.34.0
~$ npm --version
6.7.0
~$ yarn --version
1.19.1
~$ node --version
v11.15.0
```
How many messages would you approximate are in the queue?
The queues are normally empty, some messages take a few hundred seconds to process but most happen within a second or so. Sometimes the subscribers have maybe 10 messages backed up. These are not busy queues.
What kind of flow control settings do you have configured?
I have two types of subscriber. The first has flowControl: { maxMessages: 1, allowExcessMessages: false }; the second doesn't have any flow control settings. They both suffer from the problems described.
What is your ack deadline set to?
5 minutes for both.
How many subscription instances are you creating?
Each topic has one subscription instance, and each subscription has one nodejs consumer. I have a hundred or so in total.
How long on average does it take to ack a message?
It varies greatly. The first type (with flow control settings) can take a long time to process and then ack a message - over 100 seconds sometimes - but usually it's around 1 second. Messages for the second type (with no flow control) are ack'd immediately, as soon as they arrive.
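For concreteness, the two subscriber configurations described here might look roughly like the sketch below. This is my own illustration with hypothetical subscription names; ackDeadline is expressed in seconds in the Node client, so 5 minutes becomes 300:

```js
const {PubSub} = require('@google-cloud/pubsub');
const pubsub = new PubSub();

// Type 1: slow workers - one message at a time, 5 minute ack deadline.
const slowSub = pubsub.subscription('slow-worker-sub', {
  ackDeadline: 300, // seconds
  flowControl: {maxMessages: 1, allowExcessMessages: false},
});
slowSub.on('message', msg => {
  // ...potentially 100+ seconds of work, then:
  msg.ack();
});

// Type 2: fast workers - no flow control overrides, ack on arrival.
const fastSub = pubsub.subscription('fast-worker-sub', {ackDeadline: 300});
fastSub.on('message', msg => msg.ack());
```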
@pworkpop pinning the version of google-gax to 1.1.4 didn't work in my case. The key management api blew up:
```
nodeapp    at new KeyManagementServiceClient (node_modules/@google-cloud/kms/src/v1/key_management_service_client.js:130:28)
nodeapp    at GrpcClient.loadProto (node_modules/google-gax/build/src/grpc.js:145:21)
nodeapp    at GrpcClient.loadFromProto (node_modules/google-gax/build/src/grpc.js:118:44)
nodeapp    at Object.loadSync (node_modules/@grpc/proto-loader/build/src/index.js:230:27)
nodeapp    at Root.loadSync (node_modules/protobufjs/src/root.js:235:17)
nodeapp    at Root.load (node_modules/protobufjs/src/root.js:192:44)
nodeapp  TypeError: Cannot read property 'length' of undefined
```
Same error with google-logging and a bunch of others. I see this in the logs:
nodeapp warning Resolution field "google-gax@1.1.4" is incompatible with requested version "google-gax@^1.6.3"
... however, I think I have a working app now by downgrading a whole bunch of stuff. Don't know what the magic change was in the end but I'm running the following:
"@google-cloud/kms": "1.3.2",
"@google-cloud/logging": "5.2.2",
"@google-cloud/logging-bunyan": "^1.2.3",
"@google-cloud/pubsub": "0.29.1",
"@google-cloud/storage": "3.3.0",
@npomfret I think "@google-cloud/pubsub": "0.29.1" has done the trick. I have had the same issue, and 0.29.1 is the only version that works for me.
Rolling back to 0.29.1 on its own was definitely not enough in my case. I had to roll back a lot of the other Google dependencies before it started working. I will gradually start to roll forward to see if I can find the culprit.
@pwrkpop @npomfret @xoraingroup, rather than rolling back to 0.29.1 of @google-cloud/pubsub, I recommend one of the following workarounds:

1. Using grpc, rather than @grpc/grpc-js: inside your project which uses @google-cloud/pubsub, add the dependency grpc (this is the old gRPC transport layer) and use the following code when creating your library:

```js
const {PubSub} = require('@google-cloud/pubsub');
const grpc = require('grpc');
const pubsub = new PubSub({grpc});
```

:point_up: this same approach can be used for other libraries that use gRPC, e.g.,

```js
const grpc = require('grpc');
const {Logging} = require('@google-cloud/logging');
const logger = new Logging({grpc});
```

2. Using the workaround recommended by @Redgwell:

```js
const initSubscriber = () => {
  const pubsub = new PubSub();
  const subscription = pubsub.subscription(topic, options);
  subscription.on('message', handler.handleMessage);
  subscription.on('error', e => {
    initSubscriber();
  });
};

initSubscriber();
```

We potentially have a reproduction of the issue described in this thread (thanks @Redgwell for pointing us in the right direction), and will hopefully have a fix out soon that makes either of these workarounds unnecessary.
I will try the first workaround (using the old gRPC transport layer) and will report results soon.
I have not seen any errors using this method for more than 6 hours:

```js
const {PubSub} = require('@google-cloud/pubsub');
const grpc = require('grpc');
const pubsub = new PubSub({grpc});
```
@bcoe It seems your trick worked. I have been running some tests for the last 24 hours and no such issue has appeared. When can we expect a permanent fix rather than the above workaround?
google-gax: 1.1.4 and @grpc/grpc-js: 0.4.3 seem to be stable for us - we have had those in production since May.
@npomfret sorry about that, we only use pubsub (no kms & logging libs). "@google-cloud/pubsub": "0.29.1" uses built-in grpc, as @grpc/grpc-js was only added in "@google-cloud/pubsub": "0.30.0", so bcoe's workaround is effectively the same - use the built-in grpc. I sure wish the @grpc/grpc-js devs would write more tests before pushing changes to prod.
> @pwrkpop @npomfret @xoraingroup, rather than rolling back to 0.29.1 of @google-cloud/pubsub, I recommend one of the following workarounds: [...] We potentially have a reproduction of the issue described in this thread [...]
Can't find anything on creating PubSub with options including grpc? I use projectId, and tried to add grpc, but TypeScript rejects this.
A peculiar regularity shows up here: every hour it seems that pubsub stops for a period of approximately 5 minutes. I hadn't seen regularity like this until today. The Node app runs as normal; other types of events show that. Running the latest version of the Node pubsub library.
@bcoe I wanted to confirm the workaround, so I did a quick scan of the source. Neither the PubSub constructor nor the google_auth_library_1.GoogleAuth constructor inside of it seems to consume the grpc object that's passed into them, so I'm not sure how this would work. This applies to version 0.30.3 (the version our project is currently using) and the latest, 1.1.2.
My bad, I see it now. It gets passed to GrpcClient in google-gax eventually!
@gberth (as you noticed, @RaptDept) the ClientConfig accepted by new PubSub extends gax.GrpcClientOptions, which is where the grpc parameter is defined:

```ts
export interface ClientConfig extends gax.GrpcClientOptions {
  apiEndpoint?: string;
  servicePath?: string;
  port?: string | number;
  sslCreds?: ChannelCredentials;
}
```

```ts
export interface GrpcClientOptions extends GoogleAuthOptions {
  auth?: GoogleAuth;
  promise?: PromiseConstructor;
  grpc?: GrpcModule;
}
```
I've tested setting grpc and can confirm it toggles the transport layer.
I've downgraded this to p2, not because it's not top of mind, but because the grpc workaround seems to be working for most folks. We're continuing to dig into things on our end, and are attempting to figure out the root causes of the issues we've seen with @grpc/grpc-js.
Have issues been logged in all other client libraries that depend on grpc-js that are affected, for tracking purposes, and to ensure all users can find this bug if they're searching online for help?
@bcoe Why not release a new version using the grpc workaround? Every developer upgrading @google-cloud/pubsub will encounter this issue. This issue is hard to catch in dev environments, as you have to wait one hour for the error to be triggered, so we can assume that at least some of them will end up pushing broken code to production.
@MatthieuLemoine @MichaelMarkieta I have been running a PubSub consumer for 4d15h now, on Kubernetes Engine, without a single issue:

```
NAME                          READY   STATUS    RESTARTS   AGE
good-reader-8f5fbb755-jbf28   1/1     Running   0          4d15h
```

This is an issue hitting a percentage of our users, but it is not hitting 100% of library users, and we are continuing to attempt to find the actual root cause. This is why we haven't opted to switch the grpc dependency.
> This is an issue hitting a percentage of our users, but is not hitting 100% of library users

This comment makes no sense from an optics standpoint, and I totally agree with @MatthieuLemoine: release a proper fix for this, or one with this 'workaround' built in. Asking customers to change their production code with some speculative 'fix' is irresponsible.
What happens when this is actually fixed and this ends up causing more problems later?
The updated documentation does not even mention what kind of workloads would be better off using the native transport or not, making the mere suggestion of using it even more confusing and potentially bug-prone.
This has been a workaround for us too, no issues in the last 24 hours. I'll keep monitoring...

```js
const {PubSub} = require('@google-cloud/pubsub');
const grpc = require('grpc');
const pubsub = new PubSub({grpc});
```
> Asking customers to change their production code with some speculative 'fix' is irresponsible
We're very much trying to avoid this; I realize how frustrating the string of patches to @grpc/grpc-js was last week. Which is why we've taken a step back and asked folks to opt for the grpc library instead, since we've seen this consistently address the issues that people are experiencing.

Our libraries were designed to allow grpc as an alternative to @grpc/grpc-js, specifically in case a situation like this arose (where we saw inconsistencies between the two libraries impacting users).

Now, even though we are advising that folks running into immediate issues switch to grpc, we are continuing to try to ascertain a consistent reproduction of the issue affecting users. What has made this difficult is that we see the default configuration (with @grpc/grpc-js) running effectively for many people -- I've had a cluster running for 5 days now without the behavior reported, and we haven't been able to recreate the issue consistently with various attempts at stress testing in various environments.

Rather than continuing to float patches to @grpc/grpc-js speculatively (which I agree is irresponsible), we are taking the following approach: folks who can reproduce the failure are running with GRPC_TRACE=all and GRPC_VERBOSITY=DEBUG, which is allowing us to deliver to the gRPC team the info they need to figure out what the heck is happening to some people; in the meantime, we are advising folks who are blocked to use grpc.
Have run with grpc since 10:30 AM today - approx. 400' msgs read. No errors, no hangups. For information, messages like the one below (we had approx. 200 of them during 46 hours) have also disappeared:

```
(node:16) Error: Failed to add metadata entry A...: Mon, 14 Oct 2019 11:40:19 GMT. Metadata key "a..." contains illegal characters
```
@gberth Thanks for letting us know! Just to give you some details: the metadata warnings are caused not by @grpc/grpc-js but by some (unknown) bug in the Node.js http2 implementation. The bug is tracked at https://github.com/nodejs/node/issues/28632, but no useful debug info is there yet. grpc uses its own http2 stack, so the Node.js http2 module bug does not affect it - that's why those messages disappeared.
Just adding some observations in case it helps. We have two services, both running pubsub 1.0.0. One is older and has no issues; one is newer and has the issue. I have not tried to roll back the new service; we're using the workaround for now.

Here are the differences when running "npm ls @grpc/grpc-js":

old:
```
├─┬ @google-cloud/logging@4.0.1
│ └─┬ google-gax@0.20.0
│   └── @grpc/grpc-js@0.2.0
├─┬ @google-cloud/pubsub@1.0.0
│ ├── @grpc/grpc-js@0.5.4 deduped
│ └─┬ google-gax@1.6.0
│   └── @grpc/grpc-js@0.5.4 deduped
└── @grpc/grpc-js@0.5.4
```

new:
```
└─┬ @google-cloud/pubsub@1.0.0
  ├── @grpc/grpc-js@0.5.4
  └─┬ google-gax@1.6.4
    └── @grpc/grpc-js@0.6.6
```
@bcoe
We run on GKE using credentials passed to PubSub and GoogleAuth (not the GKE service account). Most services run 0.30.1 with google-gax: 1.1.4 & @grpc/grpc-js: 0.4.3 and have had no issues since 0.30.1 was released in June. One service was updated to google-gax: 1.6.4 & @grpc/grpc-js: 0.6.6 and started observing the issues last week.

We also have this solution for token auto-refresh:

```js
this._client = new PubSub({ ...queueConfig, auth: new GoogleAuth(queueConfig) });
```

The issue doesn't start exactly one hour after startup for us, so I suspect it is related to connectivity to the pubsub servers and how those are rolled in and out of service for updates. Re-creating the PubSub client (and thus the underlying channel) definitely solves this, so I suggest adding this fix to pubsub itself. We occasionally see deadline exceeded messages in the old services running @grpc/grpc-js: 0.4.3, but those seem to recover all right.

For people who see this happen every hour, it may be related to token expiry - being unable to refresh the token once it expires - we used to have unauthorised errors without passing that auth parameter.
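For completeness, the auth option used above presumably comes from google-auth-library's GoogleAuth, passed through the GrpcClientOptions interface quoted earlier in the thread; a minimal sketch, where queueConfig is a stand-in for whatever projectId/credentials object is already in use:

```js
const {PubSub} = require('@google-cloud/pubsub');
const {GoogleAuth} = require('google-auth-library');

// Stand-in for the existing configuration object (projectId, credentials, ...).
const queueConfig = {projectId: 'my-project'};

// Pass an explicit GoogleAuth instance so token refresh is handled by the
// auth library rather than implicitly by the PubSub client.
const client = new PubSub({...queueConfig, auth: new GoogleAuth(queueConfig)});
```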
I'd like to :+1: this going back to a P1. Providing a workaround isn't an acceptable response. Indeed, we have implemented the workaround and rolled it out to prod, only to find that while it successfully mitigates the lost pub/sub connection, it also introduces a memory leak that requires periodic restarts of our k8s pods regardless.
As mentioned on https://github.com/grpc/grpc-node/issues/1064, this is also an issue that surfaces when using scheduled Firebase Functions, where you do not have access to the grpc config value. When the connection bridge drops, the entire suite of hosted Firebase functions also begins to fail. The only workaround is to not have the scheduled function deployed, which is not an acceptable solution.
```
Error: No connection established
    at Http2CallStream.call.on (/srv/node_modules/@grpc/grpc-js/build/src/call.js:68:41)
    at emitOne (events.js:121:20)
    at Http2CallStream.emit (events.js:211:7)
    at process.nextTick (/srv/node_modules/@grpc/grpc-js/build/src/call-stream.js:75:22)
    at _combinedTickCallback (internal/process/next_tick.js:132:7)
    at process._tickDomainCallback (internal/process/next_tick.js:219:9)
```
Has there been any progress? If not, and the workaround is the official way forward, are all the docs updated?
Every time there has been a new version of @google-cloud/pubsub we've updated, but since grpc-js we've had increased latency in sending and/or receiving (dependent on the version used) or memory leaks. In the past weeks we've been pinning versions to work around various bugs.
We've made the decision to go back to "@google-cloud/pubsub": "0.29.1", as this fixes the problem for us with no code changes.
Could we have a newer pubsub release which enables grpc-js only as an option, as it seems to be the most unstable part of this ecosystem?
I just wanted to give an update before the weekend: we do have a version of @grpc/grpc-js (0.6.9) that all signs indicate is stable.
The reason I was holding off on this update was that we were doing more stress testing on the system on which we had managed to reproduce this issue.
If anyone is still bumping into issues on 0.6.9, please:

- open a new issue on PubSub, so that we can debug your issue in isolation (just in case there's more than one thing being debugged in this thread);
- run your environment with the following environment variables set: GRPC_TRACE=all and GRPC_VERBOSITY=DEBUG.

So far, with debug information, @murgatroid99 has been able to address issues almost immediately.
If you do not want to share your logs publicly (understandably), you can open an issue through our issue tracker and also email me (bencoe [at] google.com) so that I can make sure it's escalated immediately: https://issuetracker.google.com/savedsearches/559741
Now, if folks start using @grpc/grpc-js@0.6.9 and it becomes apparent that it is not in fact stable, I will take steps to move us back to grpc immediately in @google-cloud/pubsub (until such time that we are confident).
@bcoe I don't fully understand the fix.
In order to get my system working, a rollback of pubsub to 29.1 wasn't enough. I had to roll back a bunch of other google projects I'm using (logging, storage, kms etc) also. I don't know what combination of rollbacks fixed the problem and so I'm worried about rolling forward now.
What exactly is the proposed fix please?
Which projects do I need to apply the fix to?
I've done a fresh install of 1.1.2 (which is now pulling @grpc/grpc-js@0.6.9) in a project and can confirm it looks stable. I will keep it under observation for the next 24 hours and upgrade the remaining projects based on the results.
After a few more hours of testing, the behaviour has improved but I'm still seeing the odd message being lost and nacked back into the queue, which results in latency spikes. This does not happen with 0.29.1.
This means 0.29.1 is still the latest version that is giving us consistently predictable latency, and 1.1.2 + @grpc/grpc-js@0.6.9 is still not performing as well as 0.29.1.
👋 as mentioned Friday, we're testing early this week with @grpc/grpc-js@0.6.9 as a stable release candidate for @grpc/grpc-js.

We ask that folks upgrade to @grpc/grpc-js@0.6.9, and let us know if you have any trouble doing so. If you continue to run into issues with this new version of the dependency, I ask that we:

- open a new issue for the problem (@pmcnr-hx, I have already done so for the memory issue you've raised);
- run with GRPC_TRACE=all and GRPC_VERBOSITY=DEBUG.

You can open an issue on the issue tracker here to deliver the logs, if there's anything you wish to keep private: https://issuetracker.google.com/savedsearches/559741

You can also send an email to bencoe [at] google.com, so that he can make sure things get escalated appropriately.

If it becomes obvious that things are stable, we will start working on a more significant rollback early this week.
Environment details

- Node.js version: v12.7.0
- npm version: 6.10.0
- @google-cloud/pubsub version: "^1.0.0"

Error:

```
insertId: "gnr3q1fz7eerd"
jsonPayload: {
  level: "error"
  message: "unhandledRejection"
  originalError: {
    ackIds: [1]
    code: 4
    details: "Deadline exceeded"
  }
}
```
After receiving this error, the app does not receive messages anymore and we have to exit the application to recreate the kubernetes pod.
Any help would be appreciated!
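Not a fix, but a minimal sketch of the stop-gap described in the report above (crash on the fatal error so Kubernetes recreates the pod); the code-4 check is an assumption based on the logged error:

```js
// `subscription` is the object returned by pubsub.subscription(...), as in the
// earlier examples in this thread.
subscription.on('error', err => {
  console.error('PubSub subscriber error', err);
  if (err.code === 4) process.exit(1); // let Kubernetes restart the pod
});

// The report above shows the failure surfacing as an unhandledRejection
// (ack deadline exceeded), so catch that path as well.
process.on('unhandledRejection', err => {
  console.error('Unhandled rejection', err);
  if (err && err.code === 4) process.exit(1);
});
```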