tulios / kafkajs

A modern Apache Kafka client for node.js
https://kafka.js.org
MIT License

Crash: Error: value should be an instance of Encoder #1153

Open nikakaltura opened 3 years ago

nikakaltura commented 3 years ago

Describe the bug

We are using Nest.js with Kafka as the message broker. KafkaJS worked great until now. We have multiple instances of the API gateway, which listens to the reply topic. The API gateway has 6 Kafka clients, each listening to its own topic: the Module A client listens to topic A only, the Module B client listens to topic B only, and so on.

When we deploy api-gw, all 6 at the same time, the old instances are naturally killed. So 6 consumers leave and another 6 join at the same time. This causes a lot of rebalances, which wasn't an issue until now. Now, after all these rebalances, KafkaJS throws 2 errors.

First: ERROR [Consumer] Crash: KafkaJSNumberOfRetriesExceeded: Specified group generation id is not valid

And after that: ERROR [Consumer] Crash: Error: value should be an instance of Encoder

After the second error, all Kafka clients die, all connections are dropped, no retries are triggered, and the API gateway just hangs, returning empty responses to the client.

To Reproduce

If none of the above are possible to provide, please write down the exact steps to reproduce the behavior:

  1. Connect 6 consumers to a topic.
  2. Connect 6 more consumers to the topic and kill the original 6 consumers.
  3. Wait for an error. If there is no error, try a couple of times.

Expected behavior

To connect to Kafka after a rebalance.

Environment:

nikakaltura commented 3 years ago

I was able to temporarily fix the issue by increasing maxRetryTime to a higher number.
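For anyone trying the same workaround, here is a minimal sketch of where that setting lives. `retry.maxRetryTime`, `retry.initialRetryTime` and `retry.retries` are KafkaJS client options; the helper name `buildClientConfig`, the client id and the concrete numbers are assumptions for illustration, not values taken from this thread.

```javascript
// Sketch: build KafkaJS client options with a larger retry budget.
// The helper name, client id and numbers are illustrative assumptions.
function buildClientConfig(brokers) {
  return {
    clientId: 'api-gateway', // hypothetical client id
    brokers,
    retry: {
      initialRetryTime: 300, // ms to wait before the first retry (KafkaJS default)
      maxRetryTime: 60000,   // upper bound for a single retry wait, in ms (default is 30000)
      retries: 10,           // retries per call before giving up (default is 5)
    },
  }
}

// Usage (requires the kafkajs package to be installed):
// const { Kafka } = require('kafkajs')
// const kafka = new Kafka(buildClientConfig(['localhost:9092']))
```

This only gives the group more time to settle during a rebalance storm; it does not explain the Encoder error.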

Nevon commented 3 years ago

The second error you mentioned is interesting. Could you set up a listener for the CRASH event and log out not just the error message, but also error.stack? The error is being thrown from here, probably because it's being called with null or undefined, but this method is called from a lot of places, so it would be interesting to understand from where:

https://github.com/tulios/kafkajs/blob/813cbfdfae6107f428befc376457d4af208fe921/src/protocol/encoder.js#L255
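A rough sketch of such a listener, in case it helps. `consumer.events.CRASH` and `consumer.on` are part of the KafkaJS instrumentation API, and the crash event carries the error in `payload.error`; the helper name `logCrashes` and the log format are made up for this example.

```javascript
// Sketch: log the full stack trace when a KafkaJS consumer crashes.
// `consumer` is any object exposing `events.CRASH` and `on()`, as a KafkaJS consumer does.
// `logCrashes` is a hypothetical helper name, not part of KafkaJS.
function logCrashes(consumer, log = console.error) {
  // on() returns a function that removes the listener again.
  return consumer.on(consumer.events.CRASH, event => {
    const { error, groupId } = event.payload
    log(`[${groupId}] consumer crashed: ${error.message}\n${error.stack}`)
  })
}

// Usage (requires the kafkajs package to be installed):
// const consumer = kafka.consumer({ groupId: 'my-group' })
// const removeListener = logCrashes(consumer)
```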

If you have the ability to reproduce this locally, it would also be super helpful to have debug output including protocol buffers from when this happens. Running with KAFKAJS_LOG_LEVEL=debug KAFKAJS_DEBUG_PROTOCOL_BUFFERS=1 would output this, but also increase logging volume by a tonne, so probably you don't want to do this in production.

Note that this would contain any data sent between your client and the broker, so if you can only reproduce this with real customer data, this might not be an option. If you can trigger it with test data, that would be 💯

Kannndev commented 3 years ago

Hi @Nevon we are also facing the same issue. I have pasted the error stack below. Can you please help us with this?

Crash: Error: value should be an instance of Encoder

Error: value should be an instance of Encoder
    at Encoder.writeEncoder (/usr/src/app/node_modules/kafkajs/src/protocol/encoder.js:211:13 undefined)
    at /usr/src/app/node_modules/kafkajs/src/protocol/encoder.js:281:18
    at Array.forEach (<anonymous>)
    at Encoder.writeArray (/usr/src/app/node_modules/kafkajs/src/protocol/encoder.js:272:13 undefined)
    at /usr/src/app/node_modules/kafkajs/src/consumer/assignerProtocol.js:51:44
    at Array.map (<anonymous>)
    at Object.encode (/usr/src/app/node_modules/kafkajs/src/consumer/assignerProtocol.js:50:33 undefined)
    at /usr/src/app/node_modules/@nestjs/microservices/helpers/kafka-reply-partition-assigner.js:108:78
    at Array.map (<anonymous>)
    at KafkaReplyPartitionAssigner.assign (/usr/src/app/node_modules/@nestjs/microservices/helpers/kafka-reply-partition-assigner.js:106:40 undefined)
    at ConsumerGroup.[private:ConsumerGroup:sync] (/usr/src/app/node_modules/kafkajs/src/consumer/consumerGroup.js:176:35 undefined)
    at processTicksAndRejections .processTicksAndRejections (internal/process/task_queues.js:93:5 undefined)
    at async /usr/src/app/node_modules/kafkajs/src/consumer/consumerGroup.js:291:9
    at async Runner.join (/usr/src/app/node_modules/kafkajs/src/consumer/runner.js:77:5 undefined)
    at async Runner.start (/usr/src/app/node_modules/kafkajs/src/consumer/runner.js:100:7 undefined)
    at async start .async start (/usr/src/app/node_modules/kafkajs/src/consumer/index.js:236:7 undefined)

stavalfi commented 3 years ago

I'm also seeing this error in versions 1.16.0-beta.18 and 1.16.0-beta.22.

Nevon commented 3 years ago

@Kannndev: I don't think your issue is the same. Since you're using a custom partition assigner, my bet would be that it's returning something that's not a valid assignment, and you're simply running into validation that's telling you that your assignment isn't valid.
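One way to narrow this down is to check the assignment inside the custom assigner before it is passed to `AssignerProtocol.MemberAssignment.encode`. The sketch below assumes the assignment has the shape `{ [topic]: [partition, ...] }` with integer partitions, as described in the KafkaJS custom partition assigner docs; the helper name `findInvalidAssignmentEntries` is hypothetical.

```javascript
// Sketch: report entries in an assignment that are unlikely to encode cleanly.
// Assumed shape: { [topicName]: [partitionNumber, ...] }.
// `findInvalidAssignmentEntries` is a hypothetical helper, not part of KafkaJS.
function findInvalidAssignmentEntries(assignment) {
  const problems = []
  for (const [topic, partitions] of Object.entries(assignment)) {
    if (!Array.isArray(partitions)) {
      problems.push(`${topic}: partitions is not an array`)
      continue
    }
    partitions.forEach((partition, index) => {
      // null, undefined and objects are the likely culprits here.
      if (!Number.isInteger(partition)) {
        problems.push(`${topic}[${index}]: expected an integer, got ${String(partition)}`)
      }
    })
  }
  return problems
}

// Usage inside a custom assigner, before encoding each member's assignment:
// const problems = findInvalidAssignmentEntries(assignment)
// if (problems.length > 0) console.error('Invalid assignment', problems)
```

Logging the offending entries during a rebalance should show whether the assigner is handing out a null or non-numeric partition.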

stavalfi commented 3 years ago

My mistake: I was passing an object as "topic" instead of a string.
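A small guard like the following would surface that kind of mistake with a clearer message. This is only a sketch: `assertTopicName` is a hypothetical helper, and it assumes the topic is meant to be a plain string (for example the `topic` field passed to `producer.send`).

```javascript
// Sketch: fail early with a readable error when a topic is not a string.
// `assertTopicName` is a hypothetical helper, not part of KafkaJS.
function assertTopicName(topic) {
  if (typeof topic !== 'string' || topic.length === 0) {
    throw new TypeError(`topic must be a non-empty string, got ${typeof topic}`)
  }
  return topic
}

// Usage (requires a connected KafkaJS producer):
// await producer.send({ topic: assertTopicName(topic), messages })
```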