Hi @tulios
Um, I'm still having all sorts of problems arising here and there in the processes. Sometimes it's the lock contention error, sometimes it's the following errors:
{"level":"WARN","timestamp":"2019-03-11T01:21:22.932Z","logger":"kafkajs","message":"[RequestQueue] Response without match","clientId":"ps-8682-router","broker":"10.240.0.7:9092","correlationId":4278511}
{"level":"ERROR","timestamp":"2019-03-11T01:21:22.947Z","logger":"kafkajs","message":"[Producer] Request Metadata(key: 3, version: 4) timed out","retryCount":0,"retryTime":359}
It's also good to mention that I see these two error types way, way more often than the lock contention.
In our use case, a consumer consumes messages, say a million, and sends them out (downstream, system generated). Then, from the clients, the process receives 10x more messages (upstream, client generated) that need to be produced to some other Kafka topics.
What I suppose happens is that, due to the high load, lots of events are created in the same event loop tick (load on the producer side), delaying the consumer's periodic heartbeat until the broker loses track of the consumer. The consumer then tries to reconnect to the cluster, but is faced with the "Response without match" error, which I don't actually know the reason for. It seems to me that the producers are somehow overloaded (because they have to handle a very large, spiky load), then they retry and that only makes things worse, so consumer sessions get invalidated because the CPU is completely exhausted and there's no time for a heartbeat.
I don't know if my intuition is right here, it's just something I have in mind, but I'd be glad if you could provide any insights.
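To make that intuition concrete, here is a toy Node.js script (nothing KafkaJS-specific; the numbers and the stand-in "work" are made up) showing how a flood of promises created in one tick delays a timer that plays the role of the periodic heartbeat:

```js
// Toy illustration of the intuition above, not KafkaJS code: a timer plays
// the role of the consumer heartbeat, and a burst of 100k microtasks created
// in a single tick delays it well past its interval.
const HEARTBEAT_INTERVAL_MS = 100

let last = Date.now()
const heartbeat = setInterval(() => {
  const now = Date.now()
  console.log(`heartbeat after ${now - last}ms (expected ~${HEARTBEAT_INTERVAL_MS}ms)`)
  last = now
}, HEARTBEAT_INTERVAL_MS)

// Simulate the overloaded producer: 100k promises created in the same tick,
// each doing a little CPU work (a serialization stand-in) when it settles.
setTimeout(() => {
  for (let i = 0; i < 100000; i++) {
    Promise.resolve().then(() => JSON.stringify({ i, payload: 'x'.repeat(200) }))
  }
}, 250)

setTimeout(() => clearInterval(heartbeat), 2000)
```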
Hi @AlirezaSadeghi, this is a high priority issue for me, I'll bug you a bit this week about this. I'm assuming that the "Response without match" errors are happening on version 1.5.0, right? I have a fix for it (it's related to #309).
Are the producers and consumers on the same machine (or in the same Node.js process)? The requests are timing out and the lock is timing out, so it's related to exhaustion of the event loop. I'll investigate the issue this week and get back to you with more data.
Just as temporary mitigation you can increase connectionTimeout and requestTimeout.
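For reference, both of those are client-level options in KafkaJS; raising them looks roughly like this (the values are just examples, and the client id and broker address are taken from the logs above):

```js
const { Kafka } = require('kafkajs')

const kafka = new Kafka({
  clientId: 'ps-8682-router',
  brokers: ['10.240.0.7:9092'],
  connectionTimeout: 60000, // default is 1000 ms
  requestTimeout: 60000,    // default is 30000 ms
})
```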
That would be great! I'll be going all out to help fix this issue because it's really reducing the reliability of the messaging mechanism.
About the consumers and producers: yes, they are in the same process; there's basically one consumer and one producer per process. (i.e. app.js starts up the producer first and then starts the consumer; the consumer starts consuming messages and sending them out, and then messages start coming back into the process, at which point the producer sends them back to a Kafka topic.)
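A rough sketch of that topology, assuming a hypothetical group id, hypothetical topic names, and a hypothetical queueMessage helper (the real app.js isn't shown in this thread):

```js
const { Kafka } = require('kafkajs')

const kafka = new Kafka({ clientId: 'ps-8682-router', brokers: ['10.240.0.7:9092'] })
const producer = kafka.producer()
const consumer = kafka.consumer({ groupId: 'router-group' }) // hypothetical group id

// Hypothetical helper: upstream messages coming back from the clients are
// produced to another topic through the same in-process producer.
const queueMessage = (topic, value) =>
  producer.send({ topic, messages: [{ value }] })

const run = async () => {
  await producer.connect()
  await consumer.connect()
  await consumer.subscribe({ topic: 'downstream-commands' }) // hypothetical topic

  await consumer.run({
    eachMessage: async ({ message }) => {
      // 1. send the consumed message downstream to the clients (socket layer not shown)
      // 2. client replies arrive later and are produced via queueMessage(...)
    },
  })
}

run().catch(console.error)
```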
And initially I was using 1.4.7; I'm giving 1.5.0 a try now to see how it behaves. I'll also increase those two timeouts again, but they're already at 60 seconds or so.
Hey @AlirezaSadeghi, I'm still looking at this. I made some tests, and I have a lead on the problem, but I haven't found the time to debug it fully.
@AlirezaSadeghi I have performed a lot of CPU profiling, and I couldn't find anything yet. One thing that I noticed while debugging was that in your example you never await queueMessage, which means you are generating 100k promises on the same tick. Was that a typo, or is this how the system works?
@AlirezaSadeghi I think I improved things for you. It turned out that I was a victim of my own recommendations: the lock release mechanism was using Promise.all, and looking at your example the code could reach 1k+ locked resources, which would overload the system. PR #323 should improve this; I've tried the change with your example, and it worked fine.
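A simplified mutex sketch (not the actual KafkaJS Lock, nor exactly what #323 changes) of why a Promise.all release hurts under heavy contention: waking every waiter at once turns each release into 1k+ promise continuations on the same tick, whereas waking only the next waiter keeps the per-release work constant.

```js
// Simplified sketch of the contention problem, not the KafkaJS implementation.
class Lock {
  constructor() {
    this.locked = false
    this.waiters = []
  }

  async acquire() {
    while (this.locked) {
      await new Promise(resolve => this.waiters.push(resolve))
    }
    this.locked = true
  }

  // Release everyone: wake *every* waiter with Promise.all. With 1k+ queued
  // callers, one release schedules 1k+ continuations in the same tick; only
  // one of them wins the lock and the rest re-queue, so the churn repeats.
  releaseAll() {
    this.locked = false
    const waiters = this.waiters.splice(0)
    return Promise.all(waiters.map(resolve => resolve()))
  }

  // Release one: wake only the next waiter, so the work done per release
  // stays constant no matter how many callers are queued up.
  releaseNext() {
    this.locked = false
    const next = this.waiters.shift()
    if (next) next()
  }
}
```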
Um, yeah, that's how the system works. Messages are received through sockets in parallel, and lots of emits might be called in a single tick; that's why I didn't put the await there when calling the queueMessage function. I don't think we can reproduce what happens there if we have the await keyword and everything is handled sequentially.
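For context, the fire-and-forget pattern being described looks roughly like this (the socket layer and queueMessage are stand-ins for the app's own code):

```js
const { EventEmitter } = require('events')

// Stand-ins for the app's real socket layer and its queueMessage helper.
const socket = new EventEmitter()
const queueMessage = async (topic, payload) => { /* producer.send(...) */ }

socket.on('message', payload => {
  // No await here: a burst of 'message' events in one tick creates that many
  // in-flight produce promises at once, all competing for the event loop.
  queueMessage('upstream-replies', payload).catch(err => {
    console.error('failed to produce upstream message', err)
  })
})

// Simulate the upstream burst: lots of emits in a single tick.
for (let i = 0; i < 100000; i++) {
  socket.emit('message', `reply-${i}`)
}
```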
Also, thanks for the fix. Any idea how long it would take for this to end up in a release? If not, I'll just take the code in master and spin it in production to see how it performs.
If you can, I would recommend taking the code in master for a while, but I think the improvement is significant enough to release 1.5.2. I want to make sure that I won't release 1.5.4 right afterward with another fix, so I like to let it sink in for a while. But we can get a new version next week.
@AlirezaSadeghi 1.5.2 was released with the fix.
From a comment on issue #177
@AlirezaSadeghi I moved the investigation to a new issue