ntent / kafka4net

C# client for Kafka
Apache License 2.0

Something is leaking when under pressure #11

Closed frankbohman closed 9 years ago

frankbohman commented 9 years ago

We will try to find out exactly where. We are pushing a high volume of fairly big messages through, with great performance it seems, but they appear to be getting stacked up somewhere.

thunderstumpges commented 9 years ago

Hi, I have not looked much into this issue, but we have also sent data at a pretty high rate (>4,000/s) without "leaking" anything, over quite a long time (weeks). What is "a high amount" and what is "fairly big"? All of these things are relative, I believe, and many factors are involved. Our messages are fairly small (1-5kb), and our network is quite fast (dual bonded 10G Ethernet). I am wondering if you are just sending too much data for the producer/network to keep up with?

Have you checked the settings "AutoGrowSendBuffers" (defaults to true) and "SendBuffersInitialSize" (defaults to 200 messages)? If you are in a situation where the sending routine can't keep up with incoming messages, the internal ring buffer will need to be expanded to hold new messages. The Producer Send() method is always asynchronous, so you will never block on that call. By setting "AutoGrowSendBuffers" to false and setting "SendBuffersInitialSize" to some reasonable limit for your application, you can force a "PermError" to be thrown and have the messages that could not be added to the internal buffer returned to you. That should prevent the "leak" if the issue really is that you cannot send fast enough.
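For reference, a minimal sketch of that fixed-buffer setup. The optional-parameter names below are assumptions based on the setting names above, so check your kafka4net version for the exact ProducerConfiguration signature; the broker list and topic are placeholders:

```csharp
using kafka4net;

// Sketch only: parameter names for the two buffer settings are assumed from the
// setting names described above; broker list and topic are placeholders.
var config = new ProducerConfiguration(
    "my-topic",
    autoGrowSendBuffers: false,       // fail fast instead of growing the ring buffer
    sendBuffersInitialSize: 1000);    // per-partition capacity, in messages

var producer = new Producer("broker1:9092,broker2:9092", config);
await producer.ConnectAsync();

// With auto-grow off, messages that no longer fit in the buffer come back via the
// OnPermError handler instead of accumulating indefinitely.
```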

Another thought: how many partitions/brokers do you have? Sending is done in parallel across the brokers/partitions. If you have fewer partitions than brokers, you won't be sending to all brokers, only to the ones that own partitions. The buffer settings above are also per-partition, so having very few partitions can lead to issues here as well. And if partition leadership is not balanced across your brokers, you can bottleneck on a single broker too.

Finally, you can enable debug logging for the producer; you should be able to get information about buffer resize events (that one is actually a WARN-level log) as well as send times, batch sizes, etc.

Good luck, and if you prefer to capture a log and send it our way, we might be able to help determine the issue.

frankbohman commented 9 years ago

Thank you for your response. Your client is by far the fastest one we have tried, so we really want it to work for us.

We tried the AutoGrowSendBuffers=false option, capturing the messages in OnPermError and putting them back in a send queue. However, the idea that something might be wonky with our leaders/partitions might be a good lead to investigate: we use 10 partitions, and looking at the open TCP connections it seems there are only 4 open.

Also, this "maybe not a" leak seems to appear when we send data bigger than some size; otherwise we have had ~22k throughput without problems.

I will try to document exactly what payload sizes we are sending. It's actually a bunch of data that's being moved and serialized from a legacy system, which is why I don't already know that :)

Thank you very much for your help! Best regards /F

thunderstumpges commented 9 years ago

OK, well I am guessing that you need a little more instrumentation and experimentation. If you set AutoGrowSendBuffers to false and get messages back in OnPermError, just putting them somewhere else in an in-memory queue doesn't solve the problem, which is that you can't keep up with the rate you're trying to send at. If at all possible, when you get that error, you should throttle the sending rate. Another way to throttle is to keep a count of "pending messages" by registering the "OnSuccess" handler as well: each time you Send(), increment a count; on each OnSuccess, decrement it; pause sending above some reasonable size.
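A rough sketch of that pending-message counter, assuming a connected kafka4net producer; the OnSuccess delegate shape (a batch of delivered messages) and the threshold value are assumptions, so adjust them to your version and memory budget:

```csharp
using System.Threading;
using System.Threading.Tasks;
using kafka4net;

// Sketch of the throttle described above: count messages handed to Send() but
// not yet confirmed, and back off when too many are in flight.
class ThrottledSender
{
    readonly Producer _producer;       // connected kafka4net producer from the earlier setup
    int _pending;
    const int MaxPending = 50_000;     // assumed limit; pick one that fits your memory budget

    public ThrottledSender(Producer producer)
    {
        _producer = producer;
        // Assumption: OnSuccess hands back the batch of messages that were delivered.
        _producer.OnSuccess += batch => Interlocked.Add(ref _pending, -batch.Length);
    }

    public async Task SendAsync(Message msg)
    {
        // Back off while the producer is behind instead of queueing without bound.
        while (Volatile.Read(ref _pending) >= MaxPending)
            await Task.Delay(10);

        Interlocked.Increment(ref _pending);
        _producer.Send(msg);
    }
}
```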

As for your open connections, with 10 partitions you should see up to 10 connections (if you have 10 brokers). If you have only 4 brokers, you'd only see 4 connections.

Good luck, and I'd love to hear your overall throughput (both messages/sec and bytes/sec) when you get it sorted; sounds like you're giving it a good push!

vchekan commented 9 years ago

@frankbohman you will have better chances of being helped if you post the essential details. You mentioned you are getting some errors in the perm error handler, but you have not specified which errors those are.

Generally speaking, the perm error handler is triggered when the driver believes it cannot recover from a particular error; for example, a message is larger than the Kafka server will accept, or a partition does not exist. Maybe you are re-queueing messages which cannot be sent for some reason, and as time goes on there are more and more of them?
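A small sketch of how you might surface which errors are actually firing before deciding whether to re-queue; the OnPermError delegate shape here is an assumption, so adapt it to your kafka4net version:

```csharp
using System;

// Sketch: log the permanent error before deciding what to do with the returned
// messages (delegate shape assumed; "producer" is the connected kafka4net producer).
producer.OnPermError += (exception, failedMessages) =>
{
    Console.WriteLine($"PermError {exception.GetType().Name}: {exception.Message} " +
                      $"({failedMessages.Length} messages affected)");

    // Only re-queue messages that can actually succeed on a retry; for example, a
    // message larger than the broker accepts will fail again every single time.
};
```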

frankbohman commented 9 years ago

Thanks again for the replies. By chance we did exactly as suggested today, that is: a counter to keep track of sent vs successfully sent messages, and it works like a charm :). We also have the buffer size fixed and auto-grow turned off, of course.

The message sizes vary between 2 bytes and 700k, and as you said, it seems the streams with the larger messages are the ones backing up, and the fact that we just put them back in the queue wasn't a very good solution :)

I will get back to you with our findings once we have tested the solution a bit more. The help you already provided really got us going in the right direction, so I'm hopeful this will turn out to be a great thing.

While we're at it, are there any plans to add compression to the library?

regards /F

vchekan commented 9 years ago

@frankbohman glad you made progress on the issue. Let us know if you have more problems.

If you are familiar with Windows Event Tracing, you can get a lot of useful information by using a tracing tool (I would recommend PerfView if you do not have a favorite yet). The driver publishes all essential details of its internal workings under the "kafka4net" source: https://github.com/ntent-ad/kafka4net/blob/master/src/Tracing/EtwTrace.cs. I'll try to find time to document its usage. PerfView supports memory profiling too, which could be useful in your case. Another option is Visual Studio 2015, which has an embedded memory profiler.
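If you would rather capture those events in-process instead of with PerfView, a minimal EventListener along these lines can subscribe to the "kafka4net" EventSource mentioned above (a sketch only; the event names and payloads you see will depend on the driver version):

```csharp
using System;
using System.Diagnostics.Tracing;

// In-process alternative to PerfView: subscribe to the "kafka4net" EventSource
// directly and dump its events to the console.
class KafkaTraceListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        if (eventSource.Name == "kafka4net")
            EnableEvents(eventSource, EventLevel.Verbose);
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        var payload = eventData.Payload == null ? "" : string.Join(", ", eventData.Payload);
        Console.WriteLine($"[{eventData.Level}] {eventData.EventName}: {payload}");
    }
}
```

Keeping one instance alive for the lifetime of the application (for example, `var etwListener = new KafkaTraceListener();` at startup) is enough to start receiving events.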

Log4net/NLog are supported too, but with less detail. The main goal there is events important to the user application, such as a broker going down/up, connection failures and recovery, etc. See an example of NLog integration here: https://github.com/ntent-ad/kafka4net/blob/master/tests/RecoveryTest.cs#L85

Compression is on the short list, but right now we do not have a pressing need; it is a "nice to have" priority for us, so no immediate plans.

avoxm commented 8 years ago

It looks like there really is a memory leak. The issue is that after sending a big chunk of data, kafka4net still holds references to some of it, which keeps that memory from being collected.

This is quite easy to reproduce:

  1. Bulk send (1 million messages, probably)
  2. Wait till everything is sent
  3. Close connection
  4. Run GC
  5. You will notice that kafka4net still keeps a reference to a large amount of memory.

You can also run a memory profiler to get a better idea.
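A rough repro sketch of those steps; the broker address, topic, payload size, and message count are placeholders, and the API usage follows the typical kafka4net producer pattern, which may differ in detail between versions:

```csharp
using System;
using System.Text;
using System.Threading.Tasks;
using kafka4net;

class LeakRepro
{
    static async Task Main()
    {
        var producer = new Producer("broker1:9092", new ProducerConfiguration("leak-test"));
        await producer.ConnectAsync();

        // 1. Bulk send a large number of messages.
        var payload = Encoding.UTF8.GetBytes(new string('x', 10_000));
        for (var i = 0; i < 1_000_000; i++)
            producer.Send(new Message { Value = payload });

        // 2-3. Wait until everything is flushed, then close the connection.
        await producer.CloseAsync(TimeSpan.FromMinutes(2));

        // 4. Force a full collection.
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        // 5. Inspect what is still reachable (or attach a memory profiler here).
        Console.WriteLine($"Managed heap after close: {GC.GetTotalMemory(true):N0} bytes");
    }
}
```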

I will open a new issue.