mozilla-services / heka

DEPRECATED: Data collection and processing made easy.
http://hekad.readthedocs.org/

KafkaOutput: Failed Connections Are Not Cleaned Up #1518

Open emam0 opened 9 years ago

emam0 commented 9 years ago

I have a Heka KafkaOutput configured with a 3-node Kafka cluster. I noticed that after network disruptions or Kafka outages, some nodes running Heka would run into this scenario:

  1. Increased number of stalled connections to the Kafka cluster, all with a CLOSE_WAIT status.
  2. The heka daemon's default max open files limit is 1024 (soft) and 4096 (hard), so once the Heka process reaches this limit it fails to open any new connections or files. The output of lsof -p [heka's PID] shows over a thousand open files (socket files).
  3. Error logs start accumulating, which bloats the daemon's log file. For example, Input 'SyslogInput' error: open /var/log/syslog: too many open files (which is from a different input).
  4. Heka considers the Kafka cluster to be in the middle of an election and ceases to open new connections, logging this error: kafka server: In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes.

Apparently, Heka is not cleaning up failed connections to the Kafka cluster. Can you please help?
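
For anyone who wants to track symptom 1 over time, here is a minimal Go sketch that counts sockets in CLOSE_WAIT by reading /proc/net/tcp. It assumes Linux, where the kernel reports CLOSE_WAIT as state code 08 in that file. The helper name countCloseWait is made up for this illustration and is not part of Heka. Note that /proc/net/tcp covers every IPv4 TCP socket in the network namespace, not only Heka's, so lsof -p remains the per-process check.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// tcpCloseWait is the hex state code the Linux kernel prints for CLOSE_WAIT
// in the "st" column of /proc/net/tcp.
const tcpCloseWait = "08"

// countCloseWait (hypothetical helper) counts the entries in
// /proc/net/tcp-formatted text whose state column is CLOSE_WAIT.
func countCloseWait(procNetTCP string) int {
	count := 0
	scanner := bufio.NewScanner(strings.NewReader(procNetTCP))
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		// Columns: sl, local_address, rem_address, st, ...
		// The header row has "st" in this position, so it never matches.
		if len(fields) > 3 && fields[3] == tcpCloseWait {
			count++
		}
	}
	return count
}

func main() {
	data, err := os.ReadFile("/proc/net/tcp")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read /proc/net/tcp:", err)
		return
	}
	fmt.Println("sockets in CLOSE_WAIT:", countCloseWait(string(data)))
}
```

A count that keeps growing after a Kafka outage would be consistent with the leak described above: the broker closed its end, but the local process never closed its own socket.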

rustamk commented 9 years ago

We are seeing the same issue in our testing.

We hope the fix will include a check that detects when Kafka comes back to life, so that Heka resumes sending data.

rustamk commented 9 years ago

Is there any update on this issue by any chance?

trixpan commented 9 years ago

@muhammad-emam, @rustamk

Kafka output has some issues around redundancy in case of dead clusters. I stepped back from using it once I was unable to get the hekad processes to survive a disconnect from the cluster without losing messages.

AdeMiller commented 9 years ago

I'm seeing something similar for the Kafka input. The net result is high CPU and memory usage and Heka becoming wedged.

2015/11/09 15:28:03 Input 'kafka_input_de1_ie_canary_015' error: kafka server: In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes.
2015/11/09 15:28:03 Input 'kafka_input_ne1_ie_zenoss_015' error: kafka server: In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes.
2015/11/09 15:28:03 Input 'kafka_input_ca2_analytics_replication_011' error: kafka server: In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes.
2015/11/09 15:28:11 Diagnostics: 63 packs have been idle more than 120 seconds.
2015/11/09 15:28:11 Diagnostics: (input) Plugin names and quantities found on idle packs:
2015/11/09 15:28:11 Diagnostics:    elasticsearch_output_platform_nlog_00: 1
2015/11/09 15:28:11 Diagnostics:    elasticsearch_output_ateam_appfog_00: 4
2015/11/09 15:28:11 Diagnostics:    topic_stats_short: 63
2015/11/09 15:28:11 Diagnostics:    elasticsearch_output_ie_zenoss_04: 7
2015/11/09 15:28:11 Diagnostics:    elasticsearch_output_ie_zenoss_07: 4
2015/11/09 15:28:11 Diagnostics:    elasticsearch_output_ie_zenoss_05: 4
2015/11/09 15:28:11 Diagnostics:    topic_stats_long: 63
2015/11/09 15:28:11 Diagnostics:    elasticsearch_output_ie_zenoss_02: 11
2015/11/09 15:28:11 Diagnostics:    elasticsearch_output_ie_zenoss_00: 7
2015/11/09 15:28:11 Diagnostics:    elasticsearch_output_ie_zenoss_01: 12
2015/11/09 15:28:11 Diagnostics:    location_stats_short: 63
2015/11/09 15:28:11 Diagnostics:    kafka_output: 32
2015/11/09 15:28:11 Diagnostics:    location_stats_long: 63
2015/11/09 15:28:11 Diagnostics:    elasticsearch_output_ie_zenoss_03: 9
2015/11/09 15:28:11 Diagnostics:    elasticsearch_output_ie_zenoss_06: 4
2015/11/09 15:28:11

JozoVilcek commented 8 years ago

We just got bitten by the connection leak. This is quite an old issue, and I see more open issues for Kafka support. Is this going to get some care any time soon? Or should we abandon using Heka with Kafka? What are the long-term plans?

jakobkylberg commented 8 years ago

We've been hit by the connection leak in our environment too. Like JozoVilcek in the comment above, I would like to know what the plans are regarding this issue. It would be very much appreciated if this issue could get some attention.