hdinsight / hdinsight-storm-examples

This is a repository for complete and easy to use samples that demonstrate the use of Apache Storm on HDInsight
Apache License 2.0

Connection closed for unknown reason #13

Closed BennyM closed 8 years ago

BennyM commented 8 years ago

Lately I've been having a lot of issues running topologies. Two errors occur quite often, giving no indication as to what's wrong.

2015-08-12 08:58:08 b.s.d.executor [ERROR] 
java.lang.RuntimeException: com.microsoft.eventhubs.client.EventHubException: org.apache.qpid.amqp_1_0.client.ConnectionClosedException: Connection closed for unknown reason
    at com.microsoft.eventhubs.spout.EventHubSpout.open(EventHubSpout.java:156) ~[stormjar.jar:na]
    at backtype.storm.daemon.executor$fn__5064$fn__5079.invoke(executor.clj:542) ~[storm-core-0.9.3.2.2.7.1-0004.jar:0.9.3.2.2.7.1-0004]
    at backtype.storm.util$async_loop$fn__550.invoke(util.clj:463) ~[storm-core-0.9.3.2.2.7.1-0004.jar:0.9.3.2.2.7.1-0004]
    at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
    at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
Caused by: com.microsoft.eventhubs.client.EventHubException: org.apache.qpid.amqp_1_0.client.ConnectionClosedException: Connection closed for unknown reason
    at com.microsoft.eventhubs.client.EventHubConsumerGroup.ensureSessionCreated(EventHubConsumerGroup.java:64) ~[stormjar.jar:na]
    at com.microsoft.eventhubs.client.EventHubConsumerGroup.createReceiver(EventHubConsumerGroup.java:39) ~[stormjar.jar:na]
    at com.microsoft.eventhubs.client.ResilientEventHubReceiver.initialize(ResilientEventHubReceiver.java:63) ~[stormjar.jar:na]
    at com.microsoft.eventhubs.spout.EventHubReceiverImpl.open(EventHubReceiverImpl.java:74) ~[stormjar.jar:na]
    at com.microsoft.eventhubs.spout.SimplePartitionManager.open(SimplePartitionManager.java:77) ~[stormjar.jar:na]
    at com.microsoft.eventhubs.spout.EventHubSpout.preparePartitions(EventHubSpout.java:134) ~[stormjar.jar:na]
    at com.microsoft.eventhubs.spout.EventHubSpout.open(EventHubSpout.java:153) ~[stormjar.jar:na]
    ... 4 common frames omitted
Caused by: org.apache.qpid.amqp_1_0.client.ConnectionClosedException: Connection closed for unknown reason
    at org.apache.qpid.amqp_1_0.client.Connection.checkNotClosed(Connection.java:338) ~[stormjar.jar:na]
    at org.apache.qpid.amqp_1_0.client.Connection.createSession(Connection.java:322) ~[stormjar.jar:na]
    at com.microsoft.eventhubs.client.EventHubConsumerGroup.ensureSessionCreated(EventHubConsumerGroup.java:61) ~[stormjar.jar:na]
    ... 10 common frames omitted
2015-08-12 08:59:44 b.s.d.executor [ERROR] 
java.lang.NullPointerException: null
    at org.apache.qpid.amqp_1_0.transport.ConnectionEndpoint.getFirstFreeChannel(ConnectionEndpoint.java:327) ~[stormjar.jar:na]
    at org.apache.qpid.amqp_1_0.transport.ConnectionEndpoint.createSession(ConnectionEndpoint.java:230) ~[stormjar.jar:na]
    at org.apache.qpid.amqp_1_0.client.Session.<init>(Session.java:58) ~[stormjar.jar:na]
    at org.apache.qpid.amqp_1_0.client.Connection.createSession(Connection.java:323) ~[stormjar.jar:na]
    at com.microsoft.eventhubs.client.EventHubConsumerGroup.ensureSessionCreated(EventHubConsumerGroup.java:61) ~[stormjar.jar:na]
    at com.microsoft.eventhubs.client.EventHubConsumerGroup.createReceiver(EventHubConsumerGroup.java:39) ~[stormjar.jar:na]
    at com.microsoft.eventhubs.client.ResilientEventHubReceiver.initialize(ResilientEventHubReceiver.java:63) ~[stormjar.jar:na]
    at com.microsoft.eventhubs.spout.EventHubReceiverImpl.open(EventHubReceiverImpl.java:74) ~[stormjar.jar:na]
    at com.microsoft.eventhubs.spout.SimplePartitionManager.open(SimplePartitionManager.java:77) ~[stormjar.jar:na]
    at com.microsoft.eventhubs.spout.EventHubSpout.preparePartitions(EventHubSpout.java:134) ~[stormjar.jar:na]
    at com.microsoft.eventhubs.spout.EventHubSpout.open(EventHubSpout.java:153) ~[stormjar.jar:na]
    at backtype.storm.daemon.executor$fn__5064$fn__5079.invoke(executor.clj:542) [storm-core-0.9.3.2.2.7.1-0004.jar:0.9.3.2.2.7.1-0004]
    at backtype.storm.util$async_loop$fn__550.invoke(util.clj:463) [storm-core-0.9.3.2.2.7.1-0004.jar:0.9.3.2.2.7.1-0004]
    at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
    at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
ravitandonrt commented 8 years ago

@BennyM Thanks for reporting and for using our service. One cause I know of is that your connection is being throttled. These errors are usually retried, i.e. the spout task will re-establish the connection if it's dropped. Did you see that happen?

The error message should certainly be improved to make that clear (if that's the case). I know that it is much clearer in the C# client for EventHubs.

Understanding the EventHubs throughput units

See the FAQ section of http://azure.microsoft.com/en-us/pricing/details/event-hubs/ for how throttling is enforced.

EventHubs has a concept of throughput units, which you can find under the "scale" tab of your namespace in the Azure Portal. Each unit means 1 MB/s ingress and 2 MB/s egress.

This bandwidth is shared across your entire namespace, not allocated to a single EventHubs. So if you have multiple services underneath this namespace, they will share this throughput limit.

You can increase this up to 20 throughput units in the Azure Portal, which gives you 20x the throughput across your namespace. In a namespace with a single EventHubs of 8 partitions, one should be able to get roughly 2.5 MB/s ingress and 5 MB/s egress per partition. A single partition cannot go beyond 5 MB/s.
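
As a rough sanity check on the numbers above, the per-partition share can be estimated with a few lines of Java. This is only a sketch: it assumes the namespace's throughput units are split evenly across all partitions (real load is rarely that uniform), it does not model the per-partition cap mentioned above, and the class and method names are made up for illustration.

```java
// Rough per-partition throughput estimate for an EventHubs namespace.
// Assumptions: 1 MB/s ingress and 2 MB/s egress per throughput unit (as
// stated above), shared evenly by every partition in the namespace.
// ThroughputEstimate is a hypothetical helper, not part of any SDK.
final class ThroughputEstimate {
    static final double INGRESS_MB_PER_UNIT = 1.0;
    static final double EGRESS_MB_PER_UNIT = 2.0;

    /** Ingress share of one partition, in MB/s. */
    static double ingressPerPartition(int throughputUnits, int totalPartitions) {
        check(throughputUnits, totalPartitions);
        return throughputUnits * INGRESS_MB_PER_UNIT / totalPartitions;
    }

    /** Egress share of one partition, in MB/s. */
    static double egressPerPartition(int throughputUnits, int totalPartitions) {
        check(throughputUnits, totalPartitions);
        return throughputUnits * EGRESS_MB_PER_UNIT / totalPartitions;
    }

    private static void check(int throughputUnits, int totalPartitions) {
        if (throughputUnits < 1 || totalPartitions < 1) {
            throw new IllegalArgumentException("units and partitions must be >= 1");
        }
    }

    public static void main(String[] args) {
        int units = 20;      // maximum selectable in the portal, per the text above
        int partitions = 8;  // single EventHubs with 8 partitions
        System.out.println("ingress MB/s per partition: " + ingressPerPartition(units, partitions));
        System.out.println("egress MB/s per partition: " + egressPerPartition(units, partitions));
    }
}
```

If several EventHubs live in the same namespace, pass the total partition count across all of them, since the throughput units are shared.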

As the number of partitions cannot be changed once an EventHubs is created, it's best to choose the partition count based on how your data is categorized. Scaling should be handled by increasing the throughput units as your application grows.

If you have higher needs than usual, you can contact Azure support to increase the number of partitions and the throughput (in blocks of 20) to larger numbers, such as 128 partitions and 100 units.

Should you be interested, EventHubs can also provide throughput up to 1 GB/s through enterprise contracts.

Hope this helps; let me know if you have follow-up questions. I will also bring this to the attention of the EventHubs team and create a wiki page about it.

How to troubleshoot whether it was indeed throttling (updated after @BennyM's blog post)

You should start by taking a look at your EventHubs dashboard in the Azure Portal. It should give you a statistical overview of how your EventHubs have been doing over the past hour.

Note to others

@BennyM seems to have solved this problem by figuring it out himself after he opened the issue. Take a look at this blog post by @BennyM, which talks more about the throughput units of EventHubs: http://blog.bennymichielsen.be/2015/08/11/scaling-an-azure-event-hub-throughput-units/

Further reading

I also suggest that visitors read about EventHubs performance in this blog post by @shanyu: http://blogs.msdn.com/b/shanyu/archive/2015/05/14/performance-tuning-for-hdinsight-storm-and-microsoft-azure-eventhubs.aspx

Leaving the issue open

I am leaving this issue open for now in case something can be improved in this regard. Options include:

  1. Better error messages from EventHubs/AMQP
  2. Perhaps a change to a protocol other than AMQP
  3. Detecting throttling on the client side, then failing and recovering gracefully
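
For option 3, a client-side approach could look roughly like the sketch below: retry the failing call with capped exponential backoff instead of letting the first failure kill the executor. This is a generic illustration, not what storm-eventhubs actually does. `BackoffRetry` and `Action` are hypothetical names, and the action stands in for whatever call opens the receiver.

```java
// Generic retry with capped exponential backoff. Hypothetical helper,
// not part of storm-eventhubs or the EventHubs client.
final class BackoffRetry {
    /** Stand-in for the call that may fail, e.g. opening a receiver. */
    interface Action<T> {
        T run() throws Exception;
    }

    /** Delay before retry number `attempt` (0-based): base doubled each time, capped at maxMillis. */
    static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        if (attempt < 0 || baseMillis <= 0 || maxMillis < baseMillis) {
            throw new IllegalArgumentException("invalid backoff parameters");
        }
        long delay = baseMillis;
        for (int i = 0; i < attempt && delay < maxMillis; i++) {
            // Jump straight to the cap when doubling would exceed it (also avoids overflow).
            delay = (delay > maxMillis / 2) ? maxMillis : delay * 2;
        }
        return Math.min(delay, maxMillis);
    }

    /** Runs the action up to maxAttempts times, sleeping between failed attempts. */
    static <T> T call(Action<T> action, int maxAttempts, long baseMillis, long maxMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return action.run();
            } catch (Exception e) {
                if (e instanceof InterruptedException) {
                    throw e; // do not retry when the thread is being shut down
                }
                last = e;
                if (attempt + 1 < maxAttempts) {
                    Thread.sleep(delayMillis(attempt, baseMillis, maxMillis));
                }
            }
        }
        throw last; // every attempt failed; surface the most recent error
    }
}
```

A caller would wrap the connection attempt, for example `BackoffRetry.call(() -> openReceiver(), 5, 500, 30_000)`, where `openReceiver` is a placeholder for the real call. Backing off matters here because retrying immediately against a throttled namespace only adds to the load that caused the throttling.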
ravitandonrt commented 8 years ago

If there are issues with connections, we recommend using the newer storm-eventhubs from the hdinsight repository for now. These changes will make it into Apache Storm as well.

A new client for azure-event-hubs is also available now, which should hopefully handle connectivity issues better. For now, I have opened this issue in storm-eventhubs to track that request: https://github.com/hdinsight/storm-eventhubs/issues/11