minghuaw / azeventhubs

Unofficial Azure Event Hubs SDK over AMQP 1.0 for rust
4 stars 2 forks source link

Error while recovering connection #4

Open ondrowan opened 7 months ago

ondrowan commented 7 months ago

I've recently started using this library and after a couple of days of consuming messages from Event Hub I've got hit by the following errors:

[azeventhubs::amqp::amqp_connection_scope] [ERROR] Error closing connection during recovering: IllegalState [azeventhubs::amqp::amqp_connection_scope] [ERROR] Error ending CBS session during recovering: IllegalState [azeventhubs::amqp::amqp_cbs_link] [ERROR] CBS authorization refresh failed: Local error: ExpectImmediateDetach [message_processor] [ERROR] Link closed by remote [message_processor] [ERROR] Link closed by remote [message_processor] [ERROR] Link closed by remote [azeventhubs::amqp::amqp_connection_scope] [ERROR] Error closing connection during recovering: IllegalState [azeventhubs::amqp::amqp_connection_scope] [ERROR] Error ending CBS session during recovering: IllegalState [message_processor] [ERROR] Link closed by remote [message_processor] [ERROR] Link closed by remote

I've looked into the code that seems to cause this and found these lines:

https://github.com/minghuaw/azeventhubs/blob/7b3e7f31f237492820b81a2a4971cfde18b09e7a/src/amqp/amqp_connection_scope.rs#L704-L717

https://github.com/minghuaw/azeventhubs/blob/7b3e7f31f237492820b81a2a4971cfde18b09e7a/src/amqp/amqp_connection_scope.rs#L667-L689

If the connection is closed, should the recover_connection function still try to close it anyway? There's a similar scenario with the CBS handle a couple of lines below the first snippet. It however doesn't seem like this is causing the error, it's just something weird I've found while investigating the problem.

Do you have any idea why this could be happening? Or is it perhaps something users should be handling manually?

minghuaw commented 7 months ago

Thank you for reporting this issue.

If the connection is closed, should the recover_connection function still try to close it anyway?

I agree, but I don't think this is the cause though. https://github.com/minghuaw/azeventhubs/blob/7b3e7f31f237492820b81a2a4971cfde18b09e7a/src/amqp/amqp_connection_scope.rs#L705 returns true when the underlying event loop for the AMQP connection has already ended, and try closing the connection anyway won't really do anything (logic-wise).

https://github.com/minghuaw/azeventhubs/blob/7b3e7f31f237492820b81a2a4971cfde18b09e7a/src/amqp/amqp_connection_scope.rs#L675-L677 only logs the error and won't really stops the recovery process.

By looking through the logs, it seems like the connection was indeed recovered. Otherwise, there won't be this line

[azeventhubs::amqp::amqp_connection_scope] [ERROR] Error ending CBS session during recovering: IllegalState

The problem seems to be that the remote (Azure Event Hubs) is actively closing the link, but from what I remember, it would be a different behavior if the auth failed. Let me set up some quick test and see if I can reproduce this issue

minghuaw commented 7 months ago

@ondrowan Was the consumer idling for a long time before you receive this error?

ondrowan commented 7 months ago

I’m pretty sure it was receiving at least 1 message per minute, but I can confirm tomorrow when I’m back in the office.

On Wed, 6 Dec 2023 at 22:39, minghuaw @.***> wrote:

@ondrowan https://github.com/ondrowan Was the consumer idling for a long time before you receive this error?

— Reply to this email directly, view it on GitHub https://github.com/minghuaw/azeventhubs/issues/4#issuecomment-1843732798, or unsubscribe https://github.com/notifications/unsubscribe-auth/AADHOYDGOWSRASSWBCJP2A3YIDQ2JAVCNFSM6AAAAABAJGJCFCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNBTG4ZTENZZHA . You are receiving this because you were mentioned.Message ID: @.***>

minghuaw commented 7 months ago

I think Event Hub actively shutdown producer client that has been idling for 30mins, but I don't remember it doing the same for consumer client. I will also try having an idling consumer and see if it gets closed by the Event Hub

ondrowan commented 7 months ago

From what I’ve seen, it happened randomly. Sometimes the connection dropped after 10 minutes, sometimes it worked for days while the consumer received at least 1 message a minute.

On Wed, 6 Dec 2023 at 23:34, minghuaw @.***> wrote:

I think Event Hub actively shutdown producer client that has been idling for 30mins, but I don't remember it doing the same for consumer client. I will also try having an idling consumer and see if it gets closed by the Event Hub

— Reply to this email directly, view it on GitHub https://github.com/minghuaw/azeventhubs/issues/4#issuecomment-1843796862, or unsubscribe https://github.com/notifications/unsubscribe-auth/AADHOYFPF7SX4JCBCPACT5DYIDXI7AVCNFSM6AAAAABAJGJCFCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNBTG44TMOBWGI . You are receiving this because you were mentioned.Message ID: @.***>

minghuaw commented 7 months ago

Would you mind sharing your Event Hub config so that I could try to replicate this behavior? You can send this to my email if you don't want that information public

minghuaw commented 7 months ago

BTW, I have just tested that failing the auth should give a different behavior

ondrowan commented 7 months ago

Here is the full log from today's crash: https://gist.github.com/ondrowan/880319e9592ccf1c6855ac99c5b307d1

The messages stopped being received around 22:19:54 while they were perodically queued into Event Hub from IoT Hub all this time. Restart then fetched hundreds of messages that were queued since the last received message's offset.

I'm not sure what settings are relevant, so I'll try to summarize our usage:

And this part might be actually important. A single binary is used to read both partitions with code that looks approximately like this:

// This is called from main
async fn process_messages() -> anyhow::Result<()> {
    let mut consumer_client = EventHubConsumerClient::new_from_connection_string(
        EventHubConsumerClient::DEFAULT_CONSUMER_GROUP_NAME,
        &config_vars.event_hub_connection_string,
        config_vars.event_hub_name.clone(),
        EventHubConsumerClientOptions::default(),
    )
    .await?;
    let partition_ids = consumer_client.get_partition_ids().await?;

    let mut tokio_handles = vec![];

    for partition_id in partition_ids {
        let tokio_handle = tokio::spawn(process_partition_messages(
            partition_id,
            event_hub_connection_string.clone(),
            event_hub_name.clone(),
            ...
        ));

        tokio_handles.push(tokio_handle);
    }

    for tokio_handle in tokio_handles {
        tokio_handle.await??;
    }

    Ok(consumer_client.close().await?)
}

async fn process_partition_messages(...) -> anyhow::Result<()> {
    let mut consumer_client = EventHubConsumerClient::new_from_connection_string(
        EventHubConsumerClient::DEFAULT_CONSUMER_GROUP_NAME,
        event_hub_connection_string,
        event_hub_name,
        EventHubConsumerClientOptions::default(),
    )
    .await?;

    log::info!(
        "Partition {partition_id} will start fetching messages from offset {}.",
        checkpoint.offset_
    );

    let options = ReadEventOptions::default();
    let mut stream = consumer_client
        .read_events_from_partition(&partition_id, starting_position, options)
        .await?;

    while let Some(message) = stream.next().await {
        ...
    }
}

I didn't realize I use a separate client to read the partition ids and then separate clients in each of the tasks. Could the former be the problem since it's used just for returning the partition ids and not reading of any messages?

minghuaw commented 7 months ago

Thanks for the information.

And this part might be actually important. A single binary is used to read both partitions with code that looks approximately like this:

This should be fine.

I didn't realize I use a separate client to read the partition ids and then separate clients in each of the tasks.

I think this should be fine as well. However, frankly speaking, I have not tested such scenario.

Let me look into it (might take me some time).

Meanwhile, if you want to have more custom control, you may consider using the underlying AMQP 1.0 crate directly (this potentially could give more detail on which part of the AMQP protocol was causing the problem). I have an example here (https://github.com/minghuaw/fe2o3-amqp/blob/main/examples/event_hubs/src/bin/simple_receiver.rs)

minghuaw commented 7 months ago

I have another "guess" after looking through the log. The line below is probably more interesting and could indicate a bug in the upstream crate.

[2023-12-06 20:29:55] [azeventhubs::amqp::amqp_cbs_link] [ERROR] CBS authorization refresh failed: Local error: ExpectImmediateDetach

The recovery error starts to occur roughly one hour after this error with CBS auth.

[2023-12-06 21:35:18] [azeventhubs::amqp::amqp_connection_scope] [ERROR] Error closing connection during recovering: IllegalState

I vaguely remember that either AMQP 1.0 spec or Azure Event Hub suggests that the period for refreshing the token should be half of the actual timeout to avoid spurious timeouts. And the default period for refreshing the token is 30 mins, so the actual timeout for auth would be one hour.

The CBS auth error mentioned above would mean the token refresher won't be inserted back to the delay queue, and the token won't be refreshed ever after. This would explain why it starts to fail after roughly one hour.

The recovery should technically redo the auth though, which should avoid such a problem. I guess I will have to take a closer look.

minghuaw commented 7 months ago

@ondrowan Would you mind testing with an official SDK like the dotnet SDK? I'm just curious if the server would close the link with some useful error message

ondrowan commented 7 months ago

I have another "guess" after looking through the log. The line below is probably more interesting and could indicate a bug in the upstream crate.

[2023-12-06 20:29:55] [azeventhubs::amqp::amqp_cbs_link] [ERROR] CBS authorization refresh failed: Local error: ExpectImmediateDetach

The recovery error starts to occur roughly one hour after this error with CBS auth.

[2023-12-06 21:35:18] [azeventhubs::amqp::amqp_connection_scope] [ERROR] Error closing connection during recovering: IllegalState

I vaguely remember that either AMQP 1.0 spec or Azure Event Hub suggests that the period for refreshing the token should be half of the actual timeout to avoid spurious timeouts. And the default period for refreshing the token is 30 mins, so the actual timeout for auth would be one hour.

The CBS auth error mentioned above would mean the token refresher won't be inserted back to the delay queue, and the token won't be refreshed ever after. This would explain why it starts to fail after roughly one hour.

The recovery should technically redo the auth though, which should avoid such a problem. I guess I will have to take a closer look.

I looked at the previous logs and it doesn't seem like this is always the case though. I cannot really see any pattern in it. The gap between these 2 errors isn't always ~1 hour. And sometimes these errors start 10 min after startup, sometimes after days or hours. E.g. today I restarted the service at 1:30AM and 15 hours later it works without problems. Yesterday I had to restart it a couple of times, sometimes after 10 minutes of running, sometimes after an hour.

What I forgot to mention previously is that sometimes the error that is spamming over and over again is "Link closed by remote" and sometimes "Session has dropped". This seems to be random as well and I'm not sure what it could mean.

@ondrowan Would you mind testing with an official SDK like the dotnet SDK? I'm just curious if the server would close the link with some useful error message

I can try to write some test code using Python SDK, but as I said, this doesn't seem to be easy to reproduce.

minghuaw commented 7 months ago

"Link closed by remote" and sometimes "Session has dropped".

Since Azure Event Hubs usually creates only one link per session, these two errors are likely caused by the same thing, the remote somehow chooses to closes the link and the associated session is closed together.

I have not been able to reproduce this behavior on my own machine unfortunately. Would it be possible to record some log with a log level of Debug the next time you have to restart the service? I will keep trying to reproduce the problem in the meantime.

ondrowan commented 7 months ago

I've set it up and will let you know if anything relevant pops up overnight.

ondrowan commented 7 months ago

Should I enable some feature flags for all logs to be available or does it happen implicitly when using this crate? I noticed fe2o3-amqp enables some debug macros with log feature.

minghuaw commented 7 months ago

Should I enable some feature flags for all logs to be available or does it happen implicitly when using this crate? I noticed fe2o3-amqp enables some debug macros with log feature.

I don't think this is necessary. azeventhubs already uses the log crate, and this seems also enable the log feature in fe2o3-amqp (which is simply an optional dependency)

ondrowan commented 6 months ago

It's been almost a day and so far there are no errors in logs. I'm starting to think it's been some hiccup on Azure's side. Let's see what happens over the weekend.

Stupremee commented 6 months ago

Hey @ondrowan and @minghuaw

I'm currently facing the same issue, and it does not seem to be fixed, as I encountered it today. Over the last two weeks we've been using your crate. Unfortunaly, we see these kinds of errors often. As context, we are using IoT Hub and I'm using this crate to connect to it.

Additionally, our workflow is the following, we start the backend service, which connects to the Event Hub endpoint of IoT Hub, but does not subscribe to any messages. Then, when we perform certain API calls, a subscription stream will be started. So it may happen that the connection is in idle for many hours, before doing it's first subscription

This is what our log looks like. image

You can ignore the INFO in between there. This is just our own IoT Hub API Client that renews it's SAS. The latter ERROR is logged when trying to start a consumer stream.

PS: I just started our service using debug logs, so I can share more logs when it occurrs again!

minghuaw commented 6 months ago

Hi @Stupremee thanks for reporting the issue. I think there might be some tiny differences in how an Event Hubs client vs. IoTHub client should be configured, which unfortunately i haven't looked into yet. I will take a look at their dotnet SDK later today and get back to you if i find anything.

minghuaw commented 6 months ago

@ondrowan One thing, for the first consumer that you opened just to get the partition_ids, it could be closed soon after you have obtained the partition id. Otherwise Azure will close it after it being idle for more than 5 mins. However, this should not affect other consumers.

I am still investigating but I have not been able to reproduce the problem yet. I will keep you folks updated

minghuaw commented 6 months ago

BTW, you have the option to read from all partitions on a single consumer in case you don't know. https://docs.rs/azeventhubs/latest/azeventhubs/consumer/struct.EventHubConsumerClient.html#method.read_events

minghuaw commented 6 months ago

@Stupremee do you happen to be using a North Europe instance?

Stupremee commented 6 months ago

@Stupremee do you happen to be using a North Europe instance?

No, the location is Germany West Central.

I don't know if this helps you, but that's the log with debug level. image

ondrowan commented 6 months ago

BTW, you have the option to read from all partitions on a single consumer in case you don't know. https://docs.rs/azeventhubs/latest/azeventhubs/consumer/struct.EventHubConsumerClient.html#method.read_events

I save the last processed offset for each partition into database and then load it in case the consumer is restarted. It's not possible to set starting position when using this method, so I couldn't use it.

minghuaw commented 6 months ago

@Stupremee do you happen to be using a North Europe instance?

No, the location is Germany West Central.

I don't know if this helps you, but that's the log with debug level. image

It seems like you are already closing the event stream. However, closing the stream should not yield those errors. You may see something like IllegalConnectionState when you close the consumer client.

I have been trying the replicate the issue on a Canada Central instance, but I have not been unable to. I will try either a North Europe or a Germany West Central instance and see if I can get the same behavior

minghuaw commented 6 months ago

Additionally, our workflow is the following, we start the backend service, which connects to the Event Hub endpoint of IoT Hub, but does not subscribe to any messages. Then, when we perform certain API calls, a subscription stream will be started. So it may happen that the connection is in idle for many hours, before doing it's first subscription

@Stupremee One possibility is that the Event Hub endpoint for IoT Hub only allows a max idling time of 5 mins, so you may want to consider closing the idling client and then starting up a new client for subscription. I will need to test such scenario and see if I get the same error

Stupremee commented 6 months ago

Additionally, our workflow is the following, we start the backend service, which connects to the Event Hub endpoint of IoT Hub, but does not subscribe to any messages. Then, when we perform certain API calls, a subscription stream will be started. So it may happen that the connection is in idle for many hours, before doing it's first subscription

@Stupremee One possibility is that the Event Hub endpoint for IoT Hub only allows a max idling time of 5 mins, so you may want to consider closing the idling client and then starting up a new client for subscription. I will need to test such scenario and see if I get the same error

Alright, I will change to always open a new connection. This should be fine for me too.

ondrowan commented 6 months ago

@Stupremee do you happen to be using a North Europe instance?

No, the location is Germany West Central. I don't know if this helps you, but that's the log with debug level. image

It seems like you are already closing the event stream. However, closing the stream should not yield those errors. You may see something like IllegalConnectionState when you close the consumer client.

I have been trying the replicate the issue on a Canada Central instance, but I have not been unable to. I will try either a North Europe or a Germany West Central instance and see if I can get the same behavior

The problem is it seems to be really hard to replicate. At first it kept happening every couple of hours, then when I finally added some more logging it hasn't happened for 2 days and now it's been working correctly for 3 days straight. I have no idea what could've changed on my side or what could be triggering it.

At this point, as a hotfix, I'd like to at least know how to find out when it has happened and reconnect.

minghuaw commented 6 months ago

At this point, as a hotfix, I'd like to at least know how to find out when it has happened and reconnect.

@ondrowan I think one temporary workaround could be setting max_retries to just zero (it defaults to 3, https://docs.rs/azeventhubs/latest/azeventhubs/struct.EventHubsRetryOptions.html#structfield.max_retries), so that it simply fails fast and does not attempt to recover. The you can manually capture that error and start up a new client.

The retry option can be configured in the EventHubConsumerClientOptions.

minghuaw commented 6 months ago

I think I have found the problem with @Stupremee 's use case. The problem was that the AMQP management link, unlike the regular sender/receiver link, cannot be resumed on a new AMQP session (but this somehow works fine with the regular Event Hub). Given that the AMQP management extension is still a working draft, such behavior is somewhat understandable. The fix is simple, instead of trying to resume the management link on a new session, we will just open new session and new management link during the recovery.

This problem could influence @ondrowan 's use case during the recovery as well. But since I don't really have a deterministic way to reproduce that error. I cannot be sure that it is caused by the same problem.

What versions of the crates are you @ondrowan @Stupremee working with? I will definitely include this fix in the upcoming "0.18.0" release, but if you have to stick to "0.17" for some reason, I could do a 0.17.1 for this fix.

Stupremee commented 6 months ago

What versions of the crates are you @ondrowan @Stupremee working with? I will definitely include this fix in the upcoming "0.18.0" release, but if you have to stick to "0.17" for some reason, I could do a 0.17.1 for this fix.

I'm using 0.17 right now, but I have to no problem upgrading. You can tell me when you have the fix pushed, then I can test it using the Git Version.

Also thanks very much for fixing it so fast! (hopefully :) )

ondrowan commented 6 months ago

I'm using 0.17 as well and have no problem upgrading to 0.18.

minghuaw commented 6 months ago

I have just pushed 0.17.1, which already includes the fix and you can have a try. And I am going to push 0.18.0 in maybe an hour. I will probably make a 0.17.2 before releasing 0.18 though. Right now, read_events uses the recovery mechanism but read_events_from_partition does not. I want to make them consistent.

minghuaw commented 6 months ago

Please use "0.18.0". I have yanked "0.17.1" and "0.17.2" because I mistakenly used azore_core = "0.18" in those two releases, and this conflicts with the versioning.

ondrowan commented 6 months ago

I just upgraded and it crashed right away after maybe 1 hour of running. I got the debug log level enabled and this is being spammed over and over again:

[2023-12-14 20:46:01] [azeventhubs::amqp::amqp_consumer::sin gle] [ERROR] Failed to recover client: SenderResume(AttachError(RemoteClosedWithError(Error { condition: AmqpEr ror(NotAllowed), description: Some("A link to connection '40123453' $management node has already been opened.") , info: None })))

minghuaw commented 6 months ago

Interesting, I have never received this error. The error message is a lot more useful this time.

minghuaw commented 6 months ago

@ondrowan I have a released a quick patch "0.18.1", and the changes can be found in PR #24. Do you mind sharing a full log? It seems like the client is trying to recover while the connection was not interrupted. I wonder what was the error that caused the retry.

minghuaw commented 6 months ago

Because AMQP management is still working draft, there is no defined behavior for these entities. So the current fix may fail if Azure changes how it handles management links. https://github.com/minghuaw/azeventhubs/issues/24#issuecomment-1857087490

ondrowan commented 6 months ago

@ondrowan I have a released a quick patch "0.18.1", and the changes can be found in PR #24. Do you mind sharing a full log? It seems like the client is trying to recover while the connection was not interrupted. I wonder what was the error that caused the retry.

Unfortunatelly, there's not much in the logs. That's why I haven't pasted anything before.

[2023-12-14 20:44:58] [message_processor] [INFO] 5 messages were processed in the past minute. [2023-12-14 20:45:58] [message_processor] [INFO] 4 messages were processed in the past minute. [2023-12-14 20:46:01] [azeventhubs::amqp::amqp_consumer::single] [ERROR] Failed to recover client: SenderResume(AttachError(RemoteClosedWithError(Error { condition: AmqpError(NotAllowed), description: Some("A link to connection '40123453' $management node has already been opened."), info: None }))) [2023-12-14 20:46:05] [azeventhubs::amqp::amqp_consumer::single] [ERROR] Failed to recover client: SenderResume(AttachError(RemoteClosedWithError(Error { condition: AmqpError(NotAllowed), description: Some("A link to connection '40123453' $management node has already been opened."), info: None }))) [2023-12-14 20:46:11] [azeventhubs::amqp::amqp_consumer::single] [ERROR] Failed to recover client: SenderResume(AttachError(RemoteClosedWithError(Error { condition: AmqpError(NotAllowed), description: Some("A link to connection '40123453' $management node has already been opened."), info: None }))) [2023-12-14 20:46:11] [message_processor] [ERROR] Error while reading the message from partition 1. Caused by: 0: Link closed by remote 1: Link closed by remote [2023-12-14 20:46:13] [azeventhubs::amqp::amqp_consumer::single] [ERROR] Failed to recover client: SenderResume(AttachError(RemoteClosedWithError(Error { condition: AmqpError(NotAllowed), description: Some("A link to connection '40123453' $management node has already been opened."), info: None }))) [2023-12-14 20:46:16] [azeventhubs::amqp::amqp_consumer::single] [ERROR] Failed to recover client: SenderResume(AttachError(RemoteClosedWithError(Error { condition: AmqpError(NotAllowed), description: Some("A link to connection '40123453' $management node has already been opened."), info: None }))) [2023-12-14 20:46:23] [azeventhubs::amqp::amqp_consumer::single] [ERROR] Failed to recover client: SenderResume(AttachError(RemoteClosedWithError(Error { condition: AmqpError(NotAllowed), description: Some("A link to connection '40123453' $management node has already been opened."), info: None }))) [2023-12-14 20:46:23] [message_processor] [ERROR] Error while reading the message from partition 1. Caused by: 0: Link closed by remote 1: Link closed by remote [2023-12-14 20:46:25] [azeventhubs::amqp::amqp_consumer::single] [ERROR] Failed to recover client: SenderResume(AttachError(RemoteClosedWithError(Error { condition: AmqpError(NotAllowed), description: Some("A link to connection '40123453' $management node has already been opened."), info: None }))) [2023-12-14 20:46:28] [azeventhubs::amqp::amqp_consumer::single] [ERROR] Failed to recover client: SenderResume(AttachError(RemoteClosedWithError(Error { condition: AmqpError(NotAllowed), description: Some("A link to connection '40123453' $management node has already been opened."), info: None }))) [2023-12-14 20:46:35] [azeventhubs::amqp::amqp_consumer::single] [ERROR] Failed to recover client: SenderResume(AttachError(RemoteClosedWithError(Error { condition: AmqpError(NotAllowed), description: Some("A link to connection '40123453' $management node has already been opened."), info: None }))) [2023-12-14 20:46:35] [message_processor] [ERROR] Error while reading the message from partition 1. Caused by: 0: Link closed by remote 1: Link closed by remote [2023-12-14 20:46:36] [azeventhubs::amqp::amqp_consumer::single] [ERROR] Failed to recover client: SenderResume(AttachError(RemoteClosedWithError(Error { condition: AmqpError(NotAllowed), description: Some("A link to connection '40123453' $management node has already been opened."), info: None }))) [2023-12-14 20:46:40] [azeventhubs::amqp::amqp_consumer::single] [ERROR] Failed to recover client: SenderResume(AttachError(RemoteClosedWithError(Error { condition: AmqpError(NotAllowed), description: Some("A link to connection '40123453' $management node has already been opened."), info: None }))) [2023-12-14 20:46:46] [azeventhubs::amqp::amqp_consumer::single] [ERROR] Failed to recover client: SenderResume(AttachError(RemoteClosedWithError(Error { condition: AmqpError(NotAllowed), description: Some("A link to connection '40123453' $management node has already been opened."), info: None }))) [2023-12-14 20:46:46] [message_processor] [ERROR] Error while reading the message from partition 1. Caused by: 0: Link closed by remote 1: Link closed by remote [2023-12-14 20:46:48] [azeventhubs::amqp::amqp_consumer::single] [ERROR] Failed to recover client: SenderResume(AttachError(RemoteClosedWithError(Error { condition: AmqpError(NotAllowed), description: Some("A link to connection '40123453' $management node has already been opened."), info: None }))) [2023-12-14 20:46:52] [azeventhubs::amqp::amqp_consumer::single] [ERROR] Failed to recover client: SenderResume(AttachError(RemoteClosedWithError(Error { condition: AmqpError(NotAllowed), description: Some("A link to connection '40123453' $management node has already been opened."), info: None }))) [2023-12-14 20:46:58] [azeventhubs::amqp::amqp_consumer::single] [ERROR] Failed to recover client: SenderResume(AttachError(RemoteClosedWithError(Error { condition: AmqpError(NotAllowed), description: Some("A link to connection '40123453' $management node has already been opened."), info: None })))

There's one additional thing I have noticed. I have 2 different instances of a service that's polling 2 different Event Hub instances. But these errors happen approximately at the same time, there's always just a couple of minutes of difference from what I've seen in the logs. E.g. Instance 1 was running with 0.17.0 since last Saturday without any problems. The same was true for instance 2 until I upgraded it to 0.18.0 and. Then after a couple of hours both of the instances errored out with just 5 minutes of difference. Could that mean that some Event Hub deployments were happening on Azure side and that's why the management connection was severed in the first place?

minghuaw commented 6 months ago

Somehow it looks like Azure is closing the receiver link but not the connection. I should probably add more debug logging to the receiver (probably in 0.18.2)

minghuaw commented 6 months ago

I have added a bit more debug logs in "0.18.2", and the management link will attempt recovery anyway even if the underlying session is still open (there is no guarantee that the session and the link will be closed altogether). The full changes can be found in PR #25

ondrowan commented 6 months ago

I've updated to 0.18.2 and will see what happens.

Meanwhile here's a more detailed log from today's crash (with 0.18.1):

[2023-12-16 09:08:07] [azeventhubs::amqp::amqp_client] [DEBUG] Recovering client [2023-12-16 09:08:07] [azeventhubs::amqp::amqp_management_link] [DEBUG] Recovering management link [2023-12-16 09:08:07] [azeventhubs::amqp::amqp_management_link] [DEBUG] Management session is not ended, skipping recovery [2023-12-16 09:08:07] [azeventhubs::amqp::amqp_client] [DEBUG] Client recovered [2023-12-16 09:08:10] [azeventhubs::amqp::amqp_client] [DEBUG] Recovering client [2023-12-16 09:08:10] [azeventhubs::amqp::amqp_management_link] [DEBUG] Recovering management link [2023-12-16 09:08:10] [azeventhubs::amqp::amqp_management_link] [DEBUG] Management session is not ended, skipping recovery [2023-12-16 09:08:10] [azeventhubs::amqp::amqp_client] [DEBUG] Client recovered [2023-12-16 09:08:17] [azeventhubs::amqp::amqp_client] [DEBUG] Recovering client [2023-12-16 09:08:17] [azeventhubs::amqp::amqp_management_link] [DEBUG] Recovering management link [2023-12-16 09:08:17] [azeventhubs::amqp::amqp_management_link] [DEBUG] Management session is not ended, skipping recovery [2023-12-16 09:08:17] [azeventhubs::amqp::amqp_client] [DEBUG] Client recovered [2023-12-16 09:08:17] [message_processor] [ERROR] Error while reading the message from partition 1. Caused by: 0: Link closed by remote 1: Link closed by remote [2023-12-16 09:08:18] [azeventhubs::amqp::amqp_client] [DEBUG] Recovering client [2023-12-16 09:08:18] [azeventhubs::amqp::amqp_management_link] [DEBUG] Recovering management link [2023-12-16 09:08:18] [azeventhubs::amqp::amqp_management_link] [DEBUG] Management session is not ended, skipping recovery [2023-12-16 09:08:18] [azeventhubs::amqp::amqp_client] [DEBUG] Client recovered [2023-12-16 09:08:22] [azeventhubs::amqp::amqp_client] [DEBUG] Recovering client [2023-12-16 09:08:22] [azeventhubs::amqp::amqp_management_link] [DEBUG] Recovering management link [2023-12-16 09:08:22] [azeventhubs::amqp::amqp_management_link] [DEBUG] Management session is not ended, skipping recovery [2023-12-16 09:08:22] [azeventhubs::amqp::amqp_client] [DEBUG] Client recovered [2023-12-16 09:08:28] [azeventhubs::amqp::amqp_client] [DEBUG] Recovering client [2023-12-16 09:08:28] [azeventhubs::amqp::amqp_management_link] [DEBUG] Recovering management link [2023-12-16 09:08:28] [azeventhubs::amqp::amqp_management_link] [DEBUG] Management session is not ended, skipping recovery [2023-12-16 09:08:28] [azeventhubs::amqp::amqp_client] [DEBUG] Client recovered [2023-12-16 09:08:28] [message_processor] [ERROR] Error while reading the message from partition 1. Caused by: 0: Link closed by remote 1: Link closed by remote

...

[2023-12-16 09:13:02] [azeventhubs::amqp::amqp_management_link] [DEBUG] Recovering management link [2023-12-16 09:13:02] [azeventhubs::amqp::amqp_management_link] [DEBUG] Management session is not ended, skipping recovery [2023-12-16 09:13:02] [azeventhubs::amqp::amqp_client] [DEBUG] Client recovered [2023-12-16 09:13:05] [azeventhubs::amqp::amqp_client] [DEBUG] Recovering client [2023-12-16 09:13:05] [azeventhubs::amqp::amqp_connection_scope] [DEBUG] Recovering connection [2023-12-16 09:13:05] [azeventhubs::amqp::amqp_connection_scope] [ERROR] Error closing connection during recovering: RemoteClosedWithError(Error { condition: ConnectionError(ConnectionForced), description: Some("The connection was inactive for more than the allowed 300000 milliseconds and is closed by container 'LinkTracker'. TrackingId:b86740c4462e4c609ea736231ba50b10_G12, SystemTracker:gateway5, Timestamp:2023-12-16T09:13:05"), info: None }) [2023-12-16 09:13:06] [azeventhubs::amqp::amqp_connection_scope] [ERROR] Error ending CBS session during recovering: IllegalConnectionState [2023-12-16 09:13:06] [azeventhubs::amqp::amqp_management_link] [DEBUG] Recovering management link [2023-12-16 09:13:06] [azeventhubs::amqp::amqp_management_link] [ERROR] Found error closing old management client during recovery: IllegalSessionState [2023-12-16 09:13:06] [azeventhubs::amqp::amqp_management_link] [ERROR] Found error closing old management session during recovery: IllegalConnectionState [2023-12-16 09:13:06] [azeventhubs::amqp::amqp_management_link] [DEBUG] Management link recovered [2023-12-16 09:13:06] [azeventhubs::amqp::amqp_client] [DEBUG] Client recovered [2023-12-16 09:13:12] [azeventhubs::amqp::amqp_client] [DEBUG] Recovering client [2023-12-16 09:13:12] [azeventhubs::amqp::amqp_management_link] [DEBUG] Recovering management link [2023-12-16 09:13:12] [azeventhubs::amqp::amqp_management_link] [DEBUG] Management session is not ended, skipping recovery [2023-12-16 09:13:12] [azeventhubs::amqp::amqp_client] [DEBUG] Client recovered [2023-12-16 09:13:12] [message_processor] [ERROR] Error while reading the message from partition 1. Caused by: 0: Link closed by remote 1: Link closed by remote

From what I've seen in the logs, only processing of messages from partition 1 crashed. Messages from partition 0 kept being processed. The log from 09:13:05 indicates that perhaps there weren't any messages received from a partition (or the management link?), but from what I've checked there's at least 1 message enqueued to each partition at least once a minute.

ondrowan commented 6 months ago

I looked into previous logs a bit more and found also this piece that might be helpful:

[2023-12-16 09:28:38] [azeventhubs::amqp::amqp_client] [DEBUG] Recovering client [2023-12-16 09:28:38] [azeventhubs::amqp::amqp_management_link] [DEBUG] Recovering management link [2023-12-16 09:28:38] [azeventhubs::amqp::amqp_management_link] [DEBUG] Management session is not ended, skipping recovery [2023-12-16 09:28:38] [azeventhubs::amqp::amqp_client] [DEBUG] Client recovered [2023-12-16 09:28:42] [azeventhubs::amqp::amqp_client] [DEBUG] Recovering client [2023-12-16 09:28:42] [azeventhubs::amqp::amqp_connection_scope] [DEBUG] Recovering connection [2023-12-16 09:28:42] [azeventhubs::amqp::amqp_connection_scope] [ERROR] Error closing connection during recovering: RemoteClosedWithError(Error { condition: ConnectionError(ConnectionForced), description: Some("The connection was inactive for more than the allowed 60000 milliseconds and is closed by container 'LinkTracker'. TrackingId:9e105933ec0148f0b187d6f05ad4c555_G19, SystemTracker:gateway5, Timestamp:2023-12-16T09:28:40"), info: None }) [2023-12-16 09:28:42] [azeventhubs::amqp::amqp_connection_scope] [ERROR] Error ending CBS session during recovering: IllegalConnectionState [2023-12-16 09:28:42] [azeventhubs::amqp::amqp_management_link] [DEBUG] Recovering management link [2023-12-16 09:28:42] [azeventhubs::amqp::amqp_management_link] [ERROR] Found error closing old management client during recovery: IllegalSessionState [2023-12-16 09:28:42] [azeventhubs::amqp::amqp_management_link] [ERROR] Found error closing old management session during recovery: IllegalConnectionState [2023-12-16 09:28:42] [azeventhubs::amqp::amqp_management_link] [DEBUG] Management link recovered [2023-12-16 09:28:42] [azeventhubs::amqp::amqp_client] [DEBUG] Client recovered [2023-12-16 09:28:49] [azeventhubs::amqp::amqp_client] [DEBUG] Recovering client [2023-12-16 09:28:49] [azeventhubs::amqp::amqp_management_link] [DEBUG] Recovering management link [2023-12-16 09:28:49] [azeventhubs::amqp::amqp_management_link] [DEBUG] Management session is not ended, skipping recovery [2023-12-16 09:28:49] [azeventhubs::amqp::amqp_client] [DEBUG] Client recovered [2023-12-16 09:28:49] [message_processor] [ERROR] Error while reading the message from partition 1. Caused by: 0: Link closed by remote 1: Link closed by remote

ondrowan commented 6 months ago

Here are logs with 0.18.2:

[2023-12-16 14:49:26] [message_processor] [INFO] 6 messages were processed in the past minute. [2023-12-16 14:50:26] [message_processor] [INFO] 8 messages were processed in the past minute. [2023-12-16 14:51:26] [message_processor] [INFO] 5 messages were processed in the past minute. [2023-12-16 14:52:26] [message_processor] [INFO] 7 messages were processed in the past minute. [2023-12-16 14:52:50] [azeventhubs::amqp::amqp_consumer::single] [DEBUG] Failed to receive event: Receive(LinkStateError(RemoteClosed)) [2023-12-16 14:52:52] [azeventhubs::amqp::amqp_client] [DEBUG] Recovering client [2023-12-16 14:52:52] [azeventhubs::amqp::amqp_management_link] [DEBUG] Recovering management link [2023-12-16 14:52:52] [azeventhubs::amqp::amqp_management_link] [DEBUG] Management link session is still open, performing recovery anyway [2023-12-16 14:52:52] [azeventhubs::amqp::amqp_management_link] [DEBUG] Management link recovered [2023-12-16 14:52:52] [azeventhubs::amqp::amqp_client] [DEBUG] Client recovered [2023-12-16 14:52:52] [azeventhubs::amqp::amqp_consumer::single] [DEBUG] Failed to receive event: LinkDetach(ClosedByRemote) [2023-12-16 14:52:55] [azeventhubs::amqp::amqp_client] [DEBUG] Recovering client [2023-12-16 14:52:55] [azeventhubs::amqp::amqp_management_link] [DEBUG] Recovering management link [2023-12-16 14:52:55] [azeventhubs::amqp::amqp_management_link] [DEBUG] Management link session is still open, performing recovery anyway [2023-12-16 14:52:55] [azeventhubs::amqp::amqp_management_link] [DEBUG] Management link recovered [2023-12-16 14:52:55] [azeventhubs::amqp::amqp_client] [DEBUG] Client recovered [2023-12-16 14:52:55] [azeventhubs::amqp::amqp_consumer::single] [DEBUG] Failed to receive event: LinkDetach(ClosedByRemote) [2023-12-16 14:53:02] [azeventhubs::amqp::amqp_client] [DEBUG] Recovering client [2023-12-16 14:53:02] [azeventhubs::amqp::amqp_management_link] [DEBUG] Recovering management link [2023-12-16 14:53:02] [azeventhubs::amqp::amqp_management_link] [DEBUG] Management link session is still open, performing recovery anyway [2023-12-16 14:53:02] [azeventhubs::amqp::amqp_management_link] [DEBUG] Management link recovered [2023-12-16 14:53:02] [azeventhubs::amqp::amqp_client] [DEBUG] Client recovered [2023-12-16 14:53:02] [azeventhubs::amqp::amqp_consumer::single] [DEBUG] Failed to receive event: LinkDetach(ClosedByRemote) [2023-12-16 14:53:02] [message_processor] [ERROR] Error while reading the message from partition 1. Caused by: 0: Link closed by remote 1: Link closed by remote [2023-12-16 14:53:02] [azeventhubs::amqp::amqp_consumer::single] [DEBUG] Failed to receive event: Receive(LinkStateError(IllegalSessionState)) [2023-12-16 14:53:04] [azeventhubs::amqp::amqp_client] [DEBUG] Recovering client [2023-12-16 14:53:04] [azeventhubs::amqp::amqp_management_link] [DEBUG] Recovering management link [2023-12-16 14:53:04] [azeventhubs::amqp::amqp_management_link] [DEBUG] Management link session is still open, performing recovery anyway [2023-12-16 14:53:04] [azeventhubs::amqp::amqp_management_link] [DEBUG] Management link recovered [2023-12-16 14:53:04] [azeventhubs::amqp::amqp_client] [DEBUG] Client recovered [2023-12-16 14:53:04] [azeventhubs::amqp::amqp_consumer::single] [DEBUG] Failed to receive event: LinkDetach(ClosedByRemote) [2023-12-16 14:53:07] [azeventhubs::amqp::amqp_client] [DEBUG] Recovering client [2023-12-16 14:53:07] [azeventhubs::amqp::amqp_management_link] [DEBUG] Recovering management link [2023-12-16 14:53:07] [azeventhubs::amqp::amqp_management_link] [DEBUG] Management link session is still open, performing recovery anyway [2023-12-16 14:53:07] [azeventhubs::amqp::amqp_management_link] [DEBUG] Management link recovered [2023-12-16 14:53:07] [azeventhubs::amqp::amqp_client] [DEBUG] Client recovered [2023-12-16 14:53:07] [azeventhubs::amqp::amqp_consumer::single] [DEBUG] Failed to receive event: LinkDetach(ClosedByRemote) [2023-12-16 14:53:14] [azeventhubs::amqp::amqp_client] [DEBUG] Recovering client [2023-12-16 14:53:14] [azeventhubs::amqp::amqp_management_link] [DEBUG] Recovering management link [2023-12-16 14:53:14] [azeventhubs::amqp::amqp_management_link] [DEBUG] Management link session is still open, performing recovery anyway [2023-12-16 14:53:14] [azeventhubs::amqp::amqp_management_link] [DEBUG] Management link recovered [2023-12-16 14:53:14] [azeventhubs::amqp::amqp_client] [DEBUG] Client recovered

minghuaw commented 6 months ago

It looks like this line is the error that caused the retry mechanism.

[2023-12-16 14:52:50] [azeventhubs::amqp::amqp_consumer::single] [DEBUG] Failed to receive event: Receive(LinkStateError(RemoteClosed))

Azure closed the receiver link for no reason, and the line below indicates that the connection is kept open though (unlike in the case of idling where the connection, sessions and links are closed altogether with an error).

[2023-12-16 14:52:52] [azeventhubs::amqp::amqp_management_link] [DEBUG] Management link session is still open, performing recovery anyway

I am not sure how Azure behaves when an instance runs out of its daily limit. Let me test this out

minghuaw commented 6 months ago

I am not sure how Azure behaves when an instance runs out of its daily limit. Let me test this out

My test receiver doesn't get disconnected even when the daily limit for the test IoT Hub instance is exceeded.

ondrowan commented 6 months ago

I am not sure how Azure behaves when an instance runs out of its daily limit. Let me test this out

I'm using the paid tier and it's nearly impossible to go over the limit with this setup anyway.

minghuaw commented 6 months ago

Do you get the same error when you use any of the official SDK? I haven't been keeping track of the changes to the official SDKs.

minghuaw commented 6 months ago

I'll open an issue on their official SDK for dotnet repo and see if they have any thoughts. They have been quite helpful when I started working on this crate

minghuaw commented 6 months ago

@ondrowan Were you able to receive message if you manually starting up a completely new consumer client after all the retrying fails? The current problem, after looking thru the latest logs, look quite similar to the problem with management link that we mentioned before.

The problem was that the AMQP management link, unlike the regular sender/receiver link, cannot be resumed on a new AMQP session (but this somehow works fine with the regular Event Hub).

The error ClosedByRemote also suggests that the link cannot be resumed (cuz Azure closed the link by sending a closing Detach). This is one of my oversight. I will work on a patch for this.