Intermittent flakiness of v7.0 RC

danielmarbach commented 1 month ago

Describe the bug

It is still a bit early to say what actually causes these things to happen and might very well be that the problem ins somewhere in our implementation. But we (@bording and I) figured to give a heads-up in case it ends up being something in the client.

We have migrated to the latest RC of the v7 version and see some intermittent test failures

https://github.com/Particular/NServiceBus.RabbitMQ/actions/runs/10852801383?pr=1446

For some reason, the consume isn't happening, which messes up all the other tests

The thing we might be seeing so far here would be if BasicPublishAsync task is somehow completing before the message has actually been fully sent and confirmed. This would then let the BasicGetAsync start before the message is in the queue. At first sight, we couldn't spot anything in the transport or test code that could account for that, but like I said we may have missed something.

It is failing on Linux, so it's not just a Windows thing: https://github.com/Particular/NServiceBus.RabbitMQ/actions/runs/10854796632?pr=1446

Reproduction steps

Still investigating

Expected behavior

Publish working ;)

Additional context

No response

danielmarbach commented 1 month ago

We have a hunch the problem might be related to confirms tracking and we are restoring our homegrown approach we had before to verify.

lukebakken commented 1 month ago

The thing we might be seeing so far here would be if BasicPublishAsync task is somehow completing before the message has actually been fully sent and confirmed.

Can you point out the exact test that is failing? In your source code?

The new confirmation tracking code was added by @stebet (I think) so I'm pinging him here.

When you call BasicPublishAsync, yes, it will complete and you must then call one of the WaitForConfirm*Async methods to wait for the confirmation. Or, like you mention, handle the acks yourself and track outstanding messages yourself.

danielmarbach commented 1 month ago

Ah OK that's probably the missing link. Given the API returns a task and the channel knows that confirms are enabled why does one need to call two methods?

The failing tests are visible in the CI runs. Sure I can link them later here

danielmarbach commented 1 month ago

Given the API returns a task and the channel knows that confirms are enabled why does one need to call two methods?

Was this design chosen because of the multiple flag? // cc @stebet

danielmarbach commented 1 month ago

Ok, I think I understand now. The work followed an existing design and made that async. Because we had our own confirmation tracking in place, we never needed to use WaitFormConfirms before. For me though, by looking at the async nature of the APIs as of today, I would have found it more intuitive for a single publish to have that functionality built-in depending on whether the channel is in confirms mode or not. Now that I better understand the functionality, I do see though why the APIs have been separated. It is up to the caller of Publish to decide how many publishes that should be done until the WaitForConfirms is used to wait for all the pending publishes.

lukebakken commented 1 month ago

It is up to the caller of Publish to decide how many publishes that should be done until the WaitForConfirms is used to wait for all the pending publishes

Yep, I think that's the main idea, plus I think it makes it a bit "easier" for an application to handle the case of a timed-out confirmation, rather than a random exception while publishing or at some other point.

Let me know if there's anything I can do to to assist with your testing. Thanks a lot for giving the latest RC a spin!

danielmarbach commented 1 month ago

@bording and I discussed this further today, and we believe this is a legacy that should be removed from the API surface of the client. In an async world on a channel that has confirms enabled, the publish operation could simply wire up the task completion source and then await that instead of the ModelSend. The result of the operation (successful, failed or cancelled) can be propagated into the task completion source.

Doing multiple publishes and waiting for those to be confirmed is then simply a Task.WhenAll. The current way is counterintuitive and error-prone.

We will look into opening a PR

lukebakken commented 1 month ago

Great, that makes a lot of sense.

Tornhoof commented 1 month ago

Doing multiple publishes and waiting for those to be confirmed is then simply a Task.WhenAll.

Be careful that multiple publishes via Task.WhenAll might not keep the ordering.l of the initial list of publishes.

A System.Threading.Channel might solve that and multiple confirms could then simply be a part of an inner (synchronous) TryRead loop, i.e., if multiple publishes are enqueued synchronously, these could be then confirmed together. This would then improve the confirm delay for channels with higher publish load.

danielmarbach commented 1 month ago

Sure we can take this into account in the PR @Tornhoof

danielmarbach commented 1 month ago

@Tornhoof FYI we don't care about the order of publishes for our use cases that much, but I can see how that is something that might be something the users of this library care about, which I guess right now is fulfilled by having the channel wide semaphore ensuring consistent ordering. So I started wondering how much of that is a concern of the rabbitmq client vs the user of the API surface.

Tornhoof commented 1 month ago

So I started wondering how much of that is a concern of the rabbitmq client vs the user of the API surface.

I'd say User, but if publish automatically confirms (and single confirm is slow), then you'd need something like PublishMany(list) with multi confirm.

From my own Benchmarks while migrating to 7.0, without any confirm, the code published around 16k msg/s, with single confirm it was down to 800 and with an S.T.Channel doing multi-confirm after the synchronous TryRead Loop lifted it back to ~10k.

lukebakken commented 1 month ago

@danielmarbach @bording @Tornhoof - https://github.com/rabbitmq/rabbitmq-dotnet-client/issues/1682

danielmarbach commented 1 month ago

Now that things are split out into dedicated issues and PRs I will close this one. Brandon will update #1682 with some of the discussions and challenges we are having.

rabbitmq / rabbitmq-dotnet-client