Closed danielmarbach closed 1 month ago
We have a hunch the problem might be related to confirms tracking and we are restoring our homegrown approach we had before to verify.
The thing we might be seeing so far here would be if BasicPublishAsync task is somehow completing before the message has actually been fully sent and confirmed.
Can you point out the exact test that is failing? In your source code?
The new confirmation tracking code was added by @stebet (I think) so I'm pinging him here.
When you call BasicPublishAsync
, yes, it will complete and you must then call one of the WaitForConfirm*Async
methods to wait for the confirmation. Or, like you mention, handle the acks yourself and track outstanding messages yourself.
Ah OK that's probably the missing link. Given the API returns a task and the channel knows that confirms are enabled why does one need to call two methods?
The failing tests are visible in the CI runs. Sure I can link them later here
Given the API returns a task and the channel knows that confirms are enabled why does one need to call two methods?
Was this design chosen because of the multiple
flag? // cc @stebet
Ok, I think I understand now. The work followed an existing design and made that async. Because we had our own confirmation tracking in place, we never needed to use WaitFormConfirms
before. For me though, by looking at the async nature of the APIs as of today, I would have found it more intuitive for a single publish to have that functionality built-in depending on whether the channel is in confirms mode or not. Now that I better understand the functionality, I do see though why the APIs have been separated. It is up to the caller of Publish to decide how many publishes that should be done until the WaitForConfirms is used to wait for all the pending publishes.
It is up to the caller of Publish to decide how many publishes that should be done until the WaitForConfirms is used to wait for all the pending publishes
Yep, I think that's the main idea, plus I think it makes it a bit "easier" for an application to handle the case of a timed-out confirmation, rather than a random exception while publishing or at some other point.
Let me know if there's anything I can do to to assist with your testing. Thanks a lot for giving the latest RC a spin!
@bording and I discussed this further today, and we believe this is a legacy that should be removed from the API surface of the client. In an async world on a channel that has confirms enabled, the publish operation could simply wire up the task completion source and then await that instead of the ModelSend
. The result of the operation (successful, failed or cancelled) can be propagated into the task completion source.
Doing multiple publishes and waiting for those to be confirmed is then simply a Task.WhenAll
. The current way is counterintuitive and error-prone.
We will look into opening a PR
Great, that makes a lot of sense.
Doing multiple publishes and waiting for those to be confirmed is then simply a Task.WhenAll.
Be careful that multiple publishes via Task.WhenAll
might not keep the ordering.l of the initial list of publishes.
A System.Threading.Channel might solve that and multiple confirms could then simply be a part of an inner (synchronous) TryRead
loop, i.e., if multiple publishes are enqueued synchronously, these could be then confirmed together. This would then improve the confirm delay for channels with higher publish load.
Sure we can take this into account in the PR @Tornhoof
@Tornhoof FYI we don't care about the order of publishes for our use cases that much, but I can see how that is something that might be something the users of this library care about, which I guess right now is fulfilled by having the channel wide semaphore ensuring consistent ordering. So I started wondering how much of that is a concern of the rabbitmq client vs the user of the API surface.
So I started wondering how much of that is a concern of the rabbitmq client vs the user of the API surface.
I'd say User, but if publish automatically confirms (and single confirm is slow), then you'd need something like PublishMany(list) with multi confirm.
From my own Benchmarks while migrating to 7.0, without any confirm, the code published around 16k msg/s, with single confirm it was down to 800 and with an S.T.Channel doing multi-confirm after the synchronous TryRead Loop lifted it back to ~10k.
@danielmarbach @bording @Tornhoof - https://github.com/rabbitmq/rabbitmq-dotnet-client/issues/1682
Now that things are split out into dedicated issues and PRs I will close this one. Brandon will update #1682 with some of the discussions and challenges we are having.
Describe the bug
It is still a bit early to say what actually causes these things to happen and might very well be that the problem ins somewhere in our implementation. But we (@bording and I) figured to give a heads-up in case it ends up being something in the client.
We have migrated to the latest RC of the v7 version and see some intermittent test failures
https://github.com/Particular/NServiceBus.RabbitMQ/actions/runs/10852801383?pr=1446
For some reason, the consume isn't happening, which messes up all the other tests
The thing we might be seeing so far here would be if
BasicPublishAsync
task is somehow completing before the message has actually been fully sent and confirmed. This would then let theBasicGetAsync
start before the message is in the queue. At first sight, we couldn't spot anything in the transport or test code that could account for that, but like I said we may have missed something.It is failing on Linux, so it's not just a Windows thing: https://github.com/Particular/NServiceBus.RabbitMQ/actions/runs/10854796632?pr=1446
Reproduction steps
Still investigating
Expected behavior
Publish working ;)
Additional context
No response