ericleigh007 opened 6 months ago
I have done some more tests on this and my original hypothesis seems correct: Netherite is taking longer to start orchestrators and activities.
Our code has to transmit this information down between orchestrators and activities because it is operating on Cosmos data that we've cobbled together from various document types into a materialized object at the higher levels, and that object is commonly used lower down in, let's say, 10 places. Ergo, fetching it back from Cosmos would be roughly 10x more expensive than fetching it once and passing it along.
And for better or worse, we're not able to change that now, as we're expected to go live in less than a month.
@sebastianburckhardt, you added blob storage to avoid transmitting large messages across Event Hubs for me. Is there a configuration element we can tweak so that that logic helps us in our quest for lower latency here?
As a first step, I'm sure you'll ask me just how "large" our messages are at max and how large the average is, and I would turn that back to you to ask if there is something you know of that could be used to report that info with little fuss.
🙏
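For what it's worth, the lowest-fuss thing I can think of doing on my side in the meantime is to log the serialized size of each orchestrator/activity input before handing it off. A minimal sketch of such a helper (my own, hypothetical, assuming our inputs are JSON-serializable; nothing here is built into Netherite):

```csharp
// Hypothetical helper (not part of Netherite or Durable Functions): logs an
// approximate serialized size for an orchestration/activity input, assuming
// the input is JSON-serializable.
using System;
using System.Text;
using System.Text.Json;

public static class PayloadSizeProbe
{
    public static int MeasureAndLog<T>(T input, Action<string> log, string label)
    {
        // Serialize the same object we'd pass to the orchestrator/activity
        // and report its UTF-8 byte count.
        string json = JsonSerializer.Serialize(input);
        int bytes = Encoding.UTF8.GetByteCount(json);
        log($"{label}: ~{bytes / 1024.0:F1} KB serialized");
        return bytes;
    }
}
```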
I wouldn't mind if this were answered more quickly. I was wondering about examples of maybe tweaking internal or user-accessible settings, or maybe a compression algorithm, to keep Event Hubs from being as much of a bottleneck for larger payloads.
Hey @ericleigh007: sorry for the delay, it's been busy :-) .
> My current follow-up idea is that this is due to the more scalable, but slower, path Netherite uses to push a message across Event Hubs, with its base throughput of only 1 MB/sec per TU, as opposed to the time the Azure Storage backend takes to transmit queue entries.
This is an interesting theory; at a glance I see the logic, but I can't immediately confirm that this is the culprit. I do know that @sebastianburckhardt implemented a related optimization here (https://github.com/microsoft/durabletask-netherite/pull/275), but scanning through the PR, I don't see any top-level configuration knobs.
> As a first step, I'm sure you'll ask me just how "large" our messages are at max and how large the average is, and I would turn that back to you to ask if there is something you know of that could be used to report that info with little fuss.
Good guess. It may be possible we already have telemetry for this. Can you provide me with your app name, the affected instance ID(s), and a time range in UTC that I can use to dig deeper?
Thank you. I will collect these tomorrow and send them via your Gmail, if that's fine.
Thanks @ericleigh007: Yes, that's fine, go ahead. Just please tag me here once sent, so that I can check my email in case it did not land in my inbox.
Hello @davidmrdavid, I have sent some information to your Gmail which I hope is of help. Please shout back via email if you need anything else.
So, @davidmrdavid and @sebastianburckhardt, I know you said there are no external knobs to turn, but I wonder if you can point me to something "internal" (I can build the source, you know), so that I can maybe find a better setting for what I'm doing.
I have a hypothesis that maybe the tradeoff point is not appropriate for our workload, and that it gives Event Hubs too much priority for the payload, thereby letting things take a few seconds to transfer. I'd probably rather use the blobs earlier, IF that won't blow our blob backend throughput limit.
I don't think it should, since I believe the old backend's throughput hit a wall because of the limit on operations (counting scans) against storage queues.
Give me some ideas, please, thanks!
Thanks @ericleigh007: Just confirming I've seen the email btw. I'll look to carve out time to see the data.
> but I wonder if you can point me to something "internal" (I can build the source, you know), so that I can maybe find a better setting for what I'm doing.
Just for transparency, Sebastian is currently out of office so unfortunately he won't be able to assist in the immediate term, and he's indeed the SME here who'd more readily be able to point you to something internal. But I'm nonetheless hoping we can make a dent here in the meantime.
I do think that studying this PR (https://github.com/microsoft/durabletask-netherite/pull/275/files) is our best bet. In there, I see the creation of two new utility classes used to deal with large messages: a blob sender and a blob receiver.
I think there are performance knobs in each that we could tweak. In the blob sender, in particular, the variable maxEventHubsBatchEvents seems to control whether or not we offload a large message to a blob batch instead, as seen here: https://github.com/microsoft/durabletask-netherite/blob/f8b5634746585c92a59fa0019c39930925c28aa9/src/DurableTask.Netherite/TransportLayer/EventHubs/EventHubsSender.cs#L93-L95

So if you're trying to opt into the "blob optimization" earlier (I think that was your suggestion), then perhaps you can make maxEventHubsBatchEvents smaller. Does that make sense?
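To sketch what I mean, the shape of the decision as I read those linked lines is roughly the following. All names and numbers here are illustrative, not the actual Netherite internals:

```csharp
// Illustrative sketch only: events accumulate into an Event Hubs batch until a
// count/size threshold is reached; the overflow is offloaded to the blob path,
// and Event Hubs then carries only a reference to the blob. Lowering the
// thresholds would push traffic onto the blob path sooner.
using System.Collections.Generic;

public class SenderSketch
{
    readonly int maxEventHubsBatchEvents = 500;    // illustrative, not the real default
    readonly int maxEventHubsBatchBytes = 900_000; // stay under the ~1 MB Event Hubs event limit

    public (List<byte[]> eventHubBatch, List<byte[]> blobBatch) Partition(IEnumerable<byte[]> packets)
    {
        var eventHubBatch = new List<byte[]>();
        var blobBatch = new List<byte[]>();
        int batchBytes = 0;

        foreach (byte[] packet in packets)
        {
            bool fits = eventHubBatch.Count < this.maxEventHubsBatchEvents
                        && batchBytes + packet.Length <= this.maxEventHubsBatchBytes;
            if (fits)
            {
                eventHubBatch.Add(packet);
                batchBytes += packet.Length;
            }
            else
            {
                // Overflow goes to storage; only a pointer travels over Event Hubs.
                blobBatch.Add(packet);
            }
        }

        return (eventHubBatch, blobBatch);
    }
}
```

If that reading is right, then shrinking maxEventHubsBatchEvents (or the byte threshold) is effectively the "use blobs earlier" knob you're after.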
On the BlobReceiver, I see there's some concurrency control via a semaphore (AsynchronousStorageReadMaxConcurrency) that limits the amount of Azure Storage concurrency, which perhaps could be naively increased as well.
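The rough shape of that gate, as I understand it (again an illustrative sketch, not the actual BlobReceiver code):

```csharp
// Illustrative sketch: a semaphore caps how many blob downloads run in
// parallel; raising the cap allows more concurrent Azure Storage reads at the
// cost of more load on the storage account. Names and the value are mine.
using System;
using System.Threading;
using System.Threading.Tasks;

public static class ReceiverSketch
{
    static readonly SemaphoreSlim AsynchronousStorageReadMaxConcurrency = new SemaphoreSlim(10);

    public static async Task<byte[]> DownloadGatedAsync(Func<Task<byte[]>> downloadBlob)
    {
        await AsynchronousStorageReadMaxConcurrency.WaitAsync();
        try
        {
            return await downloadBlob();
        }
        finally
        {
            AsynchronousStorageReadMaxConcurrency.Release();
        }
    }
}
```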
David, thanks. This is a balancing act, I think. The way I see it, I now have the choice between good throughput for orchestrator and activity input with the blobs, but higher transactions/second against storage, and slower throughput for that with Event Hubs itself.
I was wondering if these blobs from the blob sender/receiver COULD be placed in a different storage account, which might give us faster throughput without affecting the single storage account used for the other operations of the backend.
@ericleigh007: It may be technically possible to have a storage account just for blobs, but I worry that's a bigger change than it sounds, and even if I could provide it in a private package, I'm unsure I could get it officially into the main branch.
That said, I can definitely try to provide you with a private package that tweaks some of the perf knobs I described above. To avoid mixing too many variables, though, I'd like to first make a bit more progress here (https://github.com/microsoft/durabletask-netherite/issues/383#issuecomment-2111735829), and once that error seems resolved (I think we're close), I'll look to provide some tweaks to these settings.
@ericleigh007: I think it is perhaps too early to fine-tune the parameters without better insight into what is causing the delays. I have taken a look at the telemetry for the two-day period and the orchestration you mentioned:
```kusto
let start = datetime("2024-04-29T20:00");
let end = datetime("2024-05-01T20:00");
let orchestrationName = "CollectingTransactionSetChangeOrchestrator";
```
It seems that 99% of all orchestrations finish within 4.26 seconds. The interesting question is what happens to the others? There is a pretty long tail (the max execution time is about 15 seconds). Importantly, though, these 'slow' orchestrations are not a constant occurrence over the two days observed. Rather, they all originate from just three distinct time periods, of lengths 20 minutes, 7 minutes, and 6 seconds respectively.
I still have to dig deeper to find out what is causing the slowness during those times, but I would be surprised if it is related to the Event Hubs throughput. I suspect it is about bugs or other faults that cause partitions to restart.
My estimate is around 16 MB for what we call the Collecting path (so orchestrators starting with Collecting....), smaller for the other flows, but I think 16 MB IS a decent estimate for "frequent" use. We also have one or two orchestrators that take larger hunks of data, and one that is sometimes as large as 50 MB, which corresponds to a return of 25,000 "items" in a single Service Bus trigger.
I also did some quick experimentation with "hidden" compression/decompression, and I wondered why the current persistence stores don't use this, or at least offer it as an option.
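For context, the experiment was roughly the wrapper below: gzip the serialized payload before handing it to an orchestrator/activity and inflate it on the other side. This is my own sketch, not anything Netherite exposes, and it assumes string/JSON payloads:

```csharp
// Sketch of a "hidden" compression wrapper around orchestration payloads.
// This is application-level code, not a Netherite feature.
using System.IO;
using System.IO.Compression;
using System.Text;

public static class PayloadCodec
{
    public static byte[] Compress(string json)
    {
        var output = new MemoryStream();
        using (var gzip = new GZipStream(output, CompressionLevel.Fastest, leaveOpen: true))
        {
            byte[] raw = Encoding.UTF8.GetBytes(json);
            gzip.Write(raw, 0, raw.Length);
        }
        return output.ToArray();
    }

    public static string Decompress(byte[] compressed)
    {
        using var input = new MemoryStream(compressed);
        using var gzip = new GZipStream(input, CompressionMode.Decompress);
        using var reader = new StreamReader(gzip, Encoding.UTF8);
        return reader.ReadToEnd();
    }
}
```

The obvious tradeoff is CPU per call versus bytes on the wire, but for mostly-JSON payloads that seems worth measuring.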
@sebastianburckhardt: we don't have stress tests with large inputs such as the ones described above, do we? I suppose it would be easy to modify them to generate and use several MB of random data?
We have switched to Netherite for our production workloads and we are quite pleased with the results in our testing.
However, it appears that for payloads of a certain size, the latency of calling orchestrators and activities is greater than with the same code on the default backend, and we'd like to explore whether there are ways to bring this more in line.
Our main codebase is centered on change feeds from Cosmos DB, and a change feed immediately calls an orchestrator that then calls activities based on the state of the document. I have taken some basic A/B data for the time the change feed function takes, and it appears the timing is longer with Netherite.
This is a simple test without properly isolating all the factors, but I wanted to know if any of you, especially @sebastianburckhardt, might know of a reason. The graph below illustrates the longer latencies: the blue line represents the request duration for our main change feed on the standard backend, while the red is the duration when Netherite is used. The Netherite durations are spread over a wider range and, overall, appear to be at least twice the duration of the standard backend. The data is normalized so that the scale is the percentage of total requests.
Thanks!