microsoft / FluidFramework

Library for building distributed, real-time collaborative web applications
https://fluidframework.com

Add handling of summary nacks error codes and retryAfter #6596

Closed vladsud closed 3 years ago

vladsud commented 3 years ago

Core of the problem: The latest ODSP stress test run exposed a problem where PUSH might be throttled heavily by SPO while the client is not. As a result, it shows up as two types of behavior:

  1. PUSH nacks the summary and the client immediately retries, as the client does not have enough information about the reason for the nack. We always retry by uploading a new summary (as opposed to issuing a summary op for the already uploaded summary), making things worse from a payload perspective.
  2. PUSH sits in a retry loop (as it got a 429 from SPO), but the client times out (the current timeout is 2 minutes) and retries, sending more summaries on top of the current pending one.

The reason PUSH is throttled more than the client is that PUSH has to do more work - it flushes ops, and SPO does more expensive work on commit of the .protocol part of the summary (GC of previous summaries / blobs).

Suggested solution:

  1. In both cases, PUSH should nack the summary and provide an error code (429) with retryAfter (i.e. make summary nacks more like regular nacks on the socket).
  2. The client should respect retryAfter and do nothing for that period of time (see the sketch after this list).
    • This should apply to the next summarizer if the summarizer changes in this time frame, though clearly it's more nuanced as we do not have synchronized clocks across clients / servers.
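
For illustration, the nack payload and client-side handling could look roughly like this (the interface and function names are illustrative, not existing Fluid types):

```typescript
// Hypothetical shape of a summary nack that carries throttling info (illustrative only,
// not the actual protocol interface).
interface ISummaryNackContent {
    errorCode?: number;   // e.g. 429 when PUSH itself is being throttled by SPO
    retryAfter?: number;  // seconds the client should wait before trying again
    message?: string;
}

// Client-side sketch: on such a nack, do nothing for retryAfter, then retry.
async function onSummaryNack(nack: ISummaryNackContent, retry: () => Promise<void>): Promise<void> {
    const retryAfterSeconds = nack.errorCode === 429 ? nack.retryAfter : undefined;
    if (retryAfterSeconds !== undefined) {
        // Do nothing for the requested period before retrying.
        await new Promise<void>((resolve) => setTimeout(resolve, retryAfterSeconds * 1000));
    }
    await retry();
}
```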

Note that the client always fetches the latest version of the document on the "safe" summary path, so it's safe for PUSH to send a nack when in reality it is retrying and the summary may still succeed.

The gap in this solution: PUSH can't commit the summary in the timeframe it promised (through retryAfter). In this case we will get the same behavior as today, but on a much slower cadence. We may need to tweak it in the future by submitting nacks continuously for the same summary as a way to communicate to the client that it should keep waiting.

The other part of the problem that is not solved: slowing down summaries will eventually result in too many unsummarized ops and kicking out regular clients. Given that clients know how far the summarizer is behind, they might in the future more proactively disconnect or move the container into a read-only state to reduce workload on the system.

Alternative mitigation: we may raise the summary nack timeout to 10 minutes. SPO usually does not go above 5 min for 429 errors, so unless PUSH is continuously throttled, there is a good chance this will solve a big part of the problem. The client may still issue telemetry errors at the 2 min mark. This would work only as a temporary solution to relieve pressure, as it fundamentally does not address all the problems.

vladsud commented 3 years ago

PR from Gary: https://github.com/microsoft/FluidFramework/pull/6594 ("Add additional properties to IServerError" by GaryWilber)

vladsud commented 3 years ago

@arinwt, @anthony-murphy - can one of you please take this? It's the latest thing blocking TAP40 tests from scaling further.

Also note that some of these problems would be alleviated by single-phase summary commit (as the client would more likely be the one throttled directly by SPO), but fundamentally the problem would be unchanged.

arinwt commented 3 years ago

In response to the alternative mitigation mentioned at the bottom of the description: we originally lowered the timeout from 10 min to 2 min to help with initial load problems (so a single client never waits the full 10 minutes), but now we have code to bypass that by comparing op timestamps. So raising it to 10 min should be okay.

vladsud commented 3 years ago

@GaryWilber, can you quantify how much adjusting the timeout would help here? I.e. it does not solve the nack problem on its own (as the client retries right away on nack). I'd also rather see a stress run with that change made locally before making a decision on incorporating it into production (even as a temp solution). There is one other thing to consider - should timeouts be variable, and increase as we hit timeouts (i.e. variable in the 2..10 min interval)? Would that help this problem, while keeping the first failure at the 2 min retry interval?

anthony-murphy commented 3 years ago

Looking at the code, the max ack wait time is configurable, so it should be trivial to set in the stress test

https://github.com/microsoft/FluidFramework/blob/e5aa94ed44c8d2d204408b62f9b646dc60dd28d6/packages/runtime/container-runtime/src/runningSummarizer.ts#L288

anthony-murphy commented 3 years ago

here is the change to set it in stress: https://github.com/microsoft/FluidFramework/compare/main...anthony-murphy:set-ack-time-in-stress

@vladsud and @GaryWilber what are the next steps here? They can either run with the local change, or I can easily check the above in.

GaryWilber commented 3 years ago

@anthony-murphy Will that change work? It looks like it picks the smaller ack wait time, so it will always pick the const maxSummarizeAckWaitTime = 120000; one
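
Roughly, the selection being described is (paraphrasing, not the exact source):

```typescript
// Paraphrase of the ack wait time selection described above (not the exact source).
const maxSummarizeAckWaitTime = 120000; // client-side constant: 2 minutes

function computeAckWaitTime(serviceMaxAckWaitTimeMs: number): number {
    // Picks the smaller value, so a larger service-provided setting is capped at 2 min.
    return Math.min(maxSummarizeAckWaitTime, serviceMaxAckWaitTimeMs);
}

// Example: even if the service sends 10 minutes, the result is still 120000.
const effectiveAckWaitTime = computeAckWaitTime(10 * 60 * 1000); // => 120000
```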

I think we can update our scale test with the change and rerun it +@vikagrawal

anthony-murphy commented 3 years ago

You are right, not sure how I missed that. I added some changes to my diff above, will probably just make it a PR. Do you know what ODSP sends down for the maxAckWaitTime?

GaryWilber commented 3 years ago

@anthony-murphy Today ODSP sends 10 mins. Here's the full object: (image attached)

Unrelated but it looks like we're sending the server config values too, which is unnecessary for clients to know. I'll clean that up

arinwt commented 3 years ago

If we do want to implement retryAfter logic, I think it should happen in Summarizer. If we see a summaryNack with retryAfter, then we wait that much time before retrying. The current behavior is that the summarizer client tries 3x before closing, so in between these attempts it should add a delay timer of retryAfter (see the sketch after the list). For the 3rd and final attempt, we will need to decide between one of 3 options:

  1. close immediately
  2. wait retryAfter time and then close
  3. wait retryAfter time and then retry up to N more times
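
A minimal sketch of that retry loop, going with option 1 for the final attempt (the helper names and plumbing are assumptions, not the actual Summarizer code):

```typescript
// Sketch only: delay between summarize attempts when the nack carries retryAfter.
interface IAttemptResult {
    success: boolean;
    retryAfterSeconds?: number; // taken from the summaryNack, when present
}

const delayMs = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function summarizeWithRetry(
    attemptSummarize: () => Promise<IAttemptResult>,
    maxAttempts = 3, // current behavior: try 3x before closing
): Promise<boolean> {
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        const result = await attemptSummarize();
        if (result.success) {
            return true;
        }
        // Between attempts, honor retryAfter; after the final attempt this is
        // option 1: close immediately without waiting.
        if (attempt < maxAttempts && result.retryAfterSeconds !== undefined) {
            await delayMs(result.retryAfterSeconds * 1000);
        }
    }
    return false; // caller closes the summarizer client
}
```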

Note: I do not think we need to intermix this with SummaryAckWaitTimeout, since they are tracking 2 different things: if we received a summaryNack, that means we are no longer expecting a summaryAck or summaryNack.

Nice to have: when a new summarizer client is connecting and catching up, it could parse retryAfter from any summaryNack, and then, if there is any remaining time to wait (by comparing op timestamps), wait that much longer before starting. If we don't implement this, it should be okay: the summarizer will get nacked and then wait.
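
The remaining wait for that nice-to-have could be derived from op timestamps along these lines (a sketch; the parameter names are assumptions):

```typescript
// Sketch: how much of retryAfter is still left, based on the observed nack op's timestamp.
function remainingWaitMs(
    nackOpTimestampMs: number,  // server timestamp on the summaryNack op seen during catch-up
    retryAfterSeconds: number,  // retryAfter carried by that nack
    nowMs: number = Date.now(),
): number {
    const elapsedMs = nowMs - nackOpTimestampMs;
    return Math.max(0, retryAfterSeconds * 1000 - elapsedMs);
}
```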

vladsud commented 3 years ago

Can we handle summaryNack with retryAfter the following way:

  1. Change the timeout we use for SummaryAckWaitTimeout to the value specified in the nack + some safety delta (let's say the same 2/10 min we use for timeouts)
  2. Other than that, ignore nack as if it never happened and continue with the rest of existing logic?

The nice thing about it is that PUSH can use such nacks to continuously update the client on delays. It may, for example, take 3 attempts, each with a 5 min delay, to finally commit to SPO. And realistically it's not a nack, it's more of a "please wait a bit" message.

This has the nice property that there is no leakage, i.e. the rest of the system (including the 3 retries you mentioned) does not need to care or know about this process - it's contained to the smallest piece of logic.
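
Roughly (a sketch; the timeout-reset hook is an assumed name, not existing code):

```typescript
// Sketch: a 429 nack only extends the ack wait timeout; otherwise it is ignored.
const safetyDeltaMs = 2 * 60 * 1000; // same 2 min style safety margin discussed above

function onThrottleNack(
    retryAfterSeconds: number,
    restartAckWaitTimer: (timeoutMs: number) => void, // assumed hook into the existing wait logic
): void {
    // Treat the nack as "please wait a bit" rather than a failure: keep waiting for the
    // real ack/nack, just with a longer deadline.
    restartAckWaitTimer(retryAfterSeconds * 1000 + safetyDeltaMs);
}
```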

goyala-ms commented 3 years ago

> @anthony-murphy Will that change work? It looks like it picks the smaller ack wait time, so it will always pick the const maxSummarizeAckWaitTime = 120000; one
>
> I think we can update our scale test with the change and rerun it +@vikagrawal

Sure @anthony-murphy we will try running this on Monday.

vladsud commented 3 years ago

Ok, it looks like I misunderstood what PUSH will do when sending a summary nack with 429/retryAfter. I assumed it's option 1 below, but @GaryWilber mentioned option 2 in today's meeting, so we need to get to the bottom of it, as both approaches have pros & cons:

  1. PUSH sends a nack with 429 but continues to submit that summary to SPO. In this case it makes sense for the client to take retryAfter, add it to the existing summary ack timeout, and keep waiting for updates.
    • Pro: there is likely less work on the client & SPO, as we will try to submit the old summary for which SPO already has the .app part.
  2. PUSH sends such a nack and immediately drops that summary; the client starts over after the timeout.
    • Pro: we will try to summarize using the latest state.
    • Con: more work on the client, and the client will move through the 1-3 phases of summary types (they are more and more expensive), adding cost to client & server.

And while we are waiting for PUSH fixes, there is another problem that needs attention. Today, there are two ways 429s are communicated to the client:

  1. as a nack, when PUSH gets a 429 while submitting a summary
  2. as silence, when PUSH gets a 429 while trying to commit ops, which was triggered as part of the summarizing process.

#2 is alleviated by moving from a 2 to a 10 minute timeout. We do not have any solution yet for #1. While future PUSH changes should return 429 for both, in the meantime, and in general - in order to be a good citizen - we should ensure that the client does not immediately come back with a new summary on a nack, especially since summaries are becoming more and more expensive (including submitting a full summary). We should add some kind of progressive delays to that process, with longer delays for more expensive operations. Maybe even reconsider the 1-3 phases; maybe we do something like this (a rough sketch of such a schedule follows the list):

  1. Normal summary
  2. Refresh latest + normal summary
  3. Delay + Refresh latest + normal summary
  4. Longer Delay + Refresh latest + full summary
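
A rough sketch of such an escalating schedule (the delay values are purely illustrative):

```typescript
// Sketch: escalating delay and cost per retry attempt (delay values are illustrative).
interface ISummaryAttemptPlan {
    delayMs: number;           // wait before starting this attempt
    refreshLatestAck: boolean; // "refresh latest" before summarizing
    fullTree: boolean;         // full summary instead of an incremental one
}

const attemptPlans: ISummaryAttemptPlan[] = [
    { delayMs: 0,             refreshLatestAck: false, fullTree: false }, // 1. normal summary
    { delayMs: 0,             refreshLatestAck: true,  fullTree: false }, // 2. refresh latest + normal
    { delayMs: 60 * 1000,     refreshLatestAck: true,  fullTree: false }, // 3. delay + refresh + normal
    { delayMs: 5 * 60 * 1000, refreshLatestAck: true,  fullTree: true },  // 4. longer delay + full summary
];
```
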
GaryWilber commented 3 years ago

Some additional notes about when push will process summary ops:

Push essentially has a single queue per document that handles posting ops & summaries to SPO. So the queue could theoretically look like this: [ { post ops message }, { post summary @ seq 100 }, { post ops message } ] Push processes the queue messages sequentially. So it will post ops to SPO, then post the summary for 100, then post ops again. The summary ack/nack is sent after posting the summary. So if the queue is taking a long time to get to your summary op, you wouldn't be getting a summary ack/nack for a while. Normally there is not a large delay here... unless SPO is being slow or tells us we're throttled!

So take this case: [ { post ops message }, { post summary @ seq 100 } ]. If we post ops for that first message and SPO tells us "you're throttled, try again in 5 minutes", we will stop processing the queue and wait 5 minutes. That means the summary @ 100 message will not be processed until 5 minutes from now. Since we got throttled due to the post ops flow, and that flow is technically "internal" to Push, the client does not know this is happening. If the client does want to know about this, perhaps we could send a signal telling them we are experiencing throttling.
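
In pseudo-TypeScript, the per-document loop behaves roughly like this (an illustration of the behavior described above, not the actual Push code):

```typescript
// Illustration of the per-document queue behavior described above (not actual Push code).
type QueueMessage =
    | { kind: "postOps" }
    | { kind: "postSummary"; summarySequenceNumber: number };

interface IPostResult {
    ok: boolean;
    retryAfterMs?: number; // set when SPO returns 429
}

async function processQueue(
    queue: QueueMessage[],
    postToSpo: (msg: QueueMessage) => Promise<IPostResult>,
    sendSummaryAckOrNack: (seq: number, success: boolean) => void,
): Promise<void> {
    for (const msg of queue) {
        // Messages are processed strictly in order: a throttle on "post ops" stalls the
        // whole queue, so a later "post summary" sits unprocessed (and the client hears
        // nothing) until the wait is over.
        let result = await postToSpo(msg);
        while (!result.ok && result.retryAfterMs !== undefined) {
            const waitMs = result.retryAfterMs;
            await new Promise<void>((resolve) => setTimeout(resolve, waitMs));
            result = await postToSpo(msg);
        }
        // The ack/nack is only sent once the summary message itself is processed.
        if (msg.kind === "postSummary") {
            sendSummaryAckOrNack(msg.summarySequenceNumber, result.ok);
        }
    }
}
```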

anthony-murphy commented 3 years ago

@tanviraumi how does this work in FRS? I think that storing ops is decoupled from summary storage, due to ops going to Mongo and summaries going to blob storage. It also seems that FRS could get throttled in the same way, and I want to make sure that whatever we do works well, or is configurable enough, for both services.

vladsud commented 3 years ago

@GaryWilber, it would be great to update this issue with the design you are going after to address these problems. Some questions that I'd love to see answered:

  1. Will PUSH examine this queue and nack summary ops when it encounters any issues with posting ops?
  2. Or maybe there will be other mechanism for PUSH to communicate that info to client?
  3. Are you planning to retry 429s on post summary or not?
  4. What is the long term story here and does it affect decisions right now? I.e. with single-stage summary commit, we need some mechanism for relevant ops to land in SPO - are we going with a pull model (SPO pulls ops from PUSH) or a push model (there is some way for the client to tell PUSH to flush ops to SPO before the client moves along with the summary)?
    • maybe we need to go with some temp solutions here until final design lands?

We also need to make sure that the overall design is flexible enough for other services to have a slightly different design (within reasonable boundaries).

GaryWilber commented 3 years ago

@vladsud

  1. No. The queue in question is actually a kafka consumer topic, so it's not possible to read ahead. We generally read a single message, act on it, then checkpoint the queue - this is simple and ensures reliability in case there are server issues such as a crash. It is tricky to nack future summary ops while staying reliable, because the messages are either post ops or post summary - we would need to ensure post ops messages will eventually be processed, so we can't really skip them. I would rather not try to do this if possible.

  2. I would prefer doing this instead of number 1. Would a signal suffice, or some new op type (serverBusy?) that says "Push is busy so try after Y amount of time"? I'm hesitant about making it a new op type because that could result in a loop where our trying to post ops and being throttled results in more ops accumulating. That sounds weird to me.

  3. We could potentially do that. If we do retry post summary calls and get throttled, we wouldn't be sending a summaryNack immediately in this case, since we're still processing the message (waiting to be unthrottled). So we would end up in a state where Push does nothing with the summary op for X minutes, until eventually the ack/nack is sent. Maybe we would want to leverage the solution for number 2 in this case to tell the client we're still working on it. So to answer your question: I was not planning on doing this, but let me know if you think it's worth it. I kind of like how simple the logic is today in the server. We process the queued "post summary" message and do a post request to SPO. If we get a 200 from SPO, we send an ack, otherwise we send a nack. Then we checkpoint the kafka consumer.

  4. I don't think this would affect the single-stage summary plan. There wouldn't be throttling if SPO pulls ops from Push. And if we go with a way for the client to tell Push to flush ops, that is essentially a new websocket api. It would go through a different path than the queue I mentioned before. So we could add relevant error info / retry after properties to that new "flush ops" response message. A broader question: is it necessary to ensure the ops before this new summary are in SPO at the time of summary creation, or could those ops show up a few minutes late and we're still okay? I think it's actually unnecessary. We should have a chat about it sometime.

anthony-murphy commented 3 years ago

For number 4, if there is any chance the ops won't make it to storage, then I think we need to ensure ops are committed before the summary is committed.

GaryWilber commented 3 years ago

Yes it is possible for ops to not make it to SPO if Push or SPO are having issues when we're trying to post ops. But that is how it is today too. Push would be using the same post ops logic as it does now in a single-stage commit world. So I don't think moving to a single-stage commit increases the chance of this happening... unless the post ops api is actually broken and only the post summary api is working - but that would result in issues/data loss for both cases too.

@anthony-murphy What scenario requires ops before the current summary? Is it some offline scenario where a client is working based on an old summary?

vladsud commented 3 years ago

We should get to the bottom of designs here, as otherwise we are making changes that we might want to redo right away.

In earlier discussions, Marcelo mentioned that SPO requires ops prior to a summary to be in SPO, otherwise the summary upload would be rejected. So we need some mechanism (push, pull) for ops to land in SPO. That said, this requirement will ensure that ops are flushed frequently to SPO (similar to today). If this is something we want to improve, or if we want the API to be simpler (no "push ops" API), then we might want to reconsider it. I feel it's OK for summaries to be accepted without prior ops if there is a way to reliably delete such snapshots on op flush failures. Though it adds more complexity (or rather moves it around), and if we can have a situation where ops are lost forever but the "future" snapshot was not deleted, then it's really bad and we should not go that route.

Some signal from PUSH is required (when PUSH can't flush ops), otherwise the client has no clue why summaries are not going through. I think it's good to have irrespective of summarization problems - the app might raise awareness to the user that there is a chance data might not be safe. Can we have that signal be delivered (re-broadcast with adjusted delay) to all newly joining clients? We would also need to ensure the loader does not drop these signals (today it can, if the ordering service connection is established before the runtime is loaded). An op feels like a much more reliable way to deliver it (no need to re-broadcast). If it's just a single op, I'd rather go the op way.

I think it's OK not to retry the summary upload. We can always come back to this decision based on newly available information. That said, I'm not sure a nack with delay info is what we need, irrespective of the decision in this area. I'd prefer to have a single way to communicate delays (the signals discussed above), and have the nack always be about failure to summarize only. This would need to ensure that the signal is delivered before the nack, which is icky (in general, there are no ordering guarantees here).

vladsud commented 3 years ago

Based on the latest scalability run (mentioned in https://github.com/microsoft/FluidFramework/issues/6555), the 10 min timeout on summaries is not helping much. We need to confirm it (based on server telemetry), but the natural explanation is that PUSH gets throttled multiple times when flushing ops, and this blocks summaries and thus any notification to the client that summaries are not moving along. So I believe

GaryWilber commented 3 years ago

Let's work under the assumption that Push getting throttled will always be possible, so we need to be able to handle it without things breaking. So we should implement a way for Push to tell clients about trouble.

> Can we have that signal be delivered (re-broadcast with adjusted delay) to all newly joining clients? We would also need to ensure the loader does not drop these signals (today it can, if the ordering service connection is established before the runtime is loaded). An op feels like a much more reliable way to deliver it (no need to re-broadcast). If it's just a single op, I'd rather go the op way.

Ops would certainly be easier but I am still somewhat worried about the fact that throttling would result in us inserting another op into the stream. Curious to hear Marcelo's perspective on that.

If we wanted to use signals and wanted to make sure newly connected clients get this too, we could leverage the initialSignals property in IConnected. The flow could look something like:

  1. Push creates the "push is having issues" signal and internally stores it in some redis entry for the session
  2. Push broadcasts that signal to clients
  3. Newly joining clients get that signal via initialSignals - it would be populated based on the redis entry in step 1.
  4. Once we make a successful post ops / summary call, we could remove that redis entry.

That's more complex than doing it the op way though.
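
On the client side, consuming such a signal might look roughly like this (the signal type name and payload are hypothetical, not an existing contract):

```typescript
// Hypothetical "push is having issues" signal; the type name and payload are assumptions.
interface IPushBusySignalContent {
    retryAfterSeconds: number;
}

// Handles both live broadcasts and signals replayed via initialSignals on connect.
function handleSignal(
    signalType: string,
    content: unknown,
    pauseSummarization: (durationMs: number) => void,
): void {
    if (signalType === "pushBusy") {
        const { retryAfterSeconds } = content as IPushBusySignalContent;
        pauseSummarization(retryAfterSeconds * 1000);
    }
}
```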

vladsud commented 3 years ago

Moving it to August, given that we need to have an end-to-end design and the existing proposal also waits for PUSH changes to propagate.

vladsud commented 3 years ago

Note: I'm forking the issue of flushing ops and providing feedback on when it gets stuck into https://github.com/microsoft/FluidFramework/issues/6685. This issue should track only handling nacks & retryAfter on nacks.

It should assume that a nack is a failure, i.e. PUSH will not retry that same summary, and the client (after a delay) should simply restart the same process from scratch (i.e. upload the summary, send the summaryOp, etc.).

vladsud commented 3 years ago

Actually, I'll move it back to July. @arinwt - FYI, please let me know if this is not possible. The reason: we should not wait for PUSH changes to propagate, as at this point it will take longer for client changes to propagate (to hosts), and the changes do not have any required ordering. Also, design-wise this item is completely decoupled from #6685; that issue has a plan, and the two no longer have any direct dependencies or unknowns, so we should proceed with both independently as time permits.

Summarization issues are really the main issues blocking progression of the scalability tests (and thus readiness for GA), so this should be our top priority across the board.

And this fix feels rather trivial - propagate retryAfter up to RunningSummarizer.trySummarize() and adjust timeouts and/or retry the same step (see PR #6629).