microsoft / FluidFramework

Library for building distributed, real-time collaborative web applications
https://fluidframework.com
MIT License

catchUp blobs being too outdated contribute to snapshot size #6364

Closed vladsud closed 3 years ago

vladsud commented 3 years ago

PR https://github.com/microsoft/FluidFramework/pull/6235 addressed GC-specific concerns about summary content (header + body) for the Sequence DDS being based on the state of the Sequence at the MSN rather than at the latest sequence number.

However, examining random files (for example Chapter 2.1 Collaborative Meeting Notes — Status), catchUp blobs account for 12% of the total blob content that comes in the snapshot (i.e. excluding images), which has a direct impact on snapshot size and processing times. I looked through the top ~35 blobs by size, and 21 of them were catch-up blobs. This makes sense given the usage of tables (including a number of deleted tables that GC does not sweep today). While GC will improve the overall situation for files like that, percentage-wise we will likely continue to see the same thing: catch-up blobs for various Sequences holding a lot of ops, and summaries not being regenerated even when the final sequence number moves a lot. For example, at the time I looked at it, the MSN for the file was at 159883, but all but one of the catchUp blobs had ops in the range of 14xxxx numbers (i.e. 10K behind).

I do not have a good proposal here (other than maybe coming back to the idea of serializing segments as they are in memory?). But it feels like something we need to start thinking about in broader terms than just GC.

@DLehenbauer , @agarwal-navin - FYI

anthony-murphy commented 3 years ago

Based on the scenario you describe, these must be Sequences that have no recent ops and so have not been re-summarized. Regardless of whether we keep catch-up ops or inline the content, we'll have a similar amount of data. With blob aggregation the number of blobs should also be the same. The solution here could be to summarize based on min seq change, or some heuristic around min seq change. That would allow removing this data regardless of the format it is stored in.
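To make the min-seq-change heuristic concrete, here is a minimal sketch of what such a trigger could look like. All names (`SummaryState`, `shouldResummarize`, the threshold value) are illustrative assumptions, not actual FluidFramework APIs:

```typescript
// Hypothetical heuristic: in addition to any op-count trigger, force a
// re-summary when the minimum sequence number has advanced far past the
// sequence number at which the last summary was generated.

interface SummaryState {
  // Sequence number captured when this DDS was last summarized.
  lastSummarySequenceNumber: number;
}

// Re-summarize once the MSN has moved more than `threshold` ops past the
// last summary, even if the DDS itself has had no recent ops. This lets
// stale catch-up data be dropped regardless of the format it is stored in.
function shouldResummarize(
  state: SummaryState,
  minimumSequenceNumber: number,
  threshold: number = 1000,
): boolean {
  return minimumSequenceNumber - state.lastSummarySequenceNumber > threshold;
}
```

With the numbers from the issue (MSN at 159883, catch-up ops around 14xxxx), such a check would flag those Sequences for re-summarization.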

Also, we have the code to summarize the content inline, but this would break the format and things like search.

vladsud commented 3 years ago

There are 135 catchUp blobs in total, with a combined size of 567KB out of 3.6MB of total blob content coming in the snapshot (I counted them all :)). If we summarize all of these Sequences (using the latest MSN), the total size impact of these ops would be negligible, but the catchUp blobs would go away.
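The kind of accounting described above can be sketched as a small helper that tallies catch-up blob sizes against the snapshot total. The blob shape and the `"catchupOps"` path marker are assumptions for illustration, not the actual on-disk format:

```typescript
// Walk a flat list of snapshot blobs and compute what fraction of the
// total size the catch-up blobs represent.

interface SnapshotBlob {
  path: string; // e.g. "root/<dds-id>/catchupOps" (hypothetical layout)
  sizeBytes: number;
}

function catchUpShare(blobs: SnapshotBlob[]): number {
  const total = blobs.reduce((sum, b) => sum + b.sizeBytes, 0);
  const catchUp = blobs
    .filter((b) => b.path.includes("catchupOps"))
    .reduce((sum, b) => sum + b.sizeBytes, 0);
  return total === 0 ? 0 : catchUp / total;
}
```

Run over a real snapshot tree, this would reproduce the 567KB / 3.6MB ratio quoted above.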

I do not know what the best course of action is here; I at least want to record the data to feed into future discussion.

Search and the server using a fixed schema is a concern, but it's a solvable problem: any change in format can be preemptively addressed by teaching the service code to understand the new format. Note that, similar to GC, this is causing trouble for search today: no amount of waiting or poking at the document will guarantee that correct data lands in the search index, as none of those Sequences will be updated unless a user changes them. I think this angle is actually more important than perf. One possible solution (that already exists in this space) is for all search terms to be provided by the client, taking server-side parsing out of the picture (with the exception of the search blobs themselves). This would substantially increase summary sizes, though.

vladsud commented 3 years ago

I'll try to collect extra data, including:

One thing to add: I believe a big part of this is the recent adjustments to the frequency of noops. Idle clients are no longer eager to communicate their reference sequence number. This causes the summarizer to always observe a large collab window, especially in the large docs I've looked at (i.e. any doc likely to have idle write clients at the moment of summary). Some possible solutions here:

  1. Increase noop frequency. This causes file size to grow due to the number of noops.
  2. Potentially reconsider how noops are sequenced, or how the ref seq number is communicated to the server; see some of the discussion in https://github.com/microsoft/FluidFramework/issues/5629
  3. Clients proactively (on their own) switching to a read connection after some inactivity (today the server does that at the 15-minute mark).
  4. The summarizer sending a signal for all clients to update their latest ref sequence number (send a noop) if the last op they sent has a ref sequence number below the one communicated by the summarizer. The summarizer clearly needs to allow some time for this process to complete.
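Option 4 above could be sketched as a signal/response pair. The signal shape and function names below are hypothetical, not an existing FluidFramework API:

```typescript
// The summarizer broadcasts the reference sequence number it intends to
// summarize at; each client replies with a noop if the last op it sent
// carries an older ref seq. Signal plumbing is assumed, not shown.

interface RefSeqSignal {
  type: "updateRefSeq";
  summarizerRefSeq: number;
}

// Client side: decide whether this client must send a noop so the
// summarizer observes a tighter collab window.
function clientShouldSendNoop(
  signal: RefSeqSignal,
  lastSentOpRefSeq: number,
): boolean {
  return lastSentOpRefSeq < signal.summarizerRefSeq;
}
```

The summarizer would then wait a bounded time for these noops to be sequenced before generating the summary, as the item above notes.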
agarwal-navin commented 3 years ago

Added design-required tag to this one.

vladsud commented 3 years ago

https://github.com/microsoft/FluidFramework/issues/5629 has more discussion on this issue. The summary: we are changing the coalescing timer on the server from the current 250ms to 2000ms. This will reduce the number of noop ops the server produces in response to clients sending noops. With that, we will revert the client logic to send a noop 250ms after any op if the client has not sent any op of its own in that timeframe. That's how the code worked before, but it was generating too many ops, so we switched to a design of sending very few noops; while that helped with the op count, it created really large collab windows, which contributes to this problem.
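The reverted client logic can be expressed as a pure decision helper: a noop is due once 250ms have elapsed since the last processed op, provided this client has not sent an op of its own since then (its own op would already carry the current ref seq). This is a sketch under those assumptions; the name `noopDue` and the timestamp-based shape are illustrative, not the actual client code:

```typescript
// Decide whether an idle client should emit a noop to advance its
// reference sequence number, per the 250ms-after-any-op policy.
function noopDue(
  nowMs: number,
  lastProcessedOpTimeMs: number,
  lastSentOpTimeMs: number,
  delayMs: number = 250,
): boolean {
  return (
    nowMs - lastProcessedOpTimeMs >= delayMs && // quiet period elapsed
    lastSentOpTimeMs < lastProcessedOpTimeMs // no own op since then
  );
}
```

Keeping this check on the client, paired with the 2000ms server-side coalescing timer, is what bounds the noop volume while still shrinking the collab window.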

vladsud commented 3 years ago

It would be great to resolve https://office.visualstudio.com/OC/_workitems/edit/5173446 once this issue is resolved.

vladsud commented 3 years ago

Addressed by these PRs: https://github.com/microsoft/FluidFramework/pull/7085 https://github.com/microsoft/FluidFramework/pull/7092