vladsud commented 3 years ago

Today, we may fail to summarize document if summary is large. For example, you can take a look at discussion in https://office.visualstudio.com/OC/_workitems/edit/4883680, where summary upload / server processing could take 2 minutes, and it runs into coherency issues on server as there is good chance PUSH is pushing ops to storage during this time, pretty much ensuring summary failure.

While other parts of the system needs to be taking a look (including reducing frequency PUSH flushes ops to storage), we also need to have more incremental way of uploading summaries.

We already could do it by

Taking some part of the tree and uploading it independently.
Using handle we get from above to either be representation of sub-tree in final summary (i.e. what we uploaded above is whole data store as a root, and just reference that handle instead of uploading data store sub-tree), or cherry pick blob by blob or tree by tree from that handle, using existing mechanism of referring to blobs / trees from prior summaries.

I'd propose that we use "simple" logic that is

if blob + edge count or total blob size is greater than certain limits for data store (first level in the tree), we upload that tree separately.
And replace it with a handle we got from above.

Note - uploads should not be parallel, as that will defeat the purpose and cause coherency failures on storage side. Also we need to have some bail-out logic, likely leaving it up to container runtime to cancel whole process by passing cancellation token

Needs to collect more data with this issue.

[x] #5618
[ ]

jatgarg commented 3 years ago

But this would only take into account size set per data store. The problem we had was the number of data stores in the snapshot. So individual data store size could be within the limit but the total snapshot could be big. So maybe we have a limit and try adding data stores to that and when we reach that limit, then we upload that many data stores in one tree and use its handle returned in the final tree. Then keep doing this for the entire snapshot.

vladsud commented 3 years ago

Right, we should group multiple data stores until we hit a limit, and flush either before or after adding last data store (or maybe flush that last data store separately) I.e. if the limit is 4Mb, then

0.5 +5 -> flush 0.5 + 5, or 5 and keep 0.5 (continue accumulate).
0.5 + 3 + 1 + 2 -> flush 0.5 + 3 + 1, reset accumulation with 2.

jatgarg commented 3 years ago

Going to present the design in coming Monday meeting.(6/7/21)

jatgarg commented 3 years ago

The natural solution is to upload the summary into parts based on some heuristics. The metrics which are worth considering are: 1.) Number of blobs – This metric really determines the size of summary and as blobs mostly constitute the size of the summary. According to telemetry, 99 percentile for number of blobs in uploaded summary is 650, 99.9 is 4800 and 100 is 38K blobs. 2.) Number of nodes which are not blobs – This metric does not constitute mostly to the size, but it adds to the complexity of summary getting processed at the server. So, we can set a limit on it also. 3.) Size of blobs/datastores – This means have a size limit L, and then we keep adding blobs to a bag and then when the size of this bag reaches L, them upload that part of trees separately. However, on my discussion with Marcelo, we concluded that this basically averages out across the blobs, and we can just consider the number of blobs/nodes as the metric instead of size. This also reduces the complication that comes with choosing blobs to fit into the bag to make a lesser number of network calls. Ex: If the limit is 4Mb, then if we are processing data stores sizes, then: • 0.5 +5 -> flush 0.5 and keep 5 and start adding others, or flush 5 and keep 0.5 (continue accumulate). • 0.5 + 3 + 1 + 2 -> flush 0.5 + 3, continue accumulation with 1 + 2.

So, a combination of number of blobs and number of non-blob nodes can be used as a heuristic to determine when to upload a tree separately based on whichever limit is hit first. Then based on telemetry we can later change these limits. Also, we can have a fallback which comes into play whenever our normal limits are not working. Then we can have conservative limits with a much lower number of nodes and blobs as limits.

Method of uploading the summary parts. Let us say we have a tree with 3 data stores: { path "a", type "tree", value: {children...} path "b", type "tree", value: {children...} path "c", type "tree", value: {children...} } We upload the child “a” first and get handle “handleA” back. handleA = {path "a”, type "tree", value: {children...}} Then we want to upload the tree with child B separately we must include child A too as per semantics at server:

                          {
                                path "a", type "commit", value: “handleA/a”
 handleB =                   path "b", type "tree", value: {children...}
                         }

Then we can upload the full tree and get final “handleFinal” back:

                                  {
                                               path: "a", type: "tree", value: "handleB/a",
 handleFinal =            path: "b", type: "tree", value: "handleB/b",
                                               path "c", type "tree", value: {children...}
                                  }

So, the crux is that every time we upload a tree, we need to include the previous commits in the tree as otherwise we are not sure whether the server could retain then or not. Also, this will be a serial process because of this and coherency failures that could occur by making parallel requests.

Process Cancellation This means allowing runtime to allow cancellation or sending some timer to cancel this process in case it is taking long time. But I have a doubt over it as if we are not able to upload the summary at a certain point in time, then we would not be able to upload it later as the summary would get bigger with time, and we do want to upload the summary for proper functioning of container as that is whole point of this issue. So maybe this is not plausible?

vladsud commented 3 years ago

Thanks for sharing! Some number of questions / observations:

I assume that any intermediate result is lost if summarizing client loses connection, and needs to start over, as it does not know if data it uploaded is still relevant. That's worth calling out explicitly.
We need to clearly spell out how server will communicate to client that handle we are using (for previous part) is no longer valid and whole process needs to be scratched and start over. It's possible to get here because another client decided to become summarizer, either because this client did not make progress quickly enough or because server decided that this client lost connection and another summarization client was elected.
Description says that it does not matter how we partition data and that it's not beneficial to do it at data store level, yet all examples are at data store level. It would be great to clarify exact algorithm, i.e. does it look at individual blobs irrespective of where they are or not? And also what does it do if it arrived to really large blob (like 3Mb), but the bucket (it uses to accumulate blobs) is small (only 1.5Mb) - do we skip that 3Mb blob and try to find smaller blobs to add to bucket, flush 1.5Mb bucket, or add 3Mb blob and then flush it?
I think the most interesting question - is this ODSP specific? Or should it be generalized such that logic is in runtime, and any driver can implement incremental upload? Note that a driver can always opt out by setting size to infinity. Current driver API would likely needs to be changed a bit to support that, but change is small (and likely can be done in back compat fashion, ideally - through the use of adapter that converts new API to old).

jatgarg commented 3 years ago

1.) and 2.) If the process of uploading summary gets halted in between due to connection loss, then the uploaded part will just be garbage collected at server. "channel" trees are only valid until the next snapshot tree "channel" or "container" gets created on server. So the new client will join and start uploading summary and the previous uploaded incomplete tree will just be garbage collected. 3.) In my design, as I said that the size is usually averaged out across blobs, so I am proposing to avoid using the size of blobs as heuristics and just use no of blobs and no of trees as the parameters. This leads to avoid use of buckets for sizes. We will just upload at the data store level and maybe improve it later if we face some issue. So lets say we have a limit of n blobs and m trees and we are processing data stores and counting trees and blobs inside a data store and we hit a limit in any of n and m, we will end adding more data stores at that point in time but will include that data store in our bucket and upload that bucket in separate network call. 4.) I think it to be odsp specific because one r11s already uploads tree by tree and second is that the drivers/servers could have different way of referring to previous commits and other things, so maybe the 1 way would not fit other all drivers. As we can see in this example, for odsp whenever we upload a partial tree it needs to refer to previous partial tree and include that as well, otherwise the previous partial tree would get lost.

vladsud commented 3 years ago

Not looking at blob sizes will result in very uneven batches. Consider common case of really large text data store (Scriptor) and a lot of small data stores (table 10x10). Number of nodes in each data store is roughly the same, but total payload size is vastly different. Also given that we are moving toward aggregated blobs, number of nodes would be even more uniform across data stores. I think much better approach is to define max total size (let's say 8Mb) and flush either current bucket of blobs, or new blob we are considering to add to this bucket, whatever is bigger, when total size is over limit (8Mb). I.e.

bucket is 3.5Mb + next blob 4Mb -> no flush yet (not above size limit).
bucket is 3.5Mb + next blob 4.6Mb -> flush 4.6Mb blob, keep bucket for future blobs to be added to it.
bucket is 4.5Mb + next blob 4Mb -> flush bucket, start over by adding new blob to it.

For code structuring, I'd make it as much as possible reusable. I.e.

Either modify existing IDocumentStorageService to support possibility of incremental summaries (likely via adding some fields in ISummaryContext to be able to express sub-summary, i.e. avoid asserts / checks in OdspSummaryUploadManager.writeSummaryTree around checking / resetting this.lastSummaryProposalHandle), OR create new interface for it that is almost identical to IDocumentStorageService except for incremental summary support. Let's call it IDocumentStorageServiceIncremental
Create stand-along adapter implementation that implements IDocumentStorageService via passing through all calls to wrapped (internal) IDocumentStorageServiceIncremental except for uploadSummaryWithContext that it implements using policy described above.
ODSP factory will create this adapter object to wrap around OdspDocumentStorageService class that is mostly not aware of incremental summaries (other than implementing IDocumentStorageServiceIncremental and whatever requirements it brings).

jatgarg commented 3 years ago

Some data as per our telemetry: Recently for data of last 30 days we haven't found any failure due to timeout. This shows that it is very rare issue. Also for summary upload events: Success events 500K events total

Percentile	Size(MB)	Blobs
99	1.7	650
99.5	3.3	1450
100	29	38K

Unsuccessful events(Not necessarily timeouts, in fact we haven't had any failure due to timeout in last 30 days) 5K events

Percentile	Size(MB)	Blobs
95	0.8	606
99	3.5	4k
99.5	11	4.5K
100	27	20K

This data does not present any co-relation to size as we can see the larger requests have been successfully uploaded or completed. (29MB(success) > 27 MB(failure)) and (38K blobs(success) > 20K blobs(failure))

Basically this shows that these sizes and blobs are within the limits of upload. Larger summaries should fail at some point.

vladsud commented 3 years ago

Thanks for the data! Some thoughts:

We should figure out why some summaries are so large. 29Mb is a lot, even for full summary, and that likely points to something we need to address irrespective of this issue.
- I'd guess there is high correlation with blob counts. If so, that likely points to tables and possibly to GC not having sweeping. Navin recently added telemetry I think to count number of data stores - references and unreferenced in summary / snapshot, can we get some input on how GC sweep will help or not help here? Some projection would be super useful, as it will tell us if same problem would be addressed eventually by it or not
- Also it would be great to look at snapshot sizes for such big cases (summaries). Incremental summary is not going to help with boot that depends on transferring 29Mb.
I'd not make claim that size does not matter. Our audience at the moment is skewed toward people with good internet. This will change as we have more users. Also even 29Mb payload for Teams' scenarios would be huge (I'd expect data partitioned per host would be skewed toward office.com having larger payloads)
I'd think 99% of large summaries are "safe" summaries, i.e. full summaries. We should look into reasons for them. Currently it's likely very negatively impacted by issue https://github.com/microsoft/FluidFramework/issues/6417, and it might be hard to see beyond it. But Arin is promising a fix today, so we should start looking at other reasons and how we can get to a case where we have very few full summaries out there.
We need to get clarity from ODSP if 38K nodes per summary is acceptable. @marcmasmsft, can you please help us with some data / your thoughts here?

Without more data, I'd say that optimizing (and understanding) our total file size and # 3-4 above are top priority.

vladsud commented 3 years ago

Oh, number of blob counts actually will not be such a big problem once we have blob aggregation. So that's probably the wrong angle to optimize (though still useful to have all the data and input from SPO). Size is what is really worrisome

jatgarg commented 3 years ago

We need to collect more data about number of data stores uploaded per each summary. This issue about recording this telemetry data is tracked here: https://github.com/microsoft/FluidFramework/issues/5618 and assigned to @chensixx

jatgarg commented 3 years ago

The design is in. Moving it to July to collect more data with above issue and see what is the priority and whether it needs to be implemented or not.

vladsud commented 3 years ago

Here are some stats for snapshots (i.e. final size of a file, uncompressed) - Data_bodySize from TreesLatest_end for last 7 days: 98th percentile: 4.5Mb 99th percentile 6.5Mb 99.5th: 9.3Mb max: 35.7Mb

Note that while this is bad for perf, it's mostly bad for processing, memory consumption and CPU, but not that bad for bandwidth, as ODSP will use gzip compression above certain blob size limit, so actual payload. For example

Chapter 2 Collaborative meetings is 8.5Mb, but compressed payload is 2.2Mb.
Chapter 1 Collaborative Components Status: 6.6 / 1.1

Some of the bigger files might have been created before images were moved to attachment blobs.

There are

19 docs that are over 10Mb (opened in last 7 days).
- 6 are whiteboard files
- 13 are fluid (not whiteboard)
4 above 15Mb.
- one whiteboard file has only 24 blobs, making it above 0.5Mb per blob on average.
2 above 20Mb.
- One of them (whiteboard file) has 88 blobs, making it 256Kb on average per blob.
- The other (fluid file) has 17K blobs, making it 1.4Kb on average per blob, suggesting this is example of a file we want to understand better.

vladsud commented 3 years ago

There is change on Whiteboard side close to making it to prod that will greatly reduce snapshot sizes by offsetting content into attachment blobs. So we should focus on non-Whiteboard content.

Got access to File Upload Bi-Weekly Notes, and it's 12.3 / 18.8 Mb (compressed / uncompressed).

And yes, it has images in old format, embedded inline. :(

vladsud commented 3 years ago

Here is my proposal to make investigation like that simpler:

Add somewhere in summary information like
- time stamp when file was created
- version of FF runtime used to create this file.
- Maybe some more properties that might be useful, like maybe summary counter.
Every time we summarize the file, or load a file, we add these properties to some telemetry events (or add new events), such that we can track them over time.

While it will not answer all the questions, it will help us do filtered searches in telemetry, like files created over last 3 months (when images where changes), and thus have better sense on correlation to certain features. It will be exact science (maybe files get bigger over time, and thus filtering on time / runtime version used to create file will tell us less about feature set used but lifetime of file). But certain properties (like number of summarizations for this file) will help to peal the onion.

Other than that, I propose we do not attack the underlying problem until we have better data suggesting it is a problem.

I'd also add that I've focused on total file size / snapshot size. But even with really big files, we could make summarize smaller by ensuring we do not use safe summary (as often as we do). One of the latest (0.42?) change by Arin makes improvement here, we need to continue observe telemetry in that space.

vladsud commented 3 years ago

And more overall view into problem space:

WhiteBoard big files - there is fix in flight to offset content into attachment blobs, but hitting 4Mb limit in ODSP, so some more discussion is needed before they can pull it off
Most of the big files I see have inline images (old files). Suggestions above to add more telemetry to differentiate.
There are improvements coming to reduce number of safe summaries and we need to continue to monitor and improve this area.
There are number of independent issues about snapshots too big:
- I'm following up separately on pushing GC blobs (metadata) out of snapshots - some files are really badly impacted here (though it might not help summary, only download).
- catchUp blobs are too big due to big collab window that is result of us not sending noops frequently. There is an issue and design to improve it, waiting for changes on server.
GC sweep is the next best thing we can do (or some temp workarounds similar to above - i.e. pushing them out of snapshot and maybe summary, I'll think about it more).
Time to think about client-controlled compression?

vladsud commented 3 years ago

Issue #6551 has been opened to directly track improvements in telemetry. Closing this issue as we are not going to attack initial problem it raised. At least for now, till more data exists to convince us it's needed.

Other issues listed above, making sure we have references:

6364 catchUp blobs being too outdated contribute to snapshot size

6371 GC has rather big impact on snapshot size

vladsud commented 3 years ago

Worth noting that data for loaderVersion = 0.47.1 looks much better than previous versions. I assume that's impact on changes to noop scheduling.

microsoft / FluidFramework

ODSP: Implement incremental summary upload #5690

6364 catchUp blobs being too outdated contribute to snapshot size

6371 GC has rather big impact on snapshot size