But this would only take into account the size set per data store. The problem we had was the number of data stores in the snapshot: an individual data store could be within the size limit, but the total snapshot could still be big. So maybe we have a limit, keep adding data stores against it, and when we reach that limit we upload that group of data stores as one tree and use the returned handle in the final tree. Then keep doing this for the entire snapshot.
Right, we should group multiple data stores until we hit a limit, and flush either before or after adding the last data store (or maybe flush that last data store separately). I.e., if the limit is 4 MB, we keep accumulating data stores until adding the next one would push the batch over 4 MB.
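As a rough illustration only (assuming a hypothetical `DataStoreSummary` shape and `uploadTree` callback, which are not actual Fluid Framework APIs), the "flush before adding the data store that would exceed the limit" variant could look like this:

```typescript
// Illustrative sketch only: batch data stores by accumulated size and flush
// each batch as its own tree upload. `DataStoreSummary` and `uploadTree` are
// hypothetical placeholders, not real Fluid Framework APIs.
interface DataStoreSummary {
    path: string;
    sizeInBytes: number;
    tree: unknown; // serialized summary tree for this data store
}

const maxBatchSizeBytes = 4 * 1024 * 1024; // the 4 MB limit from the example

async function uploadInBatches(
    dataStores: DataStoreSummary[],
    uploadTree: (batch: DataStoreSummary[]) => Promise<string>, // returns a handle
): Promise<string[]> {
    const handles: string[] = [];
    let batch: DataStoreSummary[] = [];
    let batchSize = 0;

    for (const ds of dataStores) {
        if (batchSize > 0 && batchSize + ds.sizeInBytes > maxBatchSizeBytes) {
            // Flush what we have so far, then start a new batch with the
            // data store that did not fit ("flush before adding").
            handles.push(await uploadTree(batch));
            batch = [];
            batchSize = 0;
        }
        batch.push(ds);
        batchSize += ds.sizeInBytes;
    }
    if (batch.length > 0) {
        handles.push(await uploadTree(batch)); // flush the final partial batch
    }
    return handles;
}
```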
Going to present the design in the coming Monday meeting (6/7/21).
The natural solution is to upload the summary in parts based on some heuristics. The metrics worth considering are:

1.) Number of blobs – This metric largely determines the size of the summary, as blobs constitute most of it. According to telemetry, the 99th percentile for the number of blobs in an uploaded summary is 650, the 99.9th is 4800, and the 100th is 38K blobs.

2.) Number of nodes which are not blobs – This metric does not contribute much to the size, but it adds to the complexity of processing the summary at the server, so we can set a limit on it as well.

3.) Size of blobs/data stores – This means having a size limit L, adding blobs to a bag, and uploading that part of the tree separately once the bag reaches L. However, in my discussion with Marcelo we concluded that size basically averages out across blobs, so we can just use the number of blobs/nodes as the metric instead of size. This also avoids the complication of choosing which blobs to fit into the bag to make fewer network calls. Ex: if the limit is 4 MB and we are processing data store sizes, then:
- 0.5 + 5 -> flush 0.5 and keep 5 and start adding others, or flush 5 and keep 0.5 (continue accumulating).
- 0.5 + 3 + 1 + 2 -> flush 0.5 + 3, continue accumulating with 1 + 2.
So, a combination of the number of blobs and the number of non-blob nodes can be used as the heuristic for when to upload a tree separately, based on whichever limit is hit first. Based on telemetry, we can later adjust these limits. We can also have a fallback that kicks in whenever the normal limits are not working, using much more conservative (lower) limits on the number of nodes and blobs.
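A minimal sketch of that count-based heuristic, assuming illustrative limit values and a hypothetical `SummaryBatcher` helper (neither exists in the codebase), could look like this:

```typescript
// Illustrative sketch: flush the current batch whenever either the blob-count
// or the tree-count limit is hit first. Limit values are placeholders.
const normalLimits = { maxBlobs: 1000, maxTrees: 200 };
const conservativeLimits = { maxBlobs: 100, maxTrees: 20 }; // fallback when normal limits misbehave

class SummaryBatcher {
    private blobCount = 0;
    private treeCount = 0;

    constructor(private readonly limits: { maxBlobs: number; maxTrees: number }) {}

    // Add one data store's counts; returns true when the batch (including this
    // data store) should be uploaded as a separate tree.
    public addAndCheck(blobsInDataStore: number, treesInDataStore: number): boolean {
        this.blobCount += blobsInDataStore;
        this.treeCount += treesInDataStore;
        if (this.blobCount >= this.limits.maxBlobs || this.treeCount >= this.limits.maxTrees) {
            this.blobCount = 0;
            this.treeCount = 0;
            return true;
        }
        return false;
    }
}

// Usage idea: start with normalLimits; if uploads keep failing, retry the
// whole process with conservativeLimits.
const batcher = new SummaryBatcher(normalLimits);
```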
Method of uploading the summary parts. Let us say we have a tree with 3 data stores:

```
{
    path: "a", type: "tree", value: {children...}
    path: "b", type: "tree", value: {children...}
    path: "c", type: "tree", value: {children...}
}
```

We upload the child "a" first and get handle "handleA" back:

```
handleA = {
    path: "a", type: "tree", value: {children...}
}
```

Then, when we want to upload the tree with child "b" separately, we must include child "a" too, as per the semantics at the server, and we get "handleB" back:

```
handleB = {
    path: "a", type: "commit", value: "handleA/a",
    path: "b", type: "tree", value: {children...}
}
```

Then we can upload the full tree and get the final "handleFinal" back:

```
handleFinal = {
    path: "a", type: "tree", value: "handleB/a",
    path: "b", type: "tree", value: "handleB/b",
    path: "c", type: "tree", value: {children...}
}
```
So, the crux is that every time we upload a tree, we need to include references to the previously uploaded parts, as otherwise we are not sure whether the server would retain them. This also makes it a serial process, both because of that chaining and because of the coherency failures that parallel requests could cause.
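A rough sketch of that serial, handle-chaining loop (assuming a hypothetical `uploadPartialTree` callback and a simplified tree-entry shape, not the actual ODSP driver API):

```typescript
// Illustrative sketch: upload summary parts one by one, re-including handle
// references to everything uploaded so far, as described above.
interface TreeEntry {
    path: string;
    type: "tree" | "commit";
    value: unknown; // either children or a "handle/path" reference
}

async function uploadSerially(
    parts: TreeEntry[][],
    uploadPartialTree: (entries: TreeEntry[]) => Promise<string>, // returns a handle
): Promise<string> {
    let previousHandle: string | undefined;
    let previousPaths: string[] = [];
    let latestHandle = "";

    for (const part of parts) {
        // Reference the previously uploaded parts so the server retains them
        // (e.g. { path: "a", type: "commit", value: "handleA/a" } above).
        const handleRefs: TreeEntry[] =
            previousHandle === undefined
                ? []
                : previousPaths.map((p) => ({
                      path: p,
                      type: "commit" as const,
                      value: `${previousHandle}/${p}`,
                  }));

        // Serial on purpose: each upload depends on the handle of the previous one.
        latestHandle = await uploadPartialTree([...handleRefs, ...part]);

        previousHandle = latestHandle;
        previousPaths = previousPaths.concat(part.map((e) => e.path));
    }
    return latestHandle;
}
```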
Process cancellation. This means allowing the runtime to cancel this process (or to set a timer that cancels it) in case it takes a long time. But I have doubts about this: if we are not able to upload the summary at a certain point in time, we will not be able to upload it later either, as the summary only gets bigger with time, and we do want to upload the summary for the container to function properly, which is the whole point of this issue. So maybe this is not plausible?
Thanks for sharing! A number of questions / observations:
1.) and 2.) If the process of uploading the summary gets halted partway due to connection loss, the uploaded parts will just be garbage collected at the server. "channel" trees are only valid until the next snapshot tree ("channel" or "container") gets created on the server. So the new client will join, start uploading a summary, and the previously uploaded incomplete tree will just be garbage collected.

3.) In my design, since size usually averages out across blobs, I am proposing to avoid using blob size as a heuristic and just use the number of blobs and the number of trees as the parameters. This avoids the use of size buckets. We will just upload at the data store level and maybe improve it later if we hit issues. So let's say we have a limit of n blobs and m trees; as we process data stores and count the trees and blobs inside each one, once we hit either the n or the m limit we stop adding more data stores at that point, include the current data store in our bucket, and upload that bucket in a separate network call.

4.) I think this should be ODSP-specific, because first, r11s already uploads tree by tree, and second, other drivers/servers could have different ways of referring to previous commits and other details, so one approach might not fit all drivers. As we can see in this example, for ODSP, whenever we upload a partial tree it needs to refer to the previous partial tree and include it as well, otherwise the previous partial tree would get lost.
Not looking at blob sizes will result in very uneven batches. Consider the common case of a really large text data store (Scriptor) and a lot of small data stores (a 10x10 table). The number of nodes in each data store is roughly the same, but the total payload size is vastly different. Also, given that we are moving toward aggregated blobs, the number of nodes will be even more uniform across data stores. I think a much better approach is to define a max total size (let's say 8 MB) and, when the total size would go over that limit, flush either the current bucket of blobs or the new blob we are considering adding to that bucket, whichever is bigger.
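A minimal sketch of that flush rule (hypothetical `BlobEntry` / `flush` names, not real APIs):

```typescript
// Illustrative sketch: when adding a blob would push the bucket over the size
// limit, flush whichever is bigger - the accumulated bucket or the new blob.
const maxBucketSizeBytes = 8 * 1024 * 1024; // the 8 MB limit suggested above

interface BlobEntry {
    path: string;
    sizeInBytes: number;
}

interface Bucket {
    blobs: BlobEntry[];
    sizeInBytes: number;
}

async function addBlob(
    bucket: Bucket,
    blob: BlobEntry,
    flush: (blobs: BlobEntry[]) => Promise<void>,
): Promise<Bucket> {
    if (bucket.sizeInBytes + blob.sizeInBytes <= maxBucketSizeBytes) {
        // Still under the limit: keep accumulating.
        return { blobs: [...bucket.blobs, blob], sizeInBytes: bucket.sizeInBytes + blob.sizeInBytes };
    }
    if (blob.sizeInBytes >= bucket.sizeInBytes) {
        // The new blob alone is bigger than the bucket: flush it separately
        // and keep accumulating into the existing bucket.
        await flush([blob]);
        return bucket;
    }
    // Otherwise flush the bucket and start a new one with the blob.
    await flush(bucket.blobs);
    return { blobs: [blob], sizeInBytes: blob.sizeInBytes };
}
```

With a 4 MB limit this reproduces the earlier examples: for 0.5 + 5 it flushes the 5 MB blob on its own and keeps accumulating with 0.5, and for 0.5 + 3 + 1 + 2 it flushes 0.5 + 3 and continues accumulating with 1 + 2.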
For code structuring, I'd make it as reusable as possible.
Some data as per our telemetry: for the last 30 days we haven't found any failure due to timeout, which shows this is a very rare issue. Also, for summary upload events:

Success events, 500K events total:
Percentile | Size(MB) | Blobs |
---|---|---|
99 | 1.7 | 650 |
99.5 | 3.3 | 1450 |
100 | 29 | 38K |
Unsuccessful events (not necessarily timeouts; in fact, we haven't had any failure due to timeout in the last 30 days), 5K events total:
Percentile | Size(MB) | Blobs |
---|---|---|
95 | 0.8 | 606 |
99 | 3.5 | 4k |
99.5 | 11 | 4.5K |
100 | 27 | 20K |
This data does not show any correlation with size, as we can see that the larger requests were uploaded successfully: 29 MB (success) > 27 MB (failure), and 38K blobs (success) > 20K blobs (failure). Basically, this shows that these sizes and blob counts are within the limits of upload; larger summaries should fail at some point.
Thanks for the data! Some thoughts:
Without more data, I'd say that optimizing (and understanding) our total file size and # 3-4 above are top priority.
Oh, the number of blobs actually will not be such a big problem once we have blob aggregation. So that's probably the wrong angle to optimize (though it's still useful to have all the data and input from SPO). Size is what is really worrisome.
We need to collect more data about the number of data stores uploaded per summary. Recording this telemetry data is tracked here: https://github.com/microsoft/FluidFramework/issues/5618 and assigned to @chensixx.
The design is in. Moving this to July to collect more data via the issue above and see what the priority is and whether it needs to be implemented at all.
Here are some stats for snapshots (i.e. final size of a file, uncompressed) - Data_bodySize from TreesLatest_end for the last 7 days:
- 98th percentile: 4.5 MB
- 99th percentile: 6.5 MB
- 99.5th percentile: 9.3 MB
- max: 35.7 MB
Note that while this is bad for perf, it's mostly bad for processing, memory consumption, and CPU, but not that bad for bandwidth, as ODSP will use gzip compression above a certain blob size limit, so the actual payload on the wire is smaller.
Some of the bigger files might have been created before images were moved to attachment blobs.
There are
There is a change on the Whiteboard side, close to making it to prod, that will greatly reduce snapshot sizes by moving content into attachment blobs. So we should focus on non-Whiteboard content.
Got access to File Upload Bi-Weekly Notes, and it's 12.3 / 18.8 Mb (compressed / uncompressed).
And yes, it has images in old format, embedded inline. :(
Here is my proposal to make investigations like that simpler:
While it will not answer all the questions, it will help us do filtered searches in telemetry, like files created over the last 3 months (when images were changed), and thus get a better sense of the correlation to certain features. It will not be an exact science (maybe files just get bigger over time, so filtering on time / runtime version used to create a file tells us less about the feature set used than about the lifetime of the file). But certain properties (like the number of summarizations for a file) will help us peel the onion.
Other than that, I propose we do not attack the underlying problem until we have better data suggesting it is a problem.
I'd also add that I've focused on total file size / snapshot size. But even with really big files, we could make summaries smaller by ensuring we do not use safe summary (as often as we do). One of the latest (0.42?) changes by Arin makes an improvement here; we need to continue observing telemetry in that space.
And a more overall view of the problem space:
Issue #6551 has been opened to directly track improvements in telemetry. Closing this issue, as we are not going to attack the initial problem it raised, at least for now, until more data exists to convince us it's needed.
Other issues listed above, making sure we have references:
Worth noting that the data for loaderVersion = 0.47.1 looks much better than for previous versions. I assume that's the impact of the changes to noop scheduling.
Today, we may fail to summarize a document if the summary is large. For example, you can take a look at the discussion in https://office.visualstudio.com/OC/_workitems/edit/4883680, where summary upload / server processing could take 2 minutes, and it runs into coherency issues on the server, as there is a good chance PUSH is pushing ops to storage during this time, pretty much ensuring summary failure.
While other parts of the system need to be looked at as well (including reducing the frequency with which PUSH flushes ops to storage), we also need a more incremental way of uploading summaries.
We already could do it by
I'd propose that we use "simple" logic that is
Note - uploads should not be parallel, as that would defeat the purpose and cause coherency failures on the storage side. Also, we need some bail-out logic, likely leaving it up to the container runtime to cancel the whole process by passing a cancellation token.
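A minimal sketch of that bail-out logic, assuming the container runtime passes a standard AbortSignal and a hypothetical `uploadPart` function:

```typescript
// Illustrative sketch: serial upload loop that checks a cancellation signal
// before each network call. `uploadPart` is a placeholder for whatever upload
// function is ultimately used.
async function uploadWithCancellation(
    parts: unknown[],
    uploadPart: (part: unknown) => Promise<string>,
    signal: AbortSignal,
): Promise<string[]> {
    const handles: string[] = [];
    for (const part of parts) {
        if (signal.aborted) {
            // The runtime decided the process is taking too long; abandon the
            // remaining uploads and surface the cancellation to the caller.
            throw new Error("Summary upload cancelled");
        }
        // Uploads stay serial - parallel requests would cause coherency
        // failures on the storage side.
        handles.push(await uploadPart(part));
    }
    return handles;
}
```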
Need to collect more data via this issue.
[x] #5618