Note that 99.999% of summaries are incremental summaries:
```
union Office_Fluid_FluidRuntime_Performance
| where Data_eventName == "fluid:telemetry:Summarizer:Running:Summarize_end"
| summarize count() by Data_fullTree
```
Data_fullTree | count_ |
---|---|
0 | 4699458 |
945 | |
1 | 4 |
Here is the number of data stores summarized:
```
union Office_Fluid_FluidRuntime*
| where Data_eventName contains "GenerateSummary"
| summarize percentile(Data_summarizedDataStoreCount, 50), percentile(Data_summarizedDataStoreCount, 90), percentile(Data_summarizedDataStoreCount, 95), percentile(Data_summarizedDataStoreCount, 99)
```
percentile_50 | percentile_90 | percentile_95 | percentile_99 |
---|---|---|---|
2 | 25 | 43 | 107 |
And this is the total number of data stores in a file at the time the summary is made (Data_dataStoreCount):
percentile_50 | percentile_90 | percentile_95 | percentile_99 |
---|---|---|---|
4 | 70 | 107 | 391 |
If I add this condition to the previous queries (to remove just-created files, as they are almost always tiny), I get a somewhat worse picture:
- Data_referenceSequenceNumber > 1000: P95 Data_summarizedDataStoreCount = 205, Data_dataStoreCount = 1143
- Data_referenceSequenceNumber > 10,000: P95 Data_summarizedDataStoreCount = 496, Data_dataStoreCount = 2721
What is interesting is that the data improves a bit once you go over 100K ops (P95 in particular looks much better! Not sure why). But overall, the trends stabilize and stay the same.
Can't easily get the size, as driver events do not include referenceSequenceNumber (though it can be obtained through joins).
This is telling:
```
union Office_Fluid_FluidRuntime*
| where Data_eventName contains "GenerateSummary"
| where Data_summarizedDataStoreCount > 100
| where Data_referenceSequenceNumber > 1000
| where Data_opsSinceLastSummary < 100
| summarize count() by Data_opsSinceLastSummary
| render linechart
```
Basically, for files that are not very new (over 1000 ops, though the same is true without this filter), when we look at large summaries (over 100 data stores summarized), the average number of ops since the last summary is 124, but there is a huge number of cases with fewer than 20 ops to summarize. Yet we summarize over 100 data stores!
That's clearly against expectations (given that full-tree summaries are almost nonexistent): most of these data stores did not have an op to summarize, so they should not be re-summarized! This holds even if a data store was created in between summaries, since its creation op is included in the count.
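To make that expectation concrete, here is a minimal sketch (illustrative names and shapes, not the actual Fluid runtime code) of the incremental behavior being assumed above: a node that saw no ops since its last summary should reuse its previous summary via a handle instead of producing a new subtree.

```typescript
// Illustrative sketch only; names are assumptions, not the runtime implementation.
interface SketchSummaryResult {
    reused: boolean;              // true => a handle to the previous summary was submitted
    summarizedNodeCount: number;  // contributes to counts like summarizedDataStoreCount
}

class SketchSummarizerNode {
    private lastSummarySequenceNumber = -1;
    private latestOpSequenceNumber = -1;

    public recordOp(sequenceNumber: number): void {
        this.latestOpSequenceNumber = sequenceNumber;
    }

    public summarize(fullTree: boolean): SketchSummaryResult {
        const changed = this.latestOpSequenceNumber > this.lastSummarySequenceNumber;
        if (!fullTree && !changed) {
            // No ops since the last summary: reuse the previous summary, do not re-summarize.
            return { reused: true, summarizedNodeCount: 0 };
        }
        this.lastSummarySequenceNumber = this.latestOpSequenceNumber;
        return { reused: false, summarizedNodeCount: 1 };
    }
}
```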
@agarwal-navin, any chance you can take a look? Is there an attack plan we can come up with to make some progress on this (see the latest comment with the graph)? While I do not want you to be side-tracked by this (please keep focusing on GC), it would be great if you could help with ideas / an action plan to add telemetry / tests / some other type of validation, so we can be sure we are moving in the right direction to understand the problem and eventually address it. I'll assign it to you for Feb, but feel free to push back. It's not urgent, but at the same time, if others can help (with your guidance) to make progress, that would be great!
Those are some interesting stats. I will take a look. I remember there was a bug I fixed some time back where we were taking one extra summary to realize that a data store did not change. Let me see what that bug was and if I can find anything obvious.
One thing we should probably add to the summary telemetry: whether this is the first summary by this client. That might help us find issues where we do not initialize from the base snapshot correctly.
I had created this issue some time back to add testing for this scenario - https://github.com/microsoft/FluidFramework/issues/8498. It might be a good time for someone to pick this up.
I looked at the data for a few documents via this query:
```
union Office_Fluid_FluidRuntime_*
| where Data_docId == <docId from the above queries>
| where Data_eventName contains "GenerateSummary" or Data_eventName contains "GarbageCollection"
| project Event_Time, Data_eventName, Data_opsSinceLastSummary, Data_summarizedDataStoreCount, Data_dataStoreCount, Data_totalDataStores, Data_deletedDataStores, Data_containerId
```
I see this pattern:
My theory is that GC is messing up when it loads data from the base snapshot. When running a summary, it thinks that the GC state changed since the last summary, so it summarizes again. I will try writing a couple of tests to verify this.
Update: Scratch most of this, as totalAttempts does not track total attempts across the session, only across one trySummarize() invocation :(
It's a pain in the butt to figure out all the props available on events, but we do have a summary count on these events:
```typescript
const summarizeProps: ITelemetryProperties = {
    summarizeReason,
    summarizeTotalAttempts: totalAttempts,
    summarizeAttemptsPerPhase: attemptPerPhase,
    summarizeAttemptPhase: attemptPhase + 1, // make everything 1-based
    ...options,
};
```
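For reference, a minimal sketch (with an assumed, simplified logger interface rather than the exact Fluid API) of how props like these could be spread onto the event that gets logged and then show up in the queries above as Data_summarizeTotalAttempts and friends:

```typescript
// Illustrative only: a simplified logger shape and a call that spreads the
// summarize props onto a single telemetry event. Names here are assumptions,
// not the actual Fluid runtime code.
interface SketchTelemetryLogger {
    sendTelemetryEvent(event: { eventName: string; [key: string]: string | number | boolean | undefined }): void;
}

function logGenerateSummary(
    logger: SketchTelemetryLogger,
    summarizeProps: Record<string, string | number | boolean | undefined>,
    summarizedDataStoreCount: number,
    opsSinceLastSummary: number,
): void {
    logger.sendTelemetryEvent({
        ...summarizeProps, // summarizeTotalAttempts, summarizeAttemptPhase, ...
        eventName: "GenerateSummary",
        summarizedDataStoreCount,
        opsSinceLastSummary,
    });
}
```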
And looking at this query:
```
union Office_Fluid_FluidRuntime*
| where Data_eventName contains "GenerateSummary"
| where Data_summarizedDataStoreCount > 100
| summarize count() by Data_summarizeTotalAttempts
| render linechart
```
Gives us this nice chart:
Kusto uses weird scaling, but it's this table:
summarizeTotalAttempts | count_ |
---|---|
1 | 82431 |
2 | 80 |
3 | 3 |
19 | |
4 | 1 |
The data is suspect though, as I'd expect to see a long tail. But when I remove the filter and run this query:
```
union Office_Fluid_FluidRuntime_Generic
| where Data_eventName == "fluid:telemetry:Summarizer:Running:GenerateSummary"
| summarize count() by Data_summarizeTotalAttempts
| render linechart
```
I get only 5 attempts max, which is too low. Cross-checking it against this query suggests that the data above is indeed wrong:
```
union Office_Fluid_FluidRuntime_Generic
| where Data_eventName == "fluid:telemetry:Summarizer:Running:GenerateSummary"
| summarize count() by Data_containerId
| summarize count() by count_
| render linechart
```
Opened https://github.com/microsoft/FluidFramework/issues/8972 to track telemetry part of question.
`summarizeTotalAttempts` starts from 1 every time a client summarizes, right? Even for subsequent summaries after the first one by a client, it will start at 1. It might be useful to indicate whether this is the very first summary by a client or not. We can get that information by looking at the containerId, though.
I found one bug which was introduced by one of my recent GC changes - https://github.com/microsoft/FluidFramework/pull/8389. The garbage collector reads the GC blobs from data store summary trees and converts them to a different format before passing them to data stores during creation. However, it does not sort the used routes before passing them to the data stores, so the following comparison in SummarizerNodeWithGC sees a mismatch, resulting in re-summarization:
```typescript
return this.referenceUsedRoutes === undefined ||
    JSON.stringify(this.usedRoutes) !== JSON.stringify(this.referenceUsedRoutes);
```
I will fix this and add some tests.
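A sketch of the kind of fix described above (assumed shape, not the actual PR): make the comparison order-insensitive by normalizing the route order before storing/comparing, so ordering differences alone do not register as a GC state change.

```typescript
// Illustrative sketch; the real fix lives in the garbage collector / SummarizerNodeWithGC.
function normalizeRoutes(routes: readonly string[]): string[] {
    // Sort a copy so that two route lists with the same members compare equal.
    return [...routes].sort();
}

function hasUsedStateChanged(
    usedRoutes: readonly string[],
    referenceUsedRoutes: readonly string[] | undefined,
): boolean {
    return referenceUsedRoutes === undefined ||
        JSON.stringify(normalizeRoutes(usedRoutes)) !==
        JSON.stringify(normalizeRoutes(referenceUsedRoutes));
}
```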
Thanks for looking into it! Can you please play with one of the samples (the TODO sample might be a good example, as it allows creating data stores, I think)? If it's a trivial case, it should be easy to repro, as well as to validate whether that one fix addresses it.
Maybe a stupid idea, but what if we add an assert like this in summarizeCore():
```typescript
assert(fullTree || summaryData.summaryStats.summarizedDataStoreCount <= opsSinceLastSummary);
```
or maybe just error telemetry (for starters).
It feels like good validation and should catch the issue pretty quickly... Can GC processes ever break this equation? If so, is that a problem? It feels like the system needs to promise that it scales with the number of changes and does not have points that cause really huge spikes.
I repro'd it by writing a test and verified with a temporary fix that it works. I can also try it with a sample.
Yes, GC can break this equation if there are ops that add or remove references to data stores. For instance, if there is only one op (for dataStore1) since the last summary and it adds a reference to dataStore2, then there will be 2 data stores that are summarized. Practically though, that should not happen a lot. Maybe the following check would be safer:
```typescript
assert(fullTree || summaryData.summaryStats.summarizedDataStoreCount <= 2 * opsSinceLastSummary);
```
Super! Happy we made quick progress, but we need to protect it; it's not great that we have such big regressions. I'll take any guards :) Even a 2x assert would go a long way! That said (subject to offline follow-up), I'd love to learn (or re-learn) why another data store needs to be re-summarized in such a case.
Yes, I agree. Tests in this area have been pending for a long time, and I think we need to add them now.
We could add a 2x assert and also log an error if `summarizedDataStoreCount > opsSinceLastSummary`. We can then add alerts when we see the number of hits for this go up.
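As a sketch of what that guard could look like (illustrative names and a simplified logger callback, not the actual runtime code): log an error past the 1x threshold and assert only past the more conservative 2x threshold, skipping full-tree summaries.

```typescript
// Illustrative sketch of the proposed guard; not the actual Fluid implementation.
function validateSummarizedDataStoreCount(
    summarizedDataStoreCount: number,
    opsSinceLastSummary: number,
    fullTree: boolean,
    logError: (event: { eventName: string; summarizedDataStoreCount: number; opsSinceLastSummary: number }) => void,
): void {
    if (fullTree) {
        // Full-tree summaries legitimately touch every data store.
        return;
    }
    if (summarizedDataStoreCount > opsSinceLastSummary) {
        // Soft signal for alerting; expected to fire for legitimate GC-driven
        // re-summarization (reference changes), so it is not an assert.
        logError({
            eventName: "SummarizedDataStoreCountExceedsOps",
            summarizedDataStoreCount,
            opsSinceLastSummary,
        });
    }
    if (summarizedDataStoreCount > 2 * opsSinceLastSummary) {
        // The stricter 2x guard discussed above.
        throw new Error("summarizedDataStoreCount > 2 * opsSinceLastSummary");
    }
}
```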
> That said (subject to offline follow-up), I'd love to learn (or re-learn) why another data store needs to be re-summarized in such a case.
Sure, let's sync up offline. The gist is that ODSP needs the `unreferenced` flag to be part of the `ISummaryTree`. If we add it as a property of an `ISummaryHandle` that refers to an `ISummaryTree` in the previous summary, ODSP cannot get that tree and then update the `unreferenced` flag. We could revisit this discussion now and see if we can change this implementation.
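For context, here are simplified shapes (not the full protocol-definitions interfaces) that illustrate the constraint described above: the unreferenced flag lives on a summary tree, while a handle only points at a subtree path in the previous summary and carries no such flag, so flipping the flag forces the tree to be re-submitted instead of reusing a handle.

```typescript
// Simplified, illustrative shapes only; see the actual ISummaryTree / ISummaryHandle
// definitions in the Fluid protocol definitions for the real types.
interface SketchSummaryTree {
    type: "Tree";
    tree: { [path: string]: SketchSummaryTree | SketchSummaryHandle };
    unreferenced?: true; // GC marks unreferenced data stores here.
}

interface SketchSummaryHandle {
    type: "Handle";
    handle: string; // Path to the corresponding subtree in the previous summary.
}
```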
Added a few tests to validate these scenarios - https://github.com/microsoft/FluidFramework/pull/8983. They currently fail and will pass with the fix.
@agarwal-navin, it turns out there is Data_summaryGenTag, which tracks the count of summaries for a given client.
And the data suggests Navin's guess is right:
```
union Office_Fluid_FluidRuntime*
| where Data_eventName contains "GenerateSummary"
| where Data_summaryGenTag > 1
| summarize percentile(Data_summarizedDataStoreCount, 99)
```
This gives me 16! Looks great!
Data_summaryGenTag == 1 gives me 167!
Awesome! I have a PR with the fix - https://github.com/microsoft/FluidFramework/pull/8990.
I added an error log for when `summarizedDataStoreCount > 2 * opsSinceLastSummary`. However, I think we cannot do this, because a single op can result in a bunch of data stores being deleted. For instance, a data store can have references to 100s of data stores, which can in turn have references to more data stores. Removing the reference to the first data store will result in all of these data stores becoming unreferenced.
In the following logs, between summary number 25 and 26 there are 26 ops, and 105 data stores are re-summarized. Also, the GC log says that 103 data stores newly became unreferenced. I bet 1 or 2 big tables were deleted, which resulted in 103 data stores becoming unreferenced.
Event_Time | Data_eventName | Data_opsSinceLastSummary | Data_summarizedDataStoreCount | Data_deletedDataStores | Data_totalDataStores | Data_dataStoreCount | Data_summaryGenTag |
---|---|---|---|---|---|---|---|
2022-01-24T01:27:04.158Z | fluid:telemetry:Summarizer:Running:GarbageCollection_end | | | 0 | 107 | | 25 |
2022-01-24T01:27:05.184Z | fluid:telemetry:Summarizer:Running:GenerateSummary | 85 | 8 | | | 107 | 25 |
2022-01-24T01:27:51.402Z | fluid:telemetry:Summarizer:Running:GarbageCollection_end | | | 103 | 107 | | 26 |
2022-01-24T01:27:54.197Z | fluid:telemetry:Summarizer:Running:GenerateSummary | 26 | 105 | | | 107 | 26 |

This is a valid scenario and can happen quite often IMO.
We should definitely follow up on the need to re-summarize a data store when its reference state changes. That would make catching issues like this relatively simple.
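To illustrate why a single op can unreference many data stores at once, here is a small reachability sketch (illustrative only, not the actual GC implementation): referenced data stores are those reachable from the root, so removing one edge can cut off an entire subtree in one step.

```typescript
// Illustrative sketch of GC-style reachability over data store references.
function computeReferenced(
    root: string,
    edges: Map<string, string[]>, // outbound references per data store
): Set<string> {
    const referenced = new Set<string>([root]);
    const stack = [root];
    while (stack.length > 0) {
        const node = stack.pop()!;
        for (const next of edges.get(node) ?? []) {
            if (!referenced.has(next)) {
                referenced.add(next);
                stack.push(next);
            }
        }
    }
    return referenced;
}

// Example: root -> table, table -> 100 rows. Deleting the single root -> table
// reference (one op) makes the table and all of its rows unreferenced at once.
```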
This query:
```
union Office_Fluid_FluidRuntime*
| where Data_eventName contains "GenerateSummary"
| where Data_summaryGenTag != 1
| where Data_summarizedDataStoreCount > 2 * Data_opsSinceLastSummary
| summarize count()
```
Gives me 1688 hits in 30 days. If you go with a 10x multiplier, you get 48K for the 1st summary (this bug) and 186 for non-first summaries.
Here is my take (without deeply understanding it): this design feels like a bug. Deleting a table should not re-summarize hundreds of data stores. So I'd argue an error (maybe with a higher multiplier, to not spam logging) is the right thing to do, because it brings attention to the problem. But it's only an OK direction if we intend to make it better at some point in the future.
Yes, I agree. Although it was an intentional design choice, we should revisit and fix it if possible. I created an issue to track this - https://github.com/microsoft/FluidFramework/issues/9009
These numbers are not great:
Part of that is that I disabled incremental summaries at the data store level to allow blob aggregation to work correctly (but given that blob aggregation was never enabled, we should reconsider that). That said, it does not (alone, or even in large part) explain the tail - it's just too many blobs!
I believe that's the primary driving force (in addition to Sequence not having efficient differential summaries) behind large summaries:
That's a lot of bytes! 314K on average, to be exact!