microsoft / FluidFramework

Library for building distributed, real-time collaborative web applications
https://fluidframework.com
MIT License

Summaries are large: We summarize data stores that did not have changes #8963

Closed. vladsud closed this issue 2 years ago.

vladsud commented 2 years ago

These numbers are not great:

union Office_Fluid_FluidRuntime_Performance | where Data_eventName == "fluid:telemetry:OdspDriver:uploadSummary_end" | summarize percentile(Data_blobs, 50), percentile(Data_blobs, 90), percentile(Data_blobs, 95), percentile(Data_blobs, 99)

| percentile_blobs_50 | percentile_blobs_90 | percentile_blobs_95 | percentile_blobs_99 |
| --- | --- | --- | --- |
| 16 | 191 | 361 | 962 |

Part of that is that I disabled incremental summaries at the data store level to allow blob aggregation to work correctly (but given that blob aggregation was never enabled, we should reconsider that). That said, it does not explain the tail, alone or even in large part: there are just too many blobs!

I believe that's the primary driving force (in addition to Sequence not having efficient differential summaries) behind large summaries:

| percentile_size_50 | percentile_size_90 | percentile_size_95 | percentile_size_99 |
| --- | --- | --- | --- |
| 29,352 | 475,858 | 1,142,372 | 5,781,269 |

That's a lot of bytes! 314K on average, to be exact!

vladsud commented 2 years ago

Note that 99.999% of summaries are incremental summaries:

union Office_Fluid_FluidRuntime_Performance | where Data_eventName == "fluid:telemetry:Summarizer:Running:Summarize_end" | summarize count() by Data_fullTree

| Data_fullTree | count_ |
| --- | --- |
| 0 | 4699458 |
|  | 945 |
| 1 | 4 |
vladsud commented 2 years ago

Here is the number of data stores summarized:

union Office_FluidFluidRuntime* | where Data_eventName contains "GenerateSummary" | summarize percentile(Data_summarizedDataStoreCount, 50), percentile(Data_summarizedDataStoreCount, 90), percentile(Data_summarizedDataStoreCount, 95), percentile(Data_summarizedDataStoreCount, 99)

| percentile_50 | percentile_90 | percentile_95 | percentile_99 |
| --- | --- | --- | --- |
| 2 | 25 | 43 | 107 |

And this is the total number of data stores in a file when the summary is made (Data_dataStoreCount):

| percentile_50 | percentile_90 | percentile_95 | percentile_99 |
| --- | --- | --- | --- |
| 4 | 70 | 107 | 391 |
vladsud commented 2 years ago

If I add this condition to the previous queries (to remove just-created files, as they are almost always tiny), I get a slightly worse picture:

- Data_referenceSequenceNumber > 1000, P95: Data_summarizedDataStoreCount = 205, Data_dataStoreCount = 1143
- Data_referenceSequenceNumber > 10,000, P95: Data_summarizedDataStoreCount = 496, Data_dataStoreCount = 2721

What is interesting is that the data improves a bit once you go over 100K ops (P95 in particular looks much better; not sure why). But the overall trends stabilize and stay the same.

Can't easily get size, as driver events do not include referenceSequenceNumber (though it can be obtained through joins).

vladsud commented 2 years ago

This is telling:

union Office_FluidFluidRuntime* | where Data_eventName contains "GenerateSummary" | where Data_summarizedDataStoreCount > 100 | where Data_referenceSequenceNumber > 1000 | where Data_opsSinceLastSummary < 100 | summarize count() by Data_opsSinceLastSummary | render linechart

[line chart: count of summaries by Data_opsSinceLastSummary]

Basically, for files that are not very new (over 1000 ops, but the same is true without this filter), when we look at large summaries (over 100 data stores summarized), the average number of ops since the last summary is 124, but there is a huge number of cases with fewer than 20 ops to summarize. Yet we summarize over 100 data stores!

That's clearly against expectations (given that we almost never do full summaries): most of the data stores did not have an op to summarize, so they should not be summarized! This holds even if a data store was created between summaries, since its creation op would have been counted.
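For context, a minimal sketch of the expected incremental behavior, with hypothetical names (this is not the actual SummarizerNode implementation): a data store that saw no ops since the last acked summary should hand back a handle to its previous tree instead of re-summarizing.

    // Hypothetical sketch of the expected incremental-summary decision; the names
    // (lastOpSequenceNumber, lastSummarySequenceNumber, previousSummaryPath) are
    // illustrative, not the actual runtime API.
    interface DataStoreSummaryState {
        lastOpSequenceNumber: number;       // sequence number of the last op this data store processed
        lastSummarySequenceNumber: number;  // reference sequence number of the last acked summary
        previousSummaryPath: string;        // path of this data store's tree in the previous summary
    }

    function summarizeDataStore(state: DataStoreSummaryState, fullTree: boolean): { type: string; handle?: string } {
        if (!fullTree && state.lastOpSequenceNumber <= state.lastSummarySequenceNumber) {
            // No changes since the last acked summary: point at the previous tree with
            // a handle instead of re-uploading this data store's blobs.
            return { type: "handle", handle: state.previousSummaryPath };
        }
        // Otherwise produce a full tree for this data store (elided here).
        return { type: "tree" };
    }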

vladsud commented 2 years ago

@agarwal-navin, any chance you can take a look? Is there an attack plan we can come up with to make some progress on this (see the latest comment with the graph)? While I do not want you to be side-tracked by this (keep focusing on GC), if you can help with ideas / an action plan to add telemetry / tests / some other type of validation to ensure we are moving in the right direction to understand the problem and eventually address it, that would be great! I'll assign it to you for Feb, but feel free to push back. It's not urgent, but at the same time, if others can help (with your guidance) to make progress, that would be great!

agarwal-navin commented 2 years ago

Those are some interesting stats. I will take a look. I remember there was a bug I fixed some time back where we were taking one extra summary to realize that a data store did not change. Let me see what that bug was and if I can find anything obvious.

One thing we should probably add to the summary telemetry: is this the first summary by this client? That might help us find issues where we do not initialize from the base snapshot correctly.
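A rough sketch of what that telemetry property could look like (names are hypothetical, not existing runtime fields):

    // Illustrative sketch only: tag summarize telemetry with whether this is the first
    // summary attempted by this client since it loaded from the base snapshot, which is
    // where initialization bugs would show up. summaryCountForThisClient is a
    // hypothetical per-client counter.
    function buildSummarizeTelemetryProps(summarizeReason: string, summaryCountForThisClient: number) {
        return {
            summarizeReason,
            firstSummaryByClient: summaryCountForThisClient === 1,
        };
    }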

agarwal-navin commented 2 years ago

I had created this issue some time back to add testing for this scenario - https://github.com/microsoft/FluidFramework/issues/8498. It might be a good time for someone to pick this up.

agarwal-navin commented 2 years ago

I looked at the data for a few documents via this query:

union Office_Fluid_FluidRuntime_*
| where Data_docId == <docId from the above queries>
| where Data_eventName contains "GenerateSummary" or Data_eventName contains "GarbageCollection"
| project Event_Time, Data_eventName, Data_opsSinceLastSummary, Data_summarizedDataStoreCount, Data_dataStoreCount, Data_totalDataStores, Data_deletedDataStores, Data_containerId

I see this pattern:

My theory is that GC is messing up when it loads data from the base snapshot. When running a summary, it thinks that the GC state changed since the last summary, so it summarizes again. I will try writing a couple of tests to verify this.

vladsud commented 2 years ago

Update: Scratch most of this, as totalAttempts does not track total attempts across the session, only across one trySummarize() invocation :(

It's a pain in the butt to figure out all the props available on events, but we do have summary count on these events:

    const summarizeProps: ITelemetryProperties = {
        summarizeReason,
        summarizeTotalAttempts: totalAttempts,
        summarizeAttemptsPerPhase: attemptPerPhase,
        summarizeAttemptPhase: attemptPhase + 1, // make everything 1-based
        ...options,
    };

And looking at this query:

union Office_FluidFluidRuntime* | where Data_eventName contains "GenerateSummary" | where Data_summarizedDataStoreCount > 100 | summarize count() by Data_summarizeTotalAttempts | render linechart

Gives us this nice chart:

[line chart: count of summaries by Data_summarizeTotalAttempts]

Kusto uses weird scaling, but it's this table:

| summarizeTotalAttempts | count_ |
| --- | --- |
| 1 | 82431 |
| 2 | 80 |
| 3 | 3 |
|  | 19 |
| 4 | 1 |

The data is suspect though, as I'd expect to see a long tail. But when I remove the filter and run this query:

union Office_Fluid_FluidRuntime_Generic | where Data_eventName == "fluid:telemetry:Summarizer:Running:GenerateSummary" | summarize count() by Data_summarizeTotalAttempts | render linechart

I get only 5 attempts max, which is too low. Cross-checking against this query suggests that the data above is indeed wrong:

union Office_Fluid_FluidRuntime_Generic | where Data_eventName == "fluid:telemetry:Summarizer:Running:GenerateSummary" | summarize count() by Data_containerId | summarize count() by count_ | render linechart

[line chart: distribution of GenerateSummary event counts per container]

Opened https://github.com/microsoft/FluidFramework/issues/8972 to track the telemetry part of this.

agarwal-navin commented 2 years ago

summarizeTotalAttempts starts from 1 every time a client summarizes, right? Even for subsequent summaries after the first one by a client, it will start at 1. It might be useful to indicate whether this is the very first summary by a client or not. We can get that information by looking at the containerId, though.

agarwal-navin commented 2 years ago

I found one bug which was introduced by one of my recent GC changes - https://github.com/microsoft/FluidFramework/pull/8389. The garbage collector reads the GC blobs from data store summary trees and converts them to a different format before passing them to data stores during creation. However, it does not sort the used routes before passing them to the data stores, so the following comparison in SummarizerNodeWithGC reports a change, resulting in re-summarization:

    return this.referenceUsedRoutes === undefined ||
        JSON.stringify(this.usedRoutes) !== JSON.stringify(this.referenceUsedRoutes);
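A minimal sketch of the kind of order-insensitive comparison that would avoid the spurious mismatch (just the idea, not the actual PR):

    // Sketch: compare used routes order-insensitively so that a difference in ordering
    // alone does not look like a GC state change and trigger re-summarization.
    function usedRoutesChanged(
        currentUsedRoutes: readonly string[],
        referenceUsedRoutes: readonly string[] | undefined,
    ): boolean {
        if (referenceUsedRoutes === undefined) {
            return true;
        }
        const sortedCurrent = [...currentUsedRoutes].sort();
        const sortedReference = [...referenceUsedRoutes].sort();
        return JSON.stringify(sortedCurrent) !== JSON.stringify(sortedReference);
    }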

I will fix this and add some tests.

vladsud commented 2 years ago

Thanks for looking into it! Can you please play with one of the samples (the TODO sample might be a good example, as it allows creating data stores, I think)? If it's a trivial case, it should be easy to repro, as well as to validate whether that one fix addresses it.

vladsud commented 2 years ago

Maybe a stupid idea, but what if we add an assert like this in summarizeCore():

assert(fullTree || summaryData.summaryStats.summarizedDataStoreCount <= opsSinceLastSummary);

or maybe just error telemetry (for starters).

It feels like good validation and should capture the issue pretty quickly... Can GC processing ever break this invariant? If so, is that a problem? It feels like the system needs to promise that it scales with the number of changes and does not have any points that cause really huge spikes.

agarwal-navin commented 2 years ago

I repro'd it by writing a test and verified with a temporary fix that it works. I can also try it with a sample.

Yes, GC can break this invariant if there are ops that add or remove references to data stores. For instance, if the only op since the last summary (for dataStore1) adds a reference to dataStore2, then two data stores will be summarized. Practically though, that should not happen a lot. Maybe the following check would be safer:

    assert(fullTree || summaryData.summaryStats.summarizedDataStoreCount <= 2 * opsSinceLastSummary);

vladsud commented 2 years ago

Super! Happy we made quick progress, but we need to protect it; it's not great that we have such big regressions. I'll take any guards :) Even a 2x assert would go a long way! That said (subject to offline follow-up), I'd love to learn (or re-learn) why another data store needs to be re-summarized in such a case.

agarwal-navin commented 2 years ago

Yes, I agree. Tests in this area have been pending for a long time, and I think we need to add them now. We could add a 2x assert and also log an error if summarizedDataStoreCount > opsSinceLastSummary. We can then add alerts when we see the number of hits for this go up.
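A sketch of what such a guard could look like (the event name, thresholds, and logger shape are placeholders, not the exact runtime telemetry API):

    import assert from "assert";

    // Sketch of the proposed guard: assert on a generous (2x) bound and log an error on
    // the stricter (1x) bound so it can be alerted on.
    interface IMinimalLogger {
        sendErrorEvent(event: { eventName: string; [key: string]: unknown }): void;
    }

    function validateSummaryStats(
        fullTree: boolean,
        summarizedDataStoreCount: number,
        opsSinceLastSummary: number,
        logger: IMinimalLogger,
    ): void {
        if (fullTree) {
            return;
        }
        // Hard guard: summarizing more than 2x the number of ops is almost certainly a bug.
        assert(summarizedDataStoreCount <= 2 * opsSinceLastSummary, "too many data stores summarized");
        // Soft guard: log anything above 1x so the trend can be watched and alerted on.
        if (summarizedDataStoreCount > opsSinceLastSummary) {
            logger.sendErrorEvent({
                eventName: "SummarizedMoreDataStoresThanOps",
                summarizedDataStoreCount,
                opsSinceLastSummary,
            });
        }
    }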

agarwal-navin commented 2 years ago

> That said (subject to offline follow-up), I'd love to learn (or re-learn) why another data store needs to be re-summarized in such a case.

Sure, let's sync up offline. The gist is that ODSP needs the unreferenced flag to be part of the ISummaryTree. If we instead add it as a property of an ISummaryHandle which refers to an ISummaryTree in a previous summary, ODSP cannot get that tree and then update the unreferenced flag. We could revisit this discussion now and see if we can change this implementation.
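To make the constraint concrete, a rough illustration (the shapes only approximate ISummaryTree / ISummaryHandle from protocol-definitions; treat the details, including the path, as assumptions):

    // Rough illustration of why flipping the unreferenced flag forces a full tree today.
    // String literals stand in for the SummaryType enum; the path is hypothetical.

    // Ideally, an unchanged-but-now-unreferenced data store would be summarized as a
    // cheap handle pointing at its tree in the previous summary:
    const incrementalEntry = {
        type: "Handle",
        handle: "/dataStore1", // path of the data store's tree in the previous summary
    };

    // But the unreferenced marker lives on the tree itself, so to change it the runtime
    // has to re-send the whole tree instead of the handle above:
    const fullEntry = {
        type: "Tree",
        unreferenced: true,
        tree: {
            // ...all of the data store's blobs and subtrees again (elided)
        },
    };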

agarwal-navin commented 2 years ago

Added a few tests to validate these scenarios - https://github.com/microsoft/FluidFramework/pull/8983. They currently fail and will pass with the fix.

vladsud commented 2 years ago

@agarwal-navin, it turns out there is Data_summaryGenTag, which tracks the count of summaries for a given client.

vladsud commented 2 years ago

And the data suggests Navin's guess is right:

union Office_FluidFluidRuntime* | where Data_eventName contains "GenerateSummary" | where Data_summaryGenTag > 1 | summarize percentile(Data_summarizedDataStoreCount, 99)

This gives me 16! Looks great!

Data_summaryGenTag == 1 gives me 167!

agarwal-navin commented 2 years ago

Awesome! I have a PR with the fix - https://github.com/microsoft/FluidFramework/pull/8990. I added an error log for when summarizedDataStoreCount > 2 * opsSinceLastSummary. However, I think we cannot do this, because a single op can result in a bunch of data stores being deleted. For instance, a data store can have references to hundreds of data stores, which can in turn have references to more data stores. Removing the reference to the first data store will result in all of these data stores becoming unreferenced.

In the following logs, between summary number 24 and 25, there are 26 ops and 105 data stores are re-summarized. Also, the GC log says that 103 data stores newly became unreferenced. I bet 1 or 2 big tables were deleted, which resulted in 103 data stores becoming unreferenced.

| Event_Time | Data_eventName | Data_opsSinceLastSummary | Data_summarizedDataStoreCount | Data_deletedDataStores | Data_totalDataStores | Data_dataStoreCount | Data_summaryGenTag |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2022-01-24T01:27:04.158Z | fluid:telemetry:Summarizer:Running:GarbageCollection_end |  |  | 0 | 107 |  | 25 |
| 2022-01-24T01:27:05.184Z | fluid:telemetry:Summarizer:Running:GenerateSummary | 85 | 8 |  |  | 107 | 25 |
| 2022-01-24T01:27:51.402Z | fluid:telemetry:Summarizer:Running:GarbageCollection_end |  |  | 103 | 107 |  | 26 |
| 2022-01-24T01:27:54.197Z | fluid:telemetry:Summarizer:Running:GenerateSummary | 26 | 105 |  |  | 107 | 26 |

This is a valid scenario and can happen quite often IMO.

agarwal-navin commented 2 years ago

We should definitely follow up on the need to summarize a data store when its reference state changes. That would make catching issues like this much simpler.

vladsud commented 2 years ago

This query

union Office_FluidFluidRuntime* | where Data_eventName contains "GenerateSummary" | where Data_summaryGenTag != 1 | where Data_summarizedDataStoreCount > 2 * Data_opsSinceLastSummary | summarize count()

gives me 1688 hits in 30 days. If you go with a 10x multiplier, you get 48K hits for the first summary (this bug) and 186 for non-first summaries.

Here is my take (without deeply understanding it): this design feels like a bug. Deleting a table should not re-summarize hundreds of data stores. So I'd argue an error (maybe with a higher multiplier so as not to spam logging) is the right thing to do, because it brings attention to the problem. But it's only an OK direction if we intend to make it better at some point in the future.

agarwal-navin commented 2 years ago

Yes, I agree. Although it was an intentional design choice, we should revisit and fix it if possible. I created an issue to track this - https://github.com/microsoft/FluidFramework/issues/9009