Open tublitzed opened 4 years ago
@jchaupitre I can help with the first item on behalf of @jvehent
Thanks, @ameihm0912, looking forward to working with you on this.
As a first ask, would you be able to share an existing policy of ours that is similar (even loosely) to what we are trying to do here? Otherwise, would you know whom we could contact?
@jchaupitre if you don't mind I'll grab a quick time slot next week to make sure I understand what we need; then I can start gathering resources for you. I suspect what we may want here is a mix of some high level policy statements and some of our lower level control definitions and behavioral patterns.
@davismtl, here are the summary statistics for the different collections on the decommissioned sync node 725 -- Getting statistics across all nodes will be impossible, so I'm hoping we can use this one as a sample of hopefully representative data to drive our discussions:
Forms Forms.Size
Min. : 1.0 Min. : 187
1st Qu.: 9.0 1st Qu.: 2060
Median : 115.0 Median : 25338
Mean : 1327.8 Mean : 279384
3rd Qu.: 854.8 3rd Qu.: 184872
Max. :1482688.0 Max. :350418052
NA's :103478 NA's :103478
History History.Size
Min. : 1 Min. : 187
1st Qu.: 71 1st Qu.: 38832
Median : 606 Median : 321589
Mean : 2127 Mean : 1093897
3rd Qu.: 2534 3rd Qu.: 1326376
Max. :267939 Max. :145262334
NA's :191629 NA's :191629
Bookmarks Bookmarks.Size
Min. : 1.0 Min. : 405
1st Qu.: 10.0 1st Qu.: 3870
Median : 16.0 Median : 6485
Mean : 246.4 Mean : 111028
3rd Qu.: 51.0 3rd Qu.: 22453
Max. :1573713.0 Max. :380801236
NA's :86994 NA's :86994
Prefs Prefs.Size
Min. :1 Min. : 8123
1st Qu.:1 1st Qu.: 12711
Median :1 Median : 13499
Mean :1 Mean : 13066
3rd Qu.:1 3rd Qu.: 13607
Max. :1 Max. :370899
NA's :109765 NA's :109765
Passwords Passwords.Size
Min. : 1.00 Min. : 187
1st Qu.: 2.00 1st Qu.: 1122
Median : 7.00 Median : 3764
Mean : 31.99 Mean : 17631
3rd Qu.: 27.00 3rd Qu.: 14750
Max. :6126.00 Max. :3231694
NA's :110760 NA's :110760
Addons Addons.Size
Min. : 1.00 Min. : 251
1st Qu.: 1.00 1st Qu.: 359
Median : 2.00 Median : 674
Mean : 3.83 Mean : 1318
3rd Qu.: 4.00 3rd Qu.: 1456
Max. :401.00 Max. :121667
NA's :180637 NA's :180637
@davismtl, here are the quantile stats for the above data set (i.e. decommissioned sync node 725):
% Forms Forms.Size
25% 9.00 2060.0
50% 115.00 25338.5
75% 854.75 184872.5
90% 3357.00 709535.0
99% 17973.00 3744529.4
% History History.Size
25% 71.00 38832
50% 606.00 321589
75% 2534.00 1326376
90% 5889.00 3038440
99% 18050.46 9090346
% Bookmarks Bookmarks.Size
25% 10.00 3870
50% 16.00 6485
75% 51.00 22453
90% 219.00 100038
99% 2510.11 1171953
% Prefs Prefs.Size
25% 1 12711
50% 1 13499
75% 1 13607
90% 1 14395
99% 1 15291
% Passwords Passwords.Size
25% 2 1122.00
50% 7 3764.00
75% 27 14750.25
90% 81 44575.20
99% 357 198256.64
% Addons Addons.Size
25% 1 359
50% 2 674
75% 4 1456
90% 9 2952
99% 27 9190
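For reference, per-collection cut points like the ones above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the per-user record counts have been exported into a plain list; the sample values below are illustrative, not the node's real data, and the interpolation matches R's default (type-7) `quantile`:

```python
def percentile(values, q):
    """Linear-interpolated percentile, matching R's default type-7 quantile."""
    s = sorted(values)
    pos = (len(s) - 1) * q
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

# Illustrative per-user Forms record counts (assumed, not real node data).
forms = [9, 115, 854, 3357, 17973]
for q in (0.25, 0.50, 0.75, 0.90, 0.99):
    print(f"{int(q * 100)}%  {percentile(forms, q):.2f}")
```

The same function applied to each collection's counts and sizes would yield tables in the shape shown above.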
We met with @ameihm0912 and @davismtl, and the following questions were raised:
We also discussed that if we are to land such a policy, there will be work to prevent these abusive accounts from being migrated, but ALSO work on the client / service side to prevent users from breaching the policy after the migration and on a go-forward basis.
Pinging @habib, @erkolson, and @Micheletto for visibility.
Statistics for newly decommissioned sync-node-560:
Forms Forms.Size
Min. : 1 Min. : 187
1st Qu.: 15 1st Qu.: 3405
Median : 158 Median : 35076
Mean : 2082 Mean : 421249
3rd Qu.: 1304 3rd Qu.: 276591
Max. :365491 Max. :81099789
NA's :94018 NA's :94018
History History.Size
Min. : 1 Min. : 187
1st Qu.: 84 1st Qu.: 45116
Median : 938 Median : 491200
Mean : 3070 Mean : 1559658
3rd Qu.: 3929 3rd Qu.: 2039947
Max. :823982 Max. :385440602
NA's :204294 NA's :204294
Bookmarks Bookmarks.Size
Min. : 1 Min. :2.790e+02
1st Qu.: 14 1st Qu.:5.523e+03
Median : 35 Median :1.496e+04
Mean : 771 Mean :2.616e+05
3rd Qu.: 156 3rd Qu.:6.866e+04
Max. :9856717 Max. :2.472e+09
NA's :73434 NA's :73434
Prefs Prefs.Size
Min. :1 Min. : 8083
1st Qu.:1 1st Qu.: 9895
Median :1 Median : 10963
Mean :1 Mean : 11626
3rd Qu.:1 3rd Qu.: 12839
Max. :1 Max. :412371
NA's :102756 NA's :102756
Passwords Passwords.Size
Min. : 1.00 Min. : 187
1st Qu.: 3.00 1st Qu.: 1673
Median : 12.00 Median : 6716
Mean : 54.35 Mean : 29587
3rd Qu.: 57.00 3rd Qu.: 30752
Max. :4399.00 Max. :2388249
NA's :99509 NA's :99509
Addons Addons.Size
Min. : 1.0 Min. : 251
1st Qu.: 2.0 1st Qu.: 546
Median : 3.0 Median : 1117
Mean : 6.5 Mean : 2210
3rd Qu.: 8.0 3rd Qu.: 2612
Max. :666.0 Max. :203930
NA's :164683 NA's :164683
Forms Forms.Size
25% 15.00 3405
50% 158.00 35076
75% 1304.00 276591
90% 5263.00 1072015
99% 29184.56 5786539
History History.Size
25% 84.00 45116
50% 938.00 491200
75% 3929.00 2039947
90% 8689.40 4423990
99% 23592.28 11688947
Bookmarks Bookmarks.Size
25% 14.0 5523.0
50% 35.0 14959.0
75% 156.0 68655.0
90% 563.2 253166.2
99% 5543.0 2421994.6
Prefs Prefs.Size
25% 1 9895.0
50% 1 10963.0
75% 1 12839.0
90% 1 13627.0
99% 1 15833.8
Passwords Passwords.Size
25% 3 1673.00
50% 12 6716.00
75% 57 30752.25
90% 154 84007.40
99% 508 277247.78
Addons Addons.Size
25% 2 546.00
50% 3 1117.00
75% 8 2612.00
90% 15 5120.00
99% 44 14693.22
Action items from meeting on 3/23
1.a - Submit a new telemetry request to measure typical Sync usage, so we can determine who our Sync user outliers are and define what "normal" use of Sync looks like. We want to measure data usage, sync occurrences, time to sync, number of devices, whether clients are non-Firefox, and any other relevant insights (Leif is our POC) - Julien
1.b - Determine whether we capture the same usage between client-side telemetry and the service side, so we can gauge the amount of non-Firefox client usage of Sync.
2 - Cost is definitely a reason to investigate abuse of Sync, but we also want to determine how Sync is used and abused (i.e. syncing between other browsers). What usage of Sync is deemed acceptable? - To be determined at a later time.
Action items update: 1.a submitted a new request to Data Org: https://jira.mozilla.com/browse/DO-177
Next update: We will follow up on the telemetry request with Data Org (not expected to be worked on immediately due to other priorities), and will assist them with setting it up once they can engage.
While testing whole node migrations of user data from sync-py to sync-rs, I've discovered some users take an exceptionally long time to migrate (2+ hours). I ran some queries to create a distribution of rows and users. I found that though the 99th percentile user on this particular node had ~110,000 rows, the 3 biggest users had more than 9 million rows.
There are many processes working in parallel to transfer the data, but these extreme outlier users make it very difficult to create optimal workload allocations across the parallel processes. For example, if we have enough processes to migrate the entire node in 2.5 hours based on total rows, average throughput, and an even allotment of users across all processes, the presence of one of these users in a process's allocation will cause that process to take closer to 5 hours.
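To make the skew concrete, here is a small Python sketch. All numbers (row counts, process count, throughput) are illustrative assumptions, not measured values; the point is that a single unsplittable 9M-row user puts a floor under the wall-clock time no matter how evenly everything else is balanced:

```python
# Illustrative per-user row counts: many typical users, a few heavy ones,
# and one extreme outlier (assumed numbers, not the node's real data).
users = [1_000] * 990 + [110_000] * 9 + [9_000_000]
processes = 8
throughput = 1_000  # rows/sec per migration process (assumed constant)

total = sum(users)
ideal = total / processes / throughput  # perfectly balanced allocation
floor = max(users) / throughput         # one user cannot be split further

print(f"ideal balanced time: {ideal:.0f}s")
print(f"outlier lower bound: {floor:.0f}s")
```

With these assumed numbers the outlier alone takes several times longer than the ideal balanced schedule, which matches the 2.5h-vs-5h behavior described above.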
Some food for thought as we work on a policy: we should align with the client-side implementation, since it doesn't sync all the data on a first sync. So even if we decide to keep 1 year of data or 100k records, that doesn't mean the client will ever use it.
I'm going to pick this one back up in light of the recently discovered spanner limitation. Update to come after reviewing options with product.
Adjusting the scope of this ticket to specifically track handling the spanner limitation.
From rbillings' load test:
11:51 AM
rbillings @jrconlin results: SUCCESSES: 1027241 | FAILURES: 252 | WORKERS: 800
This was on 0.6.0 with quota disabled.
~Unfortunately, rbillings will be out tomorrow and ops is currently dealing with a higher priority issue, so load testing 0.6.0 with quota enabled will have to wait for Monday, Sept 28.~
Ops was able to enable quota on storage. Load testing results:
3:35 PM rbillings SUCCESSES: 1027815 | FAILURES: 132 | WORKERS: 800 looks good!
This is in line with the prior results.
She noticed no timing differences with the quota enabled.
I'm seeing a probably unexpected failure case of the quota check from Vasilica's latest log:
https://bugzilla.mozilla.org/show_bug.cgi?id=1665917#c4
After a "successful" batch commit the sync client reports:
1601543067675 Sync.Collection DEBUG POST success 200 https://stage.sync.nonprod.cloudops.mozgcp.net/1.5/<id>/storage/bookmarks?batch=<batch_id>&commit=true
1601543067676 Sync.Engine.Bookmarks DEBUG Records that will be uploaded again because the server couldn't store them: <bso_id>, <etc>
Looking at the Sync client, "Records that will be uploaded again because the server couldn't store them" seems to indicate the write was successful but included "failed" bsos.
I believe the "failed" bucket was populated here due to the quota check failing in batch append (or post_bsos). When this happens, there's no indication to the client that the quota check failed (the "failed" bucket does include an error message, so the quota error might be included there, but it's likely ignored), which I'd consider a bug.
We've already planned to deprecate this "failed" bucket feature entirely -- I think we likely need to remove it now to fix this issue (I've added it to the checklist).
While this isn't ideal, it's probably not too much of a problem TBH - the next sync will explicitly retry those failed items once. However, if it's failing due to the quota, then the clients are forever going to retry anyway. The only "bad" thing happening here that wouldn't if this was fixed is that Firefox will write a json file with the failed guids.
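As a sketch of the failure mode being described: the commit POST returns 200, but individual records land in the "failed" bucket. The response shape below follows the storage API's success/failed buckets, but the BSO ids and the exact error text are assumptions for illustration:

```python
# Hypothetical commit response: the POST itself succeeds (HTTP 200), yet
# some BSOs end up in the "failed" bucket with a per-record error message.
response = {
    "modified": 1601543067.67,
    "success": ["bso1", "bso2"],
    "failed": {"bso3": "quota exceeded"},  # assumed error text
}

# Clients treat failed records as retryable and re-upload them on the next
# sync; if the underlying cause is quota, that retry loop never terminates.
quota_failures = [
    bso for bso, msg in response["failed"].items() if "quota" in msg.lower()
]
if quota_failures:
    print(f"quota-limited records; retrying is futile: {quota_failures}")
```

This is why surfacing the quota error distinctly (rather than burying it in the per-record "failed" messages) would let clients stop retrying.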
Spanner has a limitation on how much data we can store per user collection. We need to find a way to move forward with the migration in a way that doesn't negatively impact spanner.
Affected users tracked in 1653022
This will track: