mozilla-services / services-engineering

Services engineering core repo - Used for issues/docs/etc that don't obviously belong in another repo.

[meta] sync quota tracking bug #41

Open tublitzed opened 4 years ago

tublitzed commented 4 years ago

Spanner has a limitation on how much data we can store per user collection. We need to find a way to move forward with the migration in a way that doesn't negatively impact spanner.

Affected users are tracked in bug 1653022.

This will track:

jchaupitre commented 4 years ago
ameihm0912 commented 4 years ago

@jchaupitre I can help with the first item on behalf of @jvehent

jchaupitre commented 4 years ago

Thanks, @ameihm0912, looking forward to working with you on this.

As a first ask, could you share an existing policy of ours that is similar (in some way) to what we are trying to do here? Otherwise, would you know who we could contact?

ameihm0912 commented 4 years ago

@jchaupitre if you don't mind I'll grab a quick time slot next week to make sure I understand what we need; then I can start gathering resources for you. I suspect what we may want here is a mix of some high level policy statements and some of our lower level control definitions and behavioral patterns.

jchaupitre commented 4 years ago

@davismtl, here are the summary statistics for the different collections on the decommissioned sync node 725. Getting statistics across all nodes isn't feasible, so I'm hoping we can use this node as a sample (hopefully representative) to drive our discussions:

Forms Forms.Size
Min. : 1.0 Min. : 187
1st Qu.: 9.0 1st Qu.: 2060
Median : 115.0 Median : 25338
Mean : 1327.8 Mean : 279384
3rd Qu.: 854.8 3rd Qu.: 184872
Max. :1482688.0 Max. :350418052
NA's :103478 NA's :103478

History History.Size
Min. : 1 Min. : 187
1st Qu.: 71 1st Qu.: 38832
Median : 606 Median : 321589
Mean : 2127 Mean : 1093897
3rd Qu.: 2534 3rd Qu.: 1326376
Max. :267939 Max. :145262334
NA's :191629 NA's :191629

Bookmarks Bookmarks.Size
Min. : 1.0 Min. : 405
1st Qu.: 10.0 1st Qu.: 3870
Median : 16.0 Median : 6485
Mean : 246.4 Mean : 111028
3rd Qu.: 51.0 3rd Qu.: 22453
Max. :1573713.0 Max. :380801236
NA's :86994 NA's :86994

Prefs Prefs.Size
Min. :1 Min. : 8123
1st Qu.:1 1st Qu.: 12711
Median :1 Median : 13499
Mean :1 Mean : 13066
3rd Qu.:1 3rd Qu.: 13607
Max. :1 Max. :370899
NA's :109765 NA's :109765

Passwords Passwords.Size
Min. : 1.00 Min. : 187
1st Qu.: 2.00 1st Qu.: 1122
Median : 7.00 Median : 3764
Mean : 31.99 Mean : 17631
3rd Qu.: 27.00 3rd Qu.: 14750
Max. :6126.00 Max. :3231694
NA's :110760 NA's :110760

Addons Addons.Size
Min. : 1.00 Min. : 251
1st Qu.: 1.00 1st Qu.: 359
Median : 2.00 Median : 674
Mean : 3.83 Mean : 1318
3rd Qu.: 4.00 3rd Qu.: 1456
Max. :401.00 Max. :121667
NA's :180637 NA's :180637

jchaupitre commented 4 years ago

@davismtl, here are the quantile stats for the above data set (i.e. decommissioned sync node 725):

% Forms Forms.Size
25% 9.00 2060.0
50% 115.00 25338.5
75% 854.75 184872.5
90% 3357.00 709535.0
99% 17973.00 3744529.4

% History History.Size
25% 71.00 38832
50% 606.00 321589
75% 2534.00 1326376
90% 5889.00 3038440
99% 18050.46 9090346

% Bookmarks Bookmarks.Size
25% 10.00 3870
50% 16.00 6485
75% 51.00 22453
90% 219.00 100038
99% 2510.11 1171953

% Prefs Prefs.Size
25% 1 12711
50% 1 13499
75% 1 13607
90% 1 14395
99% 1 15291

% Passwords Passwords.Size
25% 2 1122.00
50% 7 3764.00
75% 27 14750.25
90% 81 44575.20
99% 357 198256.64

% Addons Addons.Size
25% 1 359
50% 2 674
75% 4 1456
90% 9 2952
99% 27 9190
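
For reference, here is a minimal sketch of how summary and quantile statistics like the above could be produced from a per-user export of a node. The file name and column names (`forms_count`, `forms_size`) are assumptions for illustration, not the actual export format:

```python
# Hypothetical sketch: compute summary and quantile statistics for one
# collection from a per-user CSV export of a decommissioned sync node.
# The file name and column names are assumed for illustration only.
import pandas as pd

df = pd.read_csv("node-725-collections.csv")  # one row per user

for col in ["forms_count", "forms_size"]:
    series = df[col].dropna()  # users without the collection show up as NA
    print(col)
    print(series.describe())   # min, quartiles, mean, max (like the summary tables above)
    print(series.quantile([0.25, 0.50, 0.75, 0.90, 0.99]))  # like the quantile tables above
```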

jchaupitre commented 4 years ago

We met with @ameihm0912 and @davismtl, and the following questions were raised:

We also discussed that if we land such a policy, there will be work to prevent these abusive accounts from being migrated, but ALSO work on the client/service side to prevent users from breaching the policy after the migration and on an ongoing basis.

Pinging @habib, @erkolson, and @Micheletto for visibility.

jchaupitre commented 4 years ago

Statistics for newly decommissioned sync-node-560:

Forms Forms.Size
Min. : 1 Min. : 187
1st Qu.: 15 1st Qu.: 3405
Median : 158 Median : 35076
Mean : 2082 Mean : 421249
3rd Qu.: 1304 3rd Qu.: 276591
Max. :365491 Max. :81099789
NA's :94018 NA's :94018

History History.Size
Min. : 1 Min. : 187
1st Qu.: 84 1st Qu.: 45116
Median : 938 Median : 491200
Mean : 3070 Mean : 1559658
3rd Qu.: 3929 3rd Qu.: 2039947
Max. :823982 Max. :385440602
NA's :204294 NA's :204294

Bookmarks Bookmarks.Size
Min. : 1 Min. :2.790e+02
1st Qu.: 14 1st Qu.:5.523e+03
Median : 35 Median :1.496e+04
Mean : 771 Mean :2.616e+05
3rd Qu.: 156 3rd Qu.:6.866e+04
Max. :9856717 Max. :2.472e+09
NA's :73434 NA's :73434

Prefs Prefs.Size
Min. :1 Min. : 8083
1st Qu.:1 1st Qu.: 9895
Median :1 Median : 10963
Mean :1 Mean : 11626
3rd Qu.:1 3rd Qu.: 12839
Max. :1 Max. :412371
NA's :102756 NA's :102756

Passwords Passwords.Size
Min. : 1.00 Min. : 187
1st Qu.: 3.00 1st Qu.: 1673
Median : 12.00 Median : 6716
Mean : 54.35 Mean : 29587
3rd Qu.: 57.00 3rd Qu.: 30752
Max. :4399.00 Max. :2388249
NA's :99509 NA's :99509

Addons Addons.Size
Min. : 1.0 Min. : 251
1st Qu.: 2.0 1st Qu.: 546
Median : 3.0 Median : 1117
Mean : 6.5 Mean : 2210
3rd Qu.: 8.0 3rd Qu.: 2612
Max. :666.0 Max. :203930
NA's :164683 NA's :164683


Quantile stats for the same data set (sync-node-560):

Forms Forms.Size
25% 15.00 3405
50% 158.00 35076
75% 1304.00 276591
90% 5263.00 1072015
99% 29184.56 5786539

History History.Size
25% 84.00 45116
50% 938.00 491200
75% 3929.00 2039947
90% 8689.40 4423990
99% 23592.28 11688947

Bookmarks Bookmarks.Size
25% 14.0 5523.0
50% 35.0 14959.0
75% 156.0 68655.0
90% 563.2 253166.2
99% 5543.0 2421994.6

Prefs Prefs.Size
25% 1 9895.0
50% 1 10963.0
75% 1 12839.0
90% 1 13627.0
99% 1 15833.8

Passwords Passwords.Size
25% 3 1673.00
50% 12 6716.00
75% 57 30752.25
90% 154 84007.40
99% 508 277247.78

Addons Addons.Size
25% 2 546.00
50% 3 1117.00
75% 8 2612.00
90% 15 5120.00
99% 44 14693.22

jchaupitre commented 4 years ago

Action items from meeting on 3/23

1.a - Submit a new telemetry request to measure typical Sync usage, so we can determine what our sync user outliers are and define what "normal" use of sync is. We want to measure data usage, sync occurrences, time to sync, number of devices, whether clients are non-Firefox, and any other relevant insights (Leif is our POC) - Julien
1.b - Determine whether we capture the same usage between client-side telemetry and service-side telemetry, so we can estimate the amount of non-Firefox client usage of sync.
2 - Cost is definitely a reason to investigate abuse of sync, but we also want to determine how sync is used and abused (i.e. syncing from other browsers). What usage of sync is considered acceptable? - To be determined at a later time

  1. The Firefox client/service has a built-in quota (e.g. users get 1 MB of sync data) that was never used - could we use it here? We could start with a very large limit like 100 MB and measure how often users reach it (see the sketch after this list). - Need to confirm with Sync-client (Janet) - Julien
  2. Should we consider whitelisting clients, i.e. 3rd-party browsers, Firefox forks, etc.? This would be based on the user agent, which could be spoofed. Dependent on 1.a and 1.b - need to confirm with Services-engineering and/or Ops (Rachel) how to implement this - Julien
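
For item 1, here is a minimal sketch of what a server-side quota check could look like, assuming the 100 MB starting limit discussed above. The names are hypothetical and do not reflect the actual syncstorage-rs implementation:

```python
# Hypothetical sketch of a per-user quota check on write. The 100 MB value is
# the starting point discussed above; everything else is illustrative only and
# not the actual syncstorage-rs code.
QUOTA_BYTES = 100 * 1024 * 1024  # start large, then measure how often it is hit


class OverQuotaError(Exception):
    """Raised when a write would push a user past their storage quota."""


def check_quota(current_usage_bytes: int, incoming_payload_bytes: int) -> bool:
    """Return True if the write still fits within the user's quota."""
    return current_usage_bytes + incoming_payload_bytes <= QUOTA_BYTES


def handle_write(current_usage_bytes: int, payload: bytes) -> int:
    """Accept the write if it fits; otherwise fail loudly so the client can react."""
    if not check_quota(current_usage_bytes, len(payload)):
        raise OverQuotaError("storage quota exceeded")
    # ... persist the payload here ...
    return current_usage_bytes + len(payload)
```
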
jchaupitre commented 4 years ago

Action items update: 1.a - Submitted a new telemetry request to the Data Org: https://jira.mozilla.com/browse/DO-177

  1. Sync quota implementation - the client and service engineering teams confirmed this was NOT implemented on either side. Specifically, the client team removed that UI on Firefox Desktop a long time ago because it wasn't used (https://bugzilla.mozilla.org/show_bug.cgi?id=1180587). The client team also assumes neither the Android nor the iOS version supports this. The service team indicated they have an issue open to implement it: https://github.com/mozilla-services/syncstorage-rs/issues/120.
  2. Confirmed with Rachel that the Services-engineering team can be engaged to investigate how we could build the capability to block non-Firefox clients from using sync services.

Next update: We will follow up on the telemetry request with Data Org (not expected to be worked on immediately due to other priorities), and will assist them with setting it up once they can engage.

erkolson commented 4 years ago

While testing whole node migrations of user data from sync-py to sync-rs, I've discovered some users take an exceptionally long time to migrate (2+ hours). I ran some queries to create a distribution of rows and users. I found that though the 99th percentile user on this particular node had ~110,000 rows, the 3 biggest users had more than 9 million rows.

Migration time:

  1. mean user - ~4s
  2. 99th percentile user - 0:01:40 (100s)
  3. 9,000,000 row user - 2:06:21 (8181s)

There are many processes working in parallel to transfer the data, but these extreme outlier users make it very difficult to create optimal workload allocations for the parallel processes. For example, if we have enough processes to migrate the entire node in 2.5 hours based on total rows, average throughput, and an even allotment of users across all processes, the presence of one of these users in a process's allocation will cause that process to take closer to 5 hours.
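
To illustrate the scheduling problem, here is a rough back-of-the-envelope sketch; the throughput and row counts below are made-up example values, not measurements from the migration:

```python
# Rough illustration of how a single outlier user skews an otherwise even
# per-process allocation. Throughput and row counts are invented for the example.
ROWS_PER_SECOND = 1100  # assumed average migration throughput for one process


def process_time_hours(user_row_counts):
    """Hours for one process to migrate its allotted users serially."""
    return sum(user_row_counts) / ROWS_PER_SECOND / 3600


typical_allocation = [1_000] * 5_000                # ~5M rows of ordinary users
outlier_allocation = [1_000] * 4_999 + [9_000_000]  # same user count, one huge user

print(process_time_hours(typical_allocation))   # ~1.3 hours
print(process_time_hours(outlier_allocation))   # ~3.5 hours for the unlucky process
```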

davismtl commented 4 years ago

Some food for thought as we work on a policy: we should align with the client-side behavior, since the client doesn't sync all the data on a first sync. So even if we decide to keep 1 year of data or 100k records, that doesn't mean the client will ever use all of it.

tublitzed commented 4 years ago

I'm going to pick this one back up in light of the recently discovered spanner limitation. Update to come after reviewing options with product.

tublitzed commented 4 years ago

Adjusting the scope of this ticket to specifically track handling the spanner limitation.

jrconlin commented 4 years ago

From rbillings' load test:

11:51 AM
rbillings @jrconlin results: SUCCESSES: 1027241 | FAILURES: 252 | WORKERS: 800

This was with 0.6.0 with quota disabled.

jrconlin commented 4 years ago

~~Unfortunately, rbillings will be out tomorrow and ops is currently dealing with a higher priority issue, so load testing 0.6.0 with quota enabled will have to wait for Monday, Sept 28.~~

jrconlin commented 4 years ago

Ops was able to enable quota on storage. Load testing results:

3:35 PM rbillings SUCCESSES: 1027815 | FAILURES: 132 | WORKERS: 800 looks good!

This is in line with the prior results.

She noticed no timing differences with the quota enabled.

pjenvey commented 4 years ago

I'm seeing what is probably an unexpected failure case of the quota check in Vasilica's latest log:

https://bugzilla.mozilla.org/show_bug.cgi?id=1665917#c4

After a "successful" batch commit the sync client reports:

1601543067675   Sync.Collection DEBUG   POST success 200 https://stage.sync.nonprod.cloudops.mozgcp.net/1.5/<id>/storage/bookmarks?batch=<batch_id>&commit=true
1601543067676   Sync.Engine.Bookmarks   DEBUG   Records that will be uploaded again because the server couldn't store them: <bso_id>, <etc>

Looking at the Sync client, "Records that will be uploaded again because the server couldn't store them" seems to indicate the write was successful but included "failed" bsos.

I believe the "failed" bucket was populated here due to the quota check failing in batch append (or post_bsos). When this happens there's no indication to the client that the quota check failed (the "failed" bucket does include an error message, so the quota error might be included there but it's likely ignored), which I'd consider a bug.
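
For context, here is a sketch of the response shape involved; the ids and error strings below are made up, but the point is that the request as a whole returns 200 and the quota failure only shows up per-BSO in the "failed" map, which a client can easily ignore:

```python
# Illustrative shape of a batch-commit response in this failure case; the ids
# and error text are invented. The HTTP status is still 200, so a client that
# only checks the status and the "success" list never learns it is over quota.
response_body = {
    "modified": 1601543067.67,
    "success": ["bso-aaa", "bso-bbb"],
    "failed": {
        "bso-ccc": "over quota",  # hypothetical per-record error message
    },
}

# Such a client just queues the failed ids to be uploaded again on the next sync.
retry_queue = list(response_body["failed"])
print(retry_queue)  # ['bso-ccc']
```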

We've already planned to deprecate populating this "failed" bucket entirely -- I think we likely need to remove it now to fix this issue (I've added it to the checklist).

mhammond commented 4 years ago

While this isn't ideal, it's probably not too much of a problem TBH - the next sync will explicitly retry those failed items once. However, if they're failing due to the quota, then the clients are going to retry forever anyway. The only "bad" thing happening here that wouldn't happen if this were fixed is that Firefox will write a JSON file with the failed GUIDs.