mozilla-services / services-engineering

Services engineering core repo - Used for issues/docs/etc that don't obviously belong in another repo.

Verify that user data is not completely re-synced post migration. #45

Open tublitzed opened 4 years ago

tublitzed commented 4 years ago

Plan here is as follows:

tublitzed commented 4 years ago

Just a note here for completeness - Erik moved my original sync user over to this test node here. I now have a new sync id for this test: 141727562.

Just verified that all content has been synced up to new node, logs live here.

tublitzed commented 4 years ago

New sync id is 142084482. Unfortunately, the news is not good: we are in fact re-syncing everything. Logs live here.

rfk commented 4 years ago

1586197410063   Sync.Resource   DEBUG   GET fail 404 https://sync-1-us-west1-g.sync.services.mozilla.com/1.5/142084482/storage/meta/global
1586197410063   Sync.Resource   WARN    GET request to https://sync-1-us-west1-g.sync.services.mozilla.com/1.5/142084482/storage/meta/global failed with status 404

Your client fails to find the meta/global record on the server, which would definitely be enough to trigger a full sync. If meta/global is missing then no other data on the server can be trusted to be in a good or consistent state.
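To illustrate the decision described above, here's a hypothetical sketch of the client-side logic (the function and return values are illustrative, not the actual Firefox Sync implementation): a missing meta/global means the node is treated as uninitialized, forcing a full re-sync.

```python
# Hypothetical sketch: a 404 on meta/global forces a fresh start,
# since no other data on the node can be trusted to be consistent.
def handle_meta_global(status_code: int) -> str:
    if status_code == 404:
        # No meta/global: treat the node as uninitialized and
        # schedule a full (re)upload of all local data.
        return "full_sync"
    # meta/global present: compare syncIDs etc. and sync incrementally.
    return "incremental_sync"

print(handle_meta_global(404))  # full_sync
```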

rfk commented 4 years ago

Can we confirm from server-side migration logs whether the record 142084482/storage/meta/global was successfully copied over from the previous node?

rfk commented 4 years ago

My first theory here is that perhaps the migration script is calculating the (fxa_uid, fxa_kid) pair in a different way than tokenserver will after the migration. So e.g. we might copy all their data over to spanner keyed by (fxa_uid, fxa_kid1), but then when the client sees the node migration event and hits tokenserver to get a new token, it ends up looking for data under (fxa_uid, fxa_kid2), making the node appear empty.
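As a toy illustration of that theory (the derivation functions and values below are invented for the example, not the actual migration script or tokenserver code): if the two sides derive fxa_kid from different inputs, the copied rows become invisible to the client.

```python
# Illustrative sketch of the (fxa_uid, fxa_kid) mismatch theory.
# Two plausible-but-different derivations of fxa_kid for the same user:
def kid_from_generation(generation: int, client_state: str) -> str:
    return f"{generation}-{client_state}"

def kid_from_keys_changed_at(keys_changed_at: int, client_state: str) -> str:
    return f"{keys_changed_at}-{client_state}"

# Migration script keys the copied data one way...
migrated_under = kid_from_generation(1234, "abcdef")
# ...but tokenserver hands the client a different kid after migration:
looked_up_as = kid_from_keys_changed_at(1200, "abcdef")

# The kids disagree, so the client's lookups miss the migrated rows
# and the node appears empty.
print(migrated_under != looked_up_as)  # True
```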

erkolson commented 4 years ago

The client got a 404 when requesting the storage/meta/global. Here are the first 5 requests to the server:

timestamp,request,request_length,bytes_sent,status
2020-04-06 18:23:30.806275 UTC,GET /1.5/142084482/info/collections HTTP/1.1,1134.0,2.0,200
2020-04-06 18:23:30.895814 UTC,GET /1.5/142084482/info/configuration HTTP/1.1,1138.0,165.0,200
2020-04-06 18:23:30.985694 UTC,GET /1.5/142084482/storage/meta/global HTTP/1.1,1139.0,1.0,404
2020-04-06 18:23:31.077635 UTC,GET /1.5/142084482/storage/meta/global HTTP/1.1,1138.0,1.0,404
2020-04-06 18:23:31.195519 UTC,PUT /1.5/142084482/storage/meta/global HTTP/1.1,1333.0,13.0,200
pjenvey commented 4 years ago

thanks @rfk. We've spotted some issues w/ migrate_node's individual user migration argument. We'll be testing a new user against the fixed version.

Erik's moved this new user (142188710) to this test node.

Restored the med.json & Sync'd, logs live here

pjenvey commented 4 years ago

Erik migrated the user to 142198989/durable sync and made the Python node 401.

After the 401 error log, the client appears to have sync'd to the new durable sync node without a full reupload! Log

EDIT: It did reupload its Tabs (it originally had 1 tab, and reuploaded).

pjenvey commented 4 years ago

Though I notice one oddity, the Tabs collection in about:sync fails to load, reporting an Error:

Error: HMACMismatch: Record SHA256 HMAC mismatch: should be 9946063866ac59e0006d0abccb0d9b37ec3216fe57eab97fc5cc0a1d0480bbfd, is ed307f5234e611eb05431ff6a979e2948f25b541595c46fb3f41182257140d36

Restarting the browser led to a sync error w/ the same HMACMismatch: Log

pjenvey commented 4 years ago

Tabs are currently set to be solely stored in memcached on the server (cache_only_collections = tabs), so it makes sense that they would have been reuploaded. But this doesn't explain the HMACMismatch.
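For reference, that setting would appear in the server config roughly like this (the section name is an assumption; only the `cache_only_collections = tabs` line comes from this thread):

```ini
# Hypothetical excerpt of the storage server config.
# Collections listed here live only in memcached and are never
# persisted to the backing store, so they are lost on migration.
[storage]
cache_only_collections = tabs
```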

rfk commented 4 years ago

Hrm, the hmac mismatch is very strange indeed. If you use the about:sync addon, can you inspect the state of the "tabs" collection to make sure it only has a single record with id "CB3DaSRC1v6F", which is the one uploaded by your client?

rfk commented 4 years ago

I also wonder if we're seeing the HMAC error on the tabs collection because that's the only one for which we're trying to download any records - all other collections are (correctly) treated as not having any remote changes. Trying to add a second client to this cluster might result in additional HMAC errors being surfaced.

rfk commented 4 years ago

can you inspect the state of the "tabs" collection to make sure it only has a single record with id "CB3DaSRC1v6F", which is the one uploaded by your client?

Actually, from the logs, I wonder if there is somehow a second item in the "tabs" collection. The logs around the HMAC error say this:

1586297619302   Sync.Engine.Tabs    DEBUG   First sync, uploading all items
1586297619302   Sync.Engine.Tabs    INFO    1 outgoing items pre-reconciliation
1586297619302   Sync.Engine.Tabs    TRACE   Downloading & applying server changes
1586297619302   FirefoxAccounts TRACE   not checking freshness of profile as it remains recent
1586297619302   Sync.BrowserIDManager   TRACE   _ensureValidToken already has one
1586297619368   Sync.Collection DEBUG   GET success 200 https://sync-1-us-west1-g.sync.services.mozilla.com/1.5/142198989/storage/tabs?full=1&limit=1000
1586297619369   Sync.Engine.Tabs    TRACE   Ignoring incoming tab item because of its id: CB3DaSRC1v6F
1586297619369   Sync.Engine.Tabs    TRACE   Skipping reconciled incoming item CB3DaSRC1v6F

So it skips the incoming tab item for itself, which is good.

1586297619369   Sync.Service    INFO    Bad HMAC event detected. Attempting recovery or signaling to other clients.
[... crypto/keys refetching blah-blah goes here ...]
1586297619511   Sync.Engine.Tabs    INFO    Records: 0 applied, 0 successfully, 1 failed to apply, 1 newly failed to apply, 1 reconciled.

But here we see that it tried to apply a record, and failed to do so due to the HMAC error. Does that mean that there's a second BSO in the tabs collection, of unknown origin?

To check my understanding: we expect that the migration script will not populate any items into the "tabs" collection, because that collection does not exist in MySQL on the storage nodes. Is that accurate?

pjenvey commented 4 years ago

If you use the about:sync addon, can you inspect the state of the "tabs" collection to make sure it only has a single record with id "CB3DaSRC1v6F", which is the one uploaded by your client?

about:sync isn't helpful here, it reports the HMACMismatch error and is stuck in the "Fetching records..." state for tabs, even after dismissing the error message.

rfk commented 4 years ago

Ugh, OK. Any chance you'd be willing to hit the server API directly and pull down the record(s) for inspection? :-)

pjenvey commented 4 years ago

Ugh, OK. Any chance you'd be willing to hit the server API directly and pull down the record(s) for inspection? :-)

Someone would need to help me figure out how to get the appropriate Authorization header needed to query the backend. However, I was luckily able to grab this count from about:sync's logging of the /info/collection_counts response. Surprisingly, it reported tabs: 4.

Is it possible we have stale tabs data from mysql? I'm not familiar w/ the cache_only_collections implementation, does it ever backup any of its data to mysql?

I'm also wondering about the ramifications of migrated user data including entries for tabs in meta/global and, I think, /info/collections as well.
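The surprising count above can be sanity-checked against the logged response body; a sketch (the JSON payload here is illustrative, shaped like an /info/collection_counts response):

```python
import json

# Example /info/collection_counts payload, as logged by about:sync.
# The non-tabs counts are made up for illustration.
response_body = '{"tabs": 4, "bookmarks": 12, "history": 300}'
counts = json.loads(response_body)

# A freshly migrated node should hold at most 1 tab record (the one
# this client just uploaded), so a higher count points at stale data.
print(counts["tabs"])  # 4
```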

pjenvey commented 4 years ago

Someone would need to help me figure out how to get the appropriate Authorization header needed to query the backend.

Never mind, I think I've got the hang of this..

rfk commented 4 years ago

Is it possible we have stale tabs data from mysql? I'm not familiar w/ the cache_only_collections implementation, does it ever backup any of its data to mysql?

I don't think it will ever be written back to mysql. Can we look in the source tables on the old node to see if there were any there?

pjenvey commented 4 years ago

@rfk thanks for the hints

Need to confirm some of these details w/ ops but it seems this was due to stale tabs data in spanner.

For this migration test, the client was a new account:

1) New accounts default to a Spanner node.
2) Then it was manually reassigned to a legacy Python node.
3) The test client would sync data there, then we'd manually migrate it from legacy back to Spanner. No tabs data would migrate since tabs are only in memcached.
4) Then it was manually reassigned back to Spanner.

If a client "Signs Out" of FxA/Sync while on the Python node, clients and forms collections perform a kind of "reset" on the client side and the accompanying records for the device are DELETEd on the server. In some cases we were doing a "Sign Out" to force the client to immediately catch the reassignment (restarting the browser seems to be another way to force it).

We had a couple problems here:

So stale tabs data was left around, and I think combined w/ the "reset" -- this "reset" must change some state of the tabs collection used for HMAC validation. Does that make sense?

Another note: though we won't migrate tabs data, we'll migrate an entry for it in meta/global. Do you foresee any issues w/ that (we'll double-check the behavior around this w/ more test runs)?

rfk commented 4 years ago

Aha, yes I think this would explain it!

At step (1) the client would have created new bulk storage keys in /crypto/keys on the spanner node, and uploaded data (including tabs) encrypted under those keys k1.

At step (2) the client would have seen that it had been re-assigned to a new, empty storage node, and would have re-initialized the sync data including /crypto/keys, then re-uploaded its sync data encrypted under those keys k2.

At step (3) you copy the new /crypto/keys over to spanner, and the client happily continues reading and writing data using the same bulk storage keys k2.

Except for the "tabs" collection, where it finds the old data that was encrypted using the old storage keys k1, which it's not able to decrypt and which produces the HMAC verification error.
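The key mismatch described above can be sketched concretely (a toy model: Sync's real record format and key derivation differ, but the failure mode is the same - an HMAC computed under k1 never verifies under k2):

```python
import hashlib
import hmac
import os

# Toy model of the k1/k2 mismatch. Key names are illustrative.
k1 = os.urandom(32)  # bulk keys the tabs records were uploaded under
k2 = os.urandom(32)  # bulk keys the client holds after re-initializing

ciphertext = b"...encrypted tab record..."

# HMAC stored alongside the record at upload time (under k1):
stored_hmac = hmac.new(k1, ciphertext, hashlib.sha256).hexdigest()

# The client recomputes the HMAC with its current keys (k2):
expected_hmac = hmac.new(k2, ciphertext, hashlib.sha256).hexdigest()

# The two never match, so the record fails verification.
print(stored_hmac != expected_hmac)  # True -> HMACMismatch
```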

I'm not sure what effect the explicit signing out would have had - perhaps if you get a node re-assignment while you're signed in, Firefox is able to re-use the previous keys from /crypto/keys, but if you're not signed in then it has to create some new ones.

Regardless, I think this is a plausible explanation. Thanks for taking the time to dig into the details here!

tublitzed commented 4 years ago

Re-opening this one as it looks like we may have hit the issue again in yesterday's client testing.

pjenvey commented 4 years ago

Logs for my last round of successful attempts (on stage):

pjenvey-stage-migrate1: Did a forceful "Sign out" of Sync to force a lookup back to the Spanner node after its migration
pjenvey-stage-migrate2
pjenvey-stage-migrate3

All 3 of these test accounts migrated successfully.