Handle crash window where anon_id isn't updated after password reset

data-sync-user commented 4 years ago

There is a small window between a successful password reset and a successful reset of the anon_id. If something were to crash or fail during this time, the user’s password would be reset but their anon_id would not be rotated properly.

This could be considered a partial spin-off of #6017 but the scope may extend into other areas.

Originally it was suggested that we could clear the old anon_id value because at the next sign-in (or fetch of the user profile), we could then generate and set a “real” value. However, we want to maintain anon_id continuity across password resets. One solution could be to send a placeholder anon_id into the db at password reset, marked as “dirty” or to be replaced by the client as soon as possible.

The solution could look like this:

client sends a password reset request with a generated placeholder anon_id
server accepts password reset, stores placeholder anon_id in account table, and notifies data pipeline of the (old, placeholder) pair
client regenerates the real anon_id, typically immediately but if a crash happens, on fetch of next profile
client uploads the new real anon_id
server accepts the new anon_id and notifies the data pipeline of the (placeholder, new) pair. If this fails for some reason, maybe retry with exponential backoff

┆Issue is synchronized with this Jira Task ┆Issue Number: FXA-1236

LZoog commented 4 years ago

@6a68 I just assigned this to myself but after rereading the ask a couple times, I want to make sure I've got this right - we're clearing... the anon_id? We'll hit generateEcosystemAnonId from a password reset here which updates the value here so this feels mostly done already. Is this issue essentially to double check the old one is donezo?

Edit: Hmmm, rereading the title, I realize it says where it isn't updated, so maybe this an edgecase of #6017. Makes more sense, we want to make sure the old value remains the same if there's an error in anon_id creation/update - let me know if that's incorrect.

LZoog commented 4 years ago

Re: comment above - the description has been updated per Slack conversations. cc @6a68 to check my accuracy of summarizing ;)

I’m not sure how to generate a placeholder or how to know if this placeholder is not a real anon_id for checking if we need to refetch on profile load, but it might be more obvious once I dig in.

maybe retry with exponential backoff

To the last point, if we’re doing a check on profile reload each time to see if it needs to be regenerated, maybe we don’t even need an exponential backoff but just let it try again on next profile reload. I’d imagine this would be a pretty rare occurrence.

jaredhirsch commented 4 years ago

Your summary in the description looks accurate to me :+1: but this is one of those things we didn't fully work out ahead of time, and I think we missed something.

I’m not sure how to generate a placeholder

This part would be easy, just use crypto.getRandomValues. There might even be a nice wrapper in the content server's app/scripts/lib/crypto or in the new fxa-auth-client (formerly fxa-js-client, just landed).

how to know if this placeholder is not a real anon_id

Yup, this is the missing piece.

By the magic of crypto, the random placeholder and anon_id are indistinguishable. The only way to know a placeholder value is a placeholder is either to generate the anon_id and compare the two values, or to mark the account as having a stale/placeholder anon_id. (Or some third approach I haven't thought of.)

I think generating the anon_id and checking against the stored value is a nicer approach, because it doesn't introduce new bits of state, and because it reduces the window of placeholder usage, whether that is the AET bootstrapping placeholder uploaded by Sync, or the password reset placeholder uploaded by FxA. But it entails checking the anon_id against the stored value each time the user enters their password, something we just decided not to do.

On the other hand, if we want to instead put some state on the account to mark an anon_id as stale/placeholder, we'd have to either add a field to the account (mysql changes), or make a new list in redis of affected users or anon_ids, and then (handwaving here) add some API to allow the front-end to ping the server to ask if the user has a 'real' anon_id. We would also never replace the Sync-generated placeholder until a password reset, which doesn't seem ideal. This 'storing extra state' approach seems like more work and more complexity, but I could be missing something; happy to discuss further.

Now, it's possible there is some still-missing edge case here, because this is an incomplete part of the design: password resets aren't yet documented in the main technical doc for AET, and the last time @rfk and I talked about this, we talked about atomically clearing the anon_id on password reset, but not about maintaining data continuity across that flow.

I'm not sure just where in the account model we'll want to do this more eager anon_id generation and checking, but I bet @vbudhram can help.

jaredhirsch commented 4 years ago

Ah, another point I just thought of: we want to slowly enroll users in AET, or at least be able to control the rate at which the whole set of FxA users is enrolled in AET and pings start flowing in. So, when we do the anon_id comparison, I think we should just bail early if the current value is falsy.

This seems like a decision we'd want to take to the larger AET group to make sure it sounds good to everyone, and it might be something we revisit later.

jaredhirsch commented 4 years ago

I am far from certain, but it looks like this promise chain might be the right spot to add the anon_id check-and-regenerate call: https://github.com/mozilla/fxa/blob/main/packages/fxa-content-server/app/scripts/views/complete_reset_password.js#L181

jaredhirsch commented 4 years ago

Ah, nope. Looks like account model has both the resetPassword (start of reset) and completePasswordReset (end of reset) methods, that's probably the right level for anon_id twiddling (the views don't need to know about the anon_id at all)

vbudhram commented 4 years ago

we want to slowly enroll users in AET

@6a68 In train 181, the AET values will get set on password change and reset. Do we want to disable this until the sync placeholder parts in ready?

https://github.com/mozilla/fxa/blob/main/packages/fxa-content-server/app/scripts/views/complete_reset_password.js#L181

Yea that looks right too me 👍🏽

jaredhirsch commented 4 years ago

we want to slowly enroll users in AET

@6a68 In train 181, the AET values will get set on password change and reset. Do we want to disable this until the sync placeholder parts in ready?

Good question, I've been thinking about this a bit. If the existing anon_id is falsy, we can just do nothing. I think that'll solve the gradual onboarding problem.

https://github.com/mozilla/fxa/blob/main/packages/fxa-content-server/app/scripts/views/complete_reset_password.js#L181

Yea that looks right too me 👍🏽

Great! Thanks :beers:

LZoog commented 4 years ago

@6a68 In train 181, the AET values will get set on password change and reset.

This should only be in staging I believe. The data pipeline public keys are set to an empty array in production right now.

vbudhram commented 4 years ago

public keys are set to an empty array in production right now.

Ah, yup then we should be good

jaredhirsch commented 4 years ago

@dannycoates suggested we skip the placeholder and just update anon_id when we set the new password. Seems like an improvement to me 👍

LZoog commented 4 years ago

just update anon_id when we set the new password

@6a68 I'm not sure I understand what this means or Danny's comment on "It’s ok to use anon1 while the reset is in progress" - aren't we setting the password at the end of a password reset? Do we mean with PW reset we'll send up the real, new anon_id instead of a placeholder (in which case kB will change after PW reset so ? 🤔), or that we aren't going to update the anon_id during a password reset but only on regular password change when kB doesn't change so we can generate and send up in the same request?

jaredhirsch commented 4 years ago

I think the idea is, rather than:

begin password reset, replace old anon_id with placeholder
complete password reset
detect placeholder and replace with new anon_id

Danny's suggestion is to just

start password reset, but don't unset anon_id
complete password reset, and explicitly recalculate anon_id when the new password is entered

A little more explicitly, I think this means the only time we'd generate a new anon_id would be inside the reset completion step, somewhere in this promise chain https://github.com/mozilla/fxa/blob/main/packages/fxa-content-server/app/scripts/views/complete_reset_password.js#L181

dannycoates commented 4 years ago

Forgive me for the drive-by engineering, I'm working on Send at the moment and not keeping up with the convo, but yeah, Jared's got the gist. Ideally, (not necessarily now) I think it'd be best to push the eco(Anon)Id handling all the way down into fxa-auth-client. With resetPasswordWithRecoveryKey and passwordChange being the main functions where we'd trigger calculating new ones.

LZoog commented 4 years ago

Forgive me for the drive-by engineering

Nothing to forgive, that's one reason the team swarming approach can be great. This is a much simpler approach so I appreciate and thank you for seeing this and dropping in before I'd started. 😁

data-sync-user commented 4 years ago

➤ Jared Hirsch commented:

After team discussion, this doesn't seem essential for the scoped keys epic. Moving out to a future placeholder epic.

LZoog commented 4 years ago

In a discussion earlier today, we determined this issue should be punted out to another epic where we tackle frequent AET identifier rotations.

I started moving the call to update the anon_id into fxa-auth-client where we send the request for password reset/change, but this has a couple of issues:

We’d need to communicate back to the account model the change occurred and set the ID
While it may slightly decrease the crash window, it doesn't really address the problem because the requests are still separate so the password could be updated while the anon_id could still potentially fail to update

It may make sense later to move all the handling into fxa-auth-client, but we also like the separation keeping it in the account model gives us and that the account model is already setup to work with localstorage. This is more of an issue with password reset than it is with changePassword due to a new kB, but at this point in time, the crash window appears to be a non-issue because of how infrequently we are rotating and how small the window is.

The longer term solution for this when we rotate keys opportunistically when users enter their passwords could be to add the anon_id to at least the the PW reset request so that there is only one network request instead of two.

It was also discussed to make sure we send the correct headers when we send the update from the client so that when we do update we can ensure there’s not two clients that disagree on the value, but this doesn’t matter right now because the clients can keep using the anon_id and AET will still unwrap them to the same eco ID.

We discussed adding additional logging but decided it was unnecessary since we log successful password changes/resets and when the anon_id is updated.

data-sync-user commented 3 years ago

➤ Jody Heavener commented:

AET is no longer on our roadmap.

mozilla / fxa

Handle crash window where anon_id isn't updated after password reset #4329