pypi / warehouse

The Python Package Index
https://pypi.org
Apache License 2.0

Get ~all users to verify email addresses #3632

Closed dstufft closed 6 years ago

dstufft commented 6 years ago

We have a problem with a bit of our data, namely that for historical reasons we have a fair number of users in the database that do not have a verified primary email address. The side effect of this is that we're currently sending emails to email addresses that we have not had verified. This is a bad situation to be in, because in order to keep our bounce/spam rate low, we should be confirming all email addresses before sending email to them. In addition, our bounce handling code works by un-verifying the email address; the intent was to stop sending email to it until the user has reverified their email address.

In total there are about 193k user accounts with an unverified email address for their primary address, and 44k that do have a verified email address for their primary address.

So we need to come up with a strategy to resolve this, because it's pretty important that we don't send email to unverified addresses.

Here's what I've come up with, but I'd like to see what other people think as well.

For background, the way activation worked on legacy PyPI was that when you registered, it added a one-time token (OTK) to a separate table that stored (username, OTK, datetime). When you verified your email with PyPI it would delete the entry from this other table, so effectively this table acts as a list of user accounts that legacy PyPI registered, but who never activated their account via legacy PyPI.

So that means we have accounts in 3 possible states:

The first state is the happy state, and we currently have 44k accounts in that state. Looking at the OTK table, there are currently ~135k rows, if we assume that 100% of them are for accounts that did not end up verifying via Warehouse instead, that means that we have 135k accounts in the second state, and ~58k accounts in the third state. Just to correlate this, we also have ~135k users who are not in the is_active state.
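
A minimal sketch of that bookkeeping, assuming the three states are distinguished by whether the primary email is verified and whether the account still has an OTK row (this reading is inferred from the paragraph above; the enum and function names are hypothetical, not Warehouse's models):

```python
from enum import Enum


class AccountState(Enum):
    VERIFIED = "verified primary email"
    NEVER_ACTIVATED = "unverified, still has a legacy OTK row"
    ACTIVATED_UNVERIFIED = "unverified, no legacy OTK row"


def classify(primary_verified: bool, has_otk_row: bool) -> AccountState:
    """Bucket an account using the two signals described above (illustrative only)."""
    if primary_verified:
        return AccountState.VERIFIED
    if has_otk_row:
        return AccountState.NEVER_ACTIVATED
    return AccountState.ACTIVATED_UNVERIFIED


# Approximate figures quoted above, in thousands of accounts.
unverified_total = 193
otk_rows = 135
third_state = unverified_total - otk_rows  # what is left for the third state
```

The subtraction only holds under the stated assumption that every OTK row belongs to an account that has not since verified via Warehouse.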

Thus my plan of action is:

The end result then is that through (1) and (2) people are heavily incentivized to keep a working, verified email address hooked up to their account, through (3) we hopefully prompt some number of people to look at their accounts and verify, through (4) we reduce the size of the affected accounts considerably, and through (5) we give accounts one last notification to verify their email address.

I believe that once we get to (3), we should disable sending emails to unverified addresses (except for the email sent in (5)).

A few open questions left that I'm not sure of:

  1. Once we disable sending emails to unverified addresses, what emails should still be sent? Off hand I can think of:
    • Email verification email (this one is obvious)
    • MAYBE Password reset email? I'm not sure about this one, certainly we should allow it until (5) above is complete, but once that is complete I'm not sure! It's something that would only occur if a user is trying to reset a password for an account, but if they haven't verified their email address it is an avenue for malicious users to spam someone else with our system [1].
  2. There are about 73 users whose primary email address is unverified, but who have added a verified alternative email address. Do we want to do anything special with these users like automatically promote their verified email to primary? Or should we just let them work through the above plan naturally?
  3. Similar to the above, do we want to do anything special if a user's email address gets unverified due to delivery issues/spam complaint and they have other verified emails on their account?
    • I think certainly if they marked one of our emails as spam we shouldn't then pick another email address they had previously given us and start sending to that address instead. A spam complaint is a pretty heavy-handed signal to stop sending them email.
    • I think that perhaps if we un-verify their primary email address, it wouldn't be unreasonable to send an email to an alternative email address to tell them we did. I'm not sure though, and if we do how do we pick which verified address to send to if they have multiple? Or would we send to all of them?

[1] Of course the email verification email is also such an email, but ideally that email should be adjusted to include some verbiage about how to contact the administrators if they're getting those emails and we can blacklist their email address from being used? If we do that, perhaps something automated too that would allow users to stop these emails from being sent to them by clicking on a link and confirming it?
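
One way to picture the first open question is as an allowlist of email kinds that may still go to an unverified address. The sketch below is only an illustration: the kind names and the `allow_password_reset` toggle are hypothetical and do not reflect any decision made in this thread.

```python
# Hypothetical email kinds; these names are illustrative, not Warehouse's.
ALWAYS_ALLOWED = {"email_verification"}
# Password resets are the open question above, so they are modelled as a toggle.
MAYBE_ALLOWED = {"password_reset"}


def may_send(kind: str, address_verified: bool, allow_password_reset: bool = True) -> bool:
    """Return True if an email of this kind may be sent to the address."""
    if address_verified:
        return True
    if kind in ALWAYS_ALLOWED:
        return True
    return allow_password_reset and kind in MAYBE_ALLOWED
```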

reaperhulk commented 6 years ago

This issue made me check my own account, where I found out I was not verified. However, I rarely log into pypi except via CLI tooling like twine to upload packages. I have no idea if I'm a typical user, but ideally there would be some way to communicate the need to confirm an email address to users using that path as well.

dstufft commented 6 years ago

@reaperhulk Yea, the step (2) would basically do that, although via making twine upload fail until you verified rather than by printing a nice message.

dstufft commented 6 years ago

To be clear, you'd get an error message that told you why it failed, but it wouldn't be "oh it worked, but also here's a thing you should do".

di commented 6 years ago

Start displaying a flash-message like warning at the top of every page load for logged in users without a verified primary email address with a call to action to get a verified email address as their primary email address.

Expand the limitations of not having a verified, primary address so that you cannot do much in the ways of project management without it. What exactly should be limited is on the table, but I think uploads in general should require a valid, verified email, and likely so should other actions like deletions, managing contributors, etc.

I think it'd make sense to just immediately redirect to a "verify your email address" screen after a successful login and prevent the user from doing anything in the UI until they do, skipping a flash message entirely (as well as preventing uploads like we do new project registrations at the moment).

Take the other 58k people, and start slowly sending emails to them asking them to verify the email address on file. Tell them that unless they verify their address, this will be the last email they get from us. Assuming steps 1-4 don't reduce the 58k number, if we sent to, say, 200 people a day, we'd be looking at processing the backlog in 8-9 months.

I think this is not unreasonable, though obviously it'd be better to get as many users verified as possible before resorting to this.

MAYBE Password reset email? I'm not sure about this one, certainly we should allow it until (5) above is complete, but once that is complete I'm not sure! It's something that would only occur if a user is trying to reset a password for an account, but if they haven't verified their email address it is an avenue for malicious users to spam someone else with our system [1].

Definitely should allow this until we get to (5). Afterwards probably as well? If a user has a single verified email, then accidentally marks it as spam (thus getting it unverified), and they've forgotten their password, they've essentially permanently locked themselves out of their account.

There are about 73 users whose primary email address is unverified, but who have added a verified alternative email address. Do we want to do anything special with these users like automatically promote their verified email to primary? Or should we just let them work through the above plan naturally?

I think they should just go through the regular flow, although I'm curious why they exist because (at least in Warehouse) you can't make a non-verified email your primary address. Perhaps worth looking into more...

Similar to the above, do we want to do anything special if a user's email address gets unverified due to delivery issues/spam complaint and they have other verified emails on their account?

Thinking on this again, I agree that just making it unverified should suffice. I think all the locks that come with not having a verified primary email should get activated, and they'll need to re-verify that address or manually switch primary emails to unlock.

I think certainly if they marked one of our emails as spam we shouldn't then pick another email address they had previously given us and start sending to that address instead. A spam complaint is a pretty heavy-handed signal to stop sending them email.

I think that perhaps if we un-verify their primary email address, it wouldn't be unreasonable to send an email to an alternative email address to tell them we did. I'm not sure though, and if we do how do we pick which verified address to send to if they have multiple? Or would we send to all of them?

I think we probably should never attempt to contact non-primary email addresses. If they get locked out of uploading, etc. due to not having a verified primary email, they should get pretty obvious messages why they can't do what they're trying to do.
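
A minimal sketch of the behaviour described in this comment, assuming a simple in-memory model: a delivery failure or spam complaint un-verifies the address and records why, and the account stays locked until it again has a verified primary email. The class, field, and function names are illustrative rather than Warehouse's actual code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Email:
    address: str
    primary: bool
    verified: bool
    unverify_reason: Optional[str] = None


def handle_delivery_failure(email: Email, reason: str) -> None:
    """Un-verify the address and record why (for example a bounce or a spam complaint)."""
    email.verified = False
    email.unverify_reason = reason


def can_manage_projects(emails: List[Email]) -> bool:
    """The locks apply unless the account has a verified primary email."""
    return any(e.primary and e.verified for e in emails)
```

Note that a verified secondary address does not lift the lock in this sketch; the user has to re-verify the primary address or switch primaries manually, as suggested above.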

dstufft commented 6 years ago

MAYBE Password reset email? I'm not sure about this one, certainly we should allow it until (5) above is complete, but once that is complete I'm not sure! It's something that would only occur if a user is trying to reset a password for an account, but if they haven't verified their email address it is an avenue for malicious users to spam someone else with our system [1].

Definitely should allow this until we get to (5). Afterwards probably as well? If a user has a single verified email, then accidentally marks it as spam (thus getting it unverified), and they've forgotten their password, they've essentially permanently locked themselves out of their account.

Yea, they would have perma locked themselves out, but they could reach out to us and we could manually verify an email for them in the worst case.

dstufft commented 6 years ago

I think they should just go through the regular flow, although I'm curious why they exist because (at least in Warehouse) you can't make a non-verified email your primary address. Perhaps worth looking into more...

If you're one of the people who already have an unverified email as your primary address, and you add a verified email but you don't make it your primary.

di commented 6 years ago

If you're one of the people who already have an unverified email as your primary address, and you add a verified email but you don't make it your primary.

Seems like folks might be missing the fact that adding a new email and verifying it does not make it your primary. Maybe we should do this automatically if the primary is unverified?
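
The automatic promotion floated here could look roughly like the following. This is a sketch under assumptions, not Warehouse's implementation: the `Email` class and `promote_on_verify` helper are hypothetical, and a verified primary is deliberately left untouched.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Email:
    address: str
    primary: bool = False
    verified: bool = False


def promote_on_verify(emails: List[Email], newly_verified: Email) -> None:
    """Mark an address verified; make it primary only if the current primary is unverified."""
    newly_verified.verified = True
    current = next((e for e in emails if e.primary), None)
    if current is None or not current.verified:
        if current is not None:
            current.primary = False
        newly_verified.primary = True
```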

brainwane commented 6 years ago

When should we start this whole process?

Start a campaign of blogs, tweets, mailing list posts, etc to ask users to verify their email addresses with PyPI.

PyCon would be a great time for us to spread this message. That's now under a month away.

dstufft commented 6 years ago

There's no technical reason to start messaging on a particular date, so really whenever folks think is a good time is fine.

brainwane commented 6 years ago

We discussed this issue in our weekly meeting last week. It sounds like a call-to-action about this won't necessarily fit in Dustin's talk at PyCon.

Ernest noted that we had excellent results when we asked people to verify before releasing a new package -- a little confusion, but no grouchiness. He suggested that we might want to go back, summarize how those responses went in issues, and improve our messaging before doing further publicity.

rspeer commented 6 years ago

May I suggest that you should increase the expiration time on the verification e-mail, which would probably increase the rate of people successfully completing the process?

There are things in my e-mail that are important enough that they have to be dealt with in 6 hours. PyPI is not one of them.

KOLANICH commented 6 years ago

I suggest not to verify email addresses, but just stop collecting them and implement signon and signup without any email verification, phone account verification, bank card verification, ID verification, fingerprint verification or DNA profile verification or anything of this kind of shit. Email is overcentralised and we must get rid of it. If tomorrow the email provider closed my access to email I would have lost all my accounts and all my online identity. So I prefer my accounts not to be bound to email. Just use a crypto key for both signup, signon and packages signing.

The end result then is that through (1) and (2) people are heavily incentivized to keep a working, verified email address hooked up to their account

I guess it heavily incentivizes not to use this service at all.

dstufft commented 6 years ago

https://github.com/pypa/warehouse/pull/4292 implements (2) of the published plan. I think that might be enough of a restriction in and of itself: uploading a package is the primary thing people with "important" user accounts tend to do with their PyPI account, so it's going to act as a pretty strong forcing function. Additionally, trying to turn UI items into errors is a lot harder than an API item, and the red banner at the top already acts as a guide to get people to verify their email.
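
In spirit, the restriction amounts to a guard like the one below. This is a hedged illustration only: the exception type, function name, and message text are made up here and are not taken from the pull request.

```python
class UploadForbidden(Exception):
    """Raised when an upload is rejected (hypothetical exception type)."""


def check_upload_allowed(username: str, primary_email_verified: bool) -> None:
    """Reject the upload with an explanatory message unless the primary email is verified."""
    if not primary_email_verified:
        raise UploadForbidden(
            f"User {username!r} does not have a verified primary email address. "
            "Verify an email address on the account before uploading."
        )
```

The point of raising with a message, rather than failing silently, is that CLI users who never see the web banner still learn why the upload failed.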

fungi commented 6 years ago

Announcing solely via a banner on a WebUI assumes people use the WebUI with those accounts. It caught us by surprise since the account we use in our release automation (which ~nobody ever logs into the WebUI with) started getting its uploads rejected. The followup announcement to distutils-sig today was helpful, but would have been more helpful in advance of landing #4292.

Regardless, thanks for working on this--it's a great improvement!

di commented 6 years ago

@fungi There should have also been an error message when the upload failed, was that not shown in your automation logs?

fungi commented 6 years ago

Yep, the error message worked fine but turned it into a reactive scenario rather than a proactive one. In our case the people driving the release automation aren't the same as the people with access to the credentials and inbox for the account used by said automation, so release management activities had to be paused while a solution was coordinated with systems administrators.

On a positive note, this has made it apparent to me that our project should use a different E-mail address for that PyPI account than the one which also catches the massive backscatter from our code review system. ;)

brainwane commented 6 years ago

@fungi Sorry for the late notice! I agree that we should have spread the word more.

Per @ewdurbin :+1:

Ernest noted that we had excellent results when we asked people to verify before releasing a new package -- a little confusion, but no grouchiness. He suggested that we might want to go back, summarize how those responses went in issues, and improve our messaging before doing further publicity.

Anyone have any pointers to some of those responses? When did we make this change?

dstufft commented 6 years ago

5 days ago we moved from only blocking on new projects, to blocking on any attempt to upload anything. The banner warning at the top started on April 15, and I think that blocking on new projects happened prior to this ticket even being opened.

fungi commented 6 years ago

Also, to be clear, the account where this took us by surprise historically was not used to register/create new projects, but instead gets added to the access controls for projects created by other accounts. Perhaps a bit of a corner case, but explains why we wouldn't have noticed without explicit announcement.

dmerejkowsky commented 6 years ago

For point 3) I've taken the liberty of writing a short article on my blog and dev.to.

Hope this helps!

dstufft commented 6 years ago

Ok, I've done some digging here, and my estimations before were off I think (trying to remember how I arrived at the numbers above, and the best I can think of is I estimated poorly). So here are the new numbers:

Number of Users: 232131

```
SELECT COUNT(DISTINCT accounts_user.id)
FROM accounts_user, accounts_email
WHERE accounts_user.id = accounts_email.user_id
  AND (accounts_user.date_joined < date '2018-02-18' OR accounts_user.date_joined IS NULL)
  AND accounts_email.primary = 't'
  AND accounts_email.unverify_reason IS NULL
```

Number of Users with Verified Primary Email Address: 45702

```
SELECT COUNT(DISTINCT accounts_user.id)
FROM accounts_user, accounts_email
WHERE accounts_user.id = accounts_email.user_id
  AND (accounts_user.date_joined < date '2018-02-18' OR accounts_user.date_joined IS NULL)
  AND accounts_email.primary = 't'
  AND accounts_email.verified = 't'
```

Number of Users with Unverified Primary Email Address: 186429

```
SELECT COUNT(DISTINCT accounts_user.id)
FROM accounts_user, accounts_email
WHERE accounts_user.id = accounts_email.user_id
  AND (accounts_user.date_joined < date '2018-02-18' OR accounts_user.date_joined IS NULL)
  AND accounts_email.primary = 't'
  AND accounts_email.verified = 'f'
  AND accounts_email.unverify_reason IS NULL
```

Number of Users with Unverified Primary Email Address NOT IN OTK table: 86187

```
SELECT COUNT(DISTINCT accounts_user.id)
FROM accounts_user, accounts_email
WHERE accounts_user.id = accounts_email.user_id
  AND (accounts_user.date_joined < date '2018-02-18' OR accounts_user.date_joined IS NULL)
  AND accounts_email.primary = 't'
  AND accounts_email.verified = 'f'
  AND accounts_email.unverify_reason IS NULL
  AND NOT EXISTS (
    SELECT 1 FROM rego_otk WHERE rego_otk.name = accounts_user.username
  )
```

Number of Users with Unverified Primary Email Address IN OTK table: 100242

```
SELECT COUNT(DISTINCT accounts_user.id)
FROM accounts_user, accounts_email
WHERE accounts_user.id = accounts_email.user_id
  AND (accounts_user.date_joined < date '2018-02-18' OR accounts_user.date_joined IS NULL)
  AND accounts_email.primary = 't'
  AND accounts_email.verified = 'f'
  AND accounts_email.unverify_reason IS NULL
  AND EXISTS (
    SELECT 1 FROM rego_otk WHERE rego_otk.name = accounts_user.username
  )
```
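
To see how the `EXISTS` / `NOT EXISTS` pair splits the unverified users, here is a small self-contained sketch that runs the same shape of query against toy data in SQLite. It is an approximation under stated assumptions: the rows are invented, the `date_joined` filter is omitted, booleans are stored as 0/1, and `"primary"` is quoted because it is a keyword in SQLite. The production queries above ran against PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE accounts_user (id INTEGER PRIMARY KEY, username TEXT);
    CREATE TABLE accounts_email (
        user_id INTEGER, "primary" INTEGER, verified INTEGER, unverify_reason TEXT
    );
    CREATE TABLE rego_otk (name TEXT);

    -- Toy rows: alice is verified, bob and carol are not; only bob has an OTK row.
    INSERT INTO accounts_user VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    INSERT INTO accounts_email VALUES (1, 1, 1, NULL), (2, 1, 0, NULL), (3, 1, 0, NULL);
    INSERT INTO rego_otk VALUES ('bob');
    """
)

UNVERIFIED = """
    SELECT COUNT(DISTINCT accounts_user.id)
    FROM accounts_user, accounts_email
    WHERE accounts_user.id = accounts_email.user_id
      AND accounts_email."primary" = 1
      AND accounts_email.verified = 0
      AND accounts_email.unverify_reason IS NULL
      AND {maybe_not} EXISTS (
          SELECT 1 FROM rego_otk WHERE rego_otk.name = accounts_user.username
      )
"""

in_otk = conn.execute(UNVERIFIED.format(maybe_not="")).fetchone()[0]
not_in_otk = conn.execute(UNVERIFIED.format(maybe_not="NOT")).fetchone()[0]

# The reported counts partition the same way: the two OTK buckets add up to the
# unverified total, and verified plus unverified add up to all users counted.
assert 86187 + 100242 == 186429
assert 45702 + 186429 == 232131
```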

A few important notes:

So my new numbers greatly expand upon the number of users that would be emailed in step 5 above, to the point that I'm concerned about the number of emails we would have to send. So I've been thinking about how we can modify the plan above, and I think that with two modifications, we should be back on track:

The above changes eliminate the need to send email to someone who isn't currently capable of managing a project, which is expanded out from anyone who is currently uploading to a project. It also limits sending emails strictly to people who actually have projects under their control, which are the people we truly care about having an email address on file with anyways (all of our notifications have to do with project administration, currently at least).

All in all, these changes would mean that instead of sending an email to 86,187 people, we are instead going to be sending an email to 37,838 which brings our backlog down to 6-7 months instead of 8-9 months. This change would also catch more projects (if we had implemented it first before the upload restrictions, it would have caught the Openstack example above, assuming they had a new project at any time).
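
The backlog estimate is simple division. As a worked sketch, assuming the 200-emails-per-day rate floated earlier in the thread and a 30-day month (the helper names are illustrative):

```python
import math


def backlog_days(recipients: int, per_day: int = 200) -> int:
    """Days needed to email everyone once at a fixed daily rate."""
    return math.ceil(recipients / per_day)


def backlog_months(recipients: int, per_day: int = 200, days_per_month: float = 30.0) -> float:
    """The same backlog expressed in approximate 30-day months."""
    return backlog_days(recipients, per_day) / days_per_month


days = backlog_days(37_838)      # 190 days
months = backlog_months(37_838)  # a little over 6 months
```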

Thoughts?

Query for the ~37k Users

```
SELECT COUNT(DISTINCT accounts_user.id)
FROM accounts_user, accounts_email
WHERE accounts_user.id = accounts_email.user_id
  AND (accounts_user.date_joined < date '2018-02-18' OR accounts_user.date_joined IS NULL)
  AND accounts_email.primary = 't'
  AND accounts_email.verified = 'f'
  AND accounts_email.unverify_reason IS NULL
  AND NOT EXISTS (
    SELECT 1 FROM rego_otk WHERE rego_otk.name = accounts_user.username
  )
  AND EXISTS (
    SELECT 1 FROM roles WHERE roles.user_name = accounts_user.username
  )
```

fungi commented 6 years ago

Sounds like an excellent next step.

I'd like to think we'd have noticed if direct E-mails were sent earlier, but to be honest we chose poorly on what address to associate with that automation account (years ago) and the direct notification would likely have been buried in a ton of noise (since remedied this week). The modification to step #2 would certainly have come to our attention though as that account is added to more projects at least weekly.

Regardless, given the point of having verified addresses for uploaders is to be able to reliably contact them, it's entirely reasonable for PyPI maintainers to consider a notice sent to those addresses as sufficient due diligence for such a behavior change.

dstufft commented 6 years ago

Ok, https://github.com/pypa/warehouse/pull/4322 prevents users from being added to a project unless they have a verified primary email address.

dstufft commented 6 years ago

One further modification, if I remove the qualifications for date_joined and not existing in rego_otk from my query, then the ~37k people increases to ~38k people, so I think that it makes sense to not limit it to accounts from before the switch.

Number of Users to email: 38,824

```
SELECT COUNT(DISTINCT accounts_user.id)
FROM accounts_user, accounts_email
WHERE accounts_user.id = accounts_email.user_id
  AND accounts_email.primary = 't'
  AND accounts_email.verified = 'f'
  AND accounts_email.unverify_reason IS NULL
  AND EXISTS (
    SELECT 1 FROM roles WHERE roles.user_name = accounts_user.username
  )
```

mlissner commented 6 years ago

I think I read through all of this, but I remain pretty annoyed that things just broke for my automated upload system.

Why can't we send emails before we start breaking things for people? Seems to me that we should do whatever we can to start emailing unverified accounts now before more people have to stop what they're doing in their otherwise productive day, to figure out what the heck is going on.

One other note: I pretty much never log into the website, so this is kind of a special opportunity for us, since a lot of the pypi users will suddenly be adjusting their accounts. I know that these days whenever I'm tinkering with my credentials, verifying accounts, etc, I make sure I have 2FA enabled while I'm there. It would be great to have 2FA ready to go before embarking on this ticket any further. If that's not far off, would it be crazy to do something like:

  1. Stop blocking uploads for a bit. This is an annoying and aggressive step to take without emailing first.
  2. Get 2FA enabled
  3. Send all the emails to people asking them to verify (note that 2FA is now available)
  4. Send emails to already-verified accounts telling them 2FA is now available
  5. THEN: Start breaking things for people again

I note that last week Node had a crisis because somebody didn't have 2FA enabled and their account got phished. Time to up this priority?

brainwane commented 6 years ago

Hi, @mlissner! I'm sorry that we broke things and didn't announce stuff first.

From December 2017 till the end of April 2018, PyPI had a paid project manager (me) who made sure that we gave lots of advance warning before breaking stuff. Then the grant ran out and we have, as far as I know, no one paid to work on PyPI; volunteers are improving the software and infrastructure sides of things and sometimes the communications side doesn't catch up as fast. The Packaging Working Group is seeking donations and applying for further grants to fund more design work, more and faster development, and better project management.

I'm interested in your idea regarding hooking this process into #996 and ask @mschwager to comment.

mlissner commented 6 years ago

I feel ya, thanks @brainwane. Hopefully the outline I've got there isn't really much more work if we were planning to email at the end anyway. Mostly, it'd just be a re-ordering of things, I think. But I do understand resource constraints. I'm largely in the same boat.

dstufft commented 6 years ago

We're unlikely to revert the changes at this point, as the primary thing we're waiting on before sending out the email is approval from the WG on spending money on sending out nearly 40k emails via MailChimp. Once that has been approved, we're going to be sending out the email fairly soon after.

Part of that is because we need a pretty clear line in the sand of who we're emailing, and "the set of people able to do things to PyPI" is a pretty reasonable set of people. However, we don't want to allow that set of people to grow, because then it gets much harder to campaign to get people to verify (because now we have to track who we have sent an email to in the past, and who we haven't).

Unfortunately, 2fa on PyPI is a non trivial amount of effort, because our uploading requires logging in with a username/password as well and doesn't have any mechanism in it to support 2fa. So we have a bit of a stack of yaks to shave before we're going to be able to meaningfully do that, and I don't think blocking this effort on that makes sense.

dstufft commented 6 years ago

Ok we've sent out the email to everyone, and stopped sending email to unverified email addresses, and it appears that people are indeed logging in and verifying their emails. We're down to 35k (from 37k) so far and that number is still dropping.

There's nothing else to be done on this issue, thanks everyone!

AkihiroSuda commented 6 years ago

@dstufft Sorry to be harsh, but your email really looked like a phishing mail :cry: Next time please consider removing all hyperlinks except https://pypi.org?

terryjreedy commented 6 years ago

In particular, the email tells people to log in and give their credentials at https://pypi.us18.list-manage.com/track/... .

di commented 6 years ago

Sorry folks, this was an oversight, we don't use MailChimp very often and didn't realize it would automatically wrap URLs in the message body with those tracking links.
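
A check along these lines could flag such links before a mailing goes out. This is only a sketch: the regular expression is deliberately simple, the function name is hypothetical, and because the tracking URLs were added by the mailing service, the check would have to run on the rendered message (for example a test send) rather than on the draft.

```python
import re
from urllib.parse import urlparse

# Deliberately simple URL matcher; good enough for plain-text bodies.
URL_RE = re.compile(r"https?://[^\s<>\"')]+")


def off_site_links(body: str, allowed_hosts=frozenset({"pypi.org"})) -> list:
    """Return every URL in the body whose host is not in the allowed set."""
    return [
        url for url in URL_RE.findall(body)
        if urlparse(url).hostname not in allowed_hosts
    ]
```

An empty result means every link in the body points at an allowed host.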

brainwane commented 6 years ago

And I'll belatedly mention that in mid-July I sent an announcement email to pypi-announce and tweeted as @ThePyPA asking people to verify their email addresses, so that is done.