Password & Username verification should use regex

bobwaycott commented 6 years ago

So, I was randomly poking around in Flask-User today, curious about how passwords were verified & checked. I was really surprised when I found the code. What is currently being done to validate passwords and usernames is very inefficient & unnecessary. Regex is a perfect tool for validating passwords, usernames, emails, and a number of other string- and pattern-based rules.

I believe Flask-User is in the midst of active 0.9 development, and thought I'd just drop in a quick note that this could be drastically improved with regex.

For example, password/username validation can be simplified to something like this and still match the same rules:

from re import compile

PASSWORD_REGEX = compile(r'\A(?=\S*?\d)(?=\S*?[A-Z])(?=\S*?[a-z])\S{6,}\Z')
USERNAME_REGEX = compile(r'\A[\w\-\.]{3,}\Z')

def password_is_valid(password):
    return PASSWORD_REGEX.match(password) is not None

def username_is_valid(username):
    return USERNAME_REGEX.match(username) is not None

Just thought I'd bring it up for consideration. This really is the perfect use case for regex, and it's pretty odd to see passwords and usernames being validated by iterating over the entire string in Python code, counting how many times each character appears, and then checking those counts.

lingthio commented 6 years ago

Hi @bobwaycott , I'm torn on this issue. Your code is beautifully compact. I, however, am not fluent in Regex expressions, and from my experience, not many developers are. I have a very hard time even reading your expressions, let alone having to write one. When I look at the current Python code, I imagine that any Python developer can read, understand, and modify. I lean towards valuing readability over compactness in this case.

As a compromise, I will add your code to the code base as a comment. Thanks!

bobwaycott commented 6 years ago

I must admit I'm pretty surprised Flask-User would ignore a huge performance win—not to mention implementing something as common as a username & password validator using the right tool for the job (regex). Sure, regex can take a bit of getting used to when you don't deal with it constantly, but I disagree that many developers are wholly unfamiliar with regex. That hasn't been my experience over the last 10+ years.

I'm going to share a quick timeit readout on Flask-User's original validation functions, vs the regex-based versions. I've even made some quick & easy modifications to give the original code a speed boost in the _optimized variants below (there are some simple things Flask-User should be doing that it is not).

You'll find the original versions here as the _noregex functions. They've only been modified to accept username & password args directly (instead of accessing them via forms), and I also changed them to return True/False instead of raising exceptions (the goal being to let the functions exit as quickly as they possibly can without incurring more overhead). Additionally, to give each function a relatively level playing field, I've set 8-char & 16-char versions of both a username & password arguments as constants so they wouldn't be created each run when timing.

Here is the code being used:

from re import compile

VALID_CHARS = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-._'
PASSWORD_REGEX = compile(r'\A(?=\S*?\d)(?=\S*?[A-Z])(?=\S*?[a-z])\S{6,}\Z')
USERNAME_REGEX = compile(r'\A[\w\-\.]{3,}\Z')

USER8 = 'username'
USER16 = 'usernameusername'
PASSWORD8 = 'password'
PASSWORD16 = 'passwordpassword'

def password_regex(password):
    return PASSWORD_REGEX.match(password) is not None

def username_regex(username):
    return USERNAME_REGEX.match(username) is not None

def password_noregex(password):
    """ Password must be 6+ chars and have 1 lowercase letter, 1 uppercase letter and 1 digit. """
    # Convert string to list of characters
    password = list(password)
    password_length = len(password)

    # Count lowercase, uppercase and numbers
    lowers = uppers = digits = 0
    for ch in password:
        if ch.islower(): lowers+=1
        if ch.isupper(): uppers+=1
        if ch.isdigit(): digits+=1
    return password_length >= 6 and lowers and uppers and digits

def username_noregex(username):
    """ Username must be at least 3 alphanumeric characters long & may contain . - _ chars."""
    if len(username) < 3:
        return False
    valid = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-._'
    chars = list(username)
    for char in chars:
        if char not in valid:
          return False
    return True

def password_noregex_optimized(password):
    """ Password must be
        - 6 chars or more
        - 1 lowercase letter
        - 1 uppercase letter
        - 1 digit
    """
    # Bail now if password is < 6 chars
    if len(password) < 6:
        return False
    # Count lowercase, uppercase and numbers
    lowers = uppers = digits = 0
    for ch in password:
        if ch.islower(): lowers+=1
        if ch.isupper(): uppers+=1
        if ch.isdigit(): digits+=1
    return lowers and uppers and digits

def username_noregex_optimized(username):
    """ Username must cont at least 3 alphanumeric characters long"""
    if len(username) < 3:
        return False
    for char in username:
        if char not in VALID_CHARS:
          return False
    return True

Here are the results of timing the functions with as level a playing field as is possible (stated & shown in code above).

Bottom line

Method	Length	Original (µs)	Optimized (µs)	Regex (µs)
username_validator	8	2070	1720	713
username_validator	16	3160	2870	810
password_validator	8	5670	4930	895
password_validator	16	9950	9050	1160

Regex gives you a substantial performance improvement. And I haven't even optimized the regex pattern (one easy win would be to check the string length first to reject too-short usernames/passwords before checking if characters are valid).

More importantly: Regex-based validation time does not increase linearly with the length of the string, whereas per-character checking in a Python loop does.

Original functions

>>> %timeit for x in xrange(0, 1000): username_noregex(USER8)
100 loops, best of 3: 2070µs per loop

>>> %timeit for x in xrange(0, 1000): username_noregex(USER16)
100 loops, best of 3: 3160µs ms per loop

>>> %timeit for x in xrange(0, 1000): password_noregex('asdf')
100 loops, best of 3: 3520µs per loop

>>> %timeit for x in xrange(0, 1000): password_noregex(PASSWORD8)
100 loops, best of 3: 5670µs ms per loop

>>>  %timeit for x in xrange(0, 1000): password_noregex(PASSWORD16)
100 loops, best of 3: 9950µs ms per loop

Just trying to bail on a too-short password takes 3.5ms!

Slightly optimized versions of original functions

>>> %timeit for x in xrange(0, 1000): username_noregex_optimized(USER8)
1000 loops, best of 3: 1720µs per loop

>>> %timeit for x in xrange(0, 1000): username_noregex_optimized(USER16)
100 loops, best of 3: 2870µs per loop

>>> %timeit for x in xrange(0, 1000): password_noregex_optimized(PASSWORD8)
100 loops, best of 3: 4930µs per loop

>>> %timeit for x in xrange(0, 1000): password_noregex_optimized(PASSWORD16)
100 loops, best of 3: 9050µs per loop

Using `re.compile`d regex patterns

>>> %timeit for x in xrange(0, 1000): username_regex(USER8)
1000 loops, best of 3: 713 µs per loop

>>> %timeit for x in xrange(0, 1000): username_regex(USER16)
1000 loops, best of 3: 810 µs per loop

>>> %timeit for x in xrange(0, 1000): password_regex(PASSWORD8)
1000 loops, best of 3: 895µs per loop

>>> %timeit for x in xrange(0, 1000): password_regex(PASSWORD16)
1000 loops, best of 3: 1160µs per loop

Optimizing notes

Flask-User is instantiating a string of valid characters every time a username is validated. This could be a constant so it only gets created once (and could potentially be exposed as an overridable setting to those who use Flask-User).

The current username/password validators insist on turning supplied usernames & passwords into a list. However, str and unicode objects are inherently iterable in Python, meaning Flask-User is actually hurting its performance by coercing usernames/passwords to a list. Here's a quick example that shows the results of iterating over a string vs iterating over a string that is first converted to a list. Notice we're incurring (in this timed test) a 40µs overhead with the list call.

>>> %timeit for x in range(100): [char for char in 'username']
10000 loops, best of 3: 139 µs per loop

>>> %timeit for x in range(100): [char for char in list('username')]
10000 loops, best of 3: 179 µs per loop

Extra improvement thought

If Flask-User used regex-based username/password validation, it would be trivial & a pretty nice win to expose the USERNAME_REGEX and PASSWORD_REGEX patterns as config options—allowing anyone who uses Flask-User to easily implement their own validation rules and do so in a way that is industry standard, and performance-minded.

onomojo commented 6 years ago

Validating via regex is the right approach.

lingthio / Flask-User