python / cpython

The Python programming language
https://www.python.org
Other
63.11k stars 30.22k forks source link

_strptime.TimeRE should not enforce range in regex #69117

Open 0ffa5ce0-07b2-47b7-90fe-79403f0626b8 opened 9 years ago

0ffa5ce0-07b2-47b7-90fe-79403f0626b8 commented 9 years ago
BPO 24929
Nosy @bitdancer, @catherinedevlin, @pganssle
PRs
  • python/cpython#26215
  • Files
  • file
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = None closed_at = None created_at = labels = ['type-bug', 'library'] title = '_strptime.TimeRE should not enforce range in regex' updated_at = user = 'https://bugs.python.org/SteveYeung' ``` bugs.python.org fields: ```python activity = actor = 'catherinedevlin' assignee = 'none' closed = False closed_date = None closer = None components = ['Library (Lib)'] creation = creator = 'Steve Yeung' dependencies = [] files = ['40319'] hgrepos = [] issue_num = 24929 keywords = ['patch'] message_count = 8.0 messages = ['249074', '249133', '249141', '249518', '249541', '393904', '393948', '394146'] nosy_count = 5.0 nosy_names = ['catherinedevlin', 'r.david.murray', 'Catherine.Devlin', 'Steve Yeung', 'p-ganssle'] pr_nums = ['26215'] priority = 'low' resolution = None stage = 'patch review' status = 'open' superseder = None type = 'behavior' url = 'https://bugs.python.org/issue24929' versions = ['Python 3.6'] ```

    0ffa5ce0-07b2-47b7-90fe-79403f0626b8 commented 9 years ago

    Currently, the regex in TimeRE enforces the numeric ranges. For example: 'm': r"(?P\<m>1[0-2]|0[1-9]|[1-9])",

    As a result, an invalid month will cause an generic regex error: ValueError: time data '2015/16/5' does not match format '%Y/%m/%d'

    However, if we relax the regex to not check the range and allow datetime to handle it: 'm': r"(?P\<m>\d{1,2})"

    The error will be handle in datetime instead and the error will be much more helpful: ValueError: month must be in 1..12

    Please consider relaxing the regex for numeric ranges in _strptime.TimeRE.

    brettcannon commented 9 years ago

    Do realize that the strptime code is shared with time.strptime() and so this change would have to make sense in both contexts.

    This change can also only happen in Python 3.6 because it is backwards-incompatible due to people potentially already catching the previous exception.

    bitdancer commented 9 years ago

    And it does not make sense for time, since time.strptime does no additional validation, it just returns the result returned by _strptime.

    0ffa5ce0-07b2-47b7-90fe-79403f0626b8 commented 9 years ago

    I'm not sure what format I'm supposed to provide the test in. I attached a file that has the diff of the changes I made, and how the error message is changed (and improved!) in both datetime and time.

    bitdancer commented 9 years ago

    A unit test in test_strptime.

    pganssle commented 3 years ago

    I also commented on python/issues-test-cpython#26215 ( https://github.com/python/cpython/pull/26215 ), but for posterity, I'll note a few things:

    1. It seems that (and this may have changed since 2015), _strptime._strptime now has a stage that (unconditionally?) constructs a temporary datetime_date, which means it does do this particular validation in both time.strptime and datetime.strptime. That said, both flavors of strptime are way slower than I'd like them to be, and constructing an unnecessary date/datetime is a pretty good way to slow down your function, so if we ever go around optimizing this function, that may be one of the first bits on the chopping block.

    2. The logic for strptime is very complicated and it's very hard to test the full input space of the function (particularly since we're not using property tests (yet)...). This makes me somewhat uneasy about moving the validation stage from the beginning of the function (in parsing the regular expression) to the very end of the function (in the datetime constructor).

    It's probably safe to do so, but it may also be worth exploring the possibility of validating this directly in _strptime (possibly immediately after the string is parsed by the regex), and raising a cleaner error message on failure.

    Probably not worth spending a ton of time on that compared to improving the testing around this so that we can feel confident making changes under the hood. .strptime is really quite slow, and I wouldn't be surprised if we pulled out its guts and replaced most of the regex stuff with a fast C parser at some point in the future. Having good tests will both give us confidence to make this change (and that making this change won't lead to regressions in the future) and help with any future project to replace _strptime._strptime with a faster version, so I'd say that's the most important thing to do here.

    1fb19202-35e4-4e43-b523-9ebd42659bbd commented 3 years ago

    Can we close the ticket? It doesn't sound like we're going to go with anything like this approach.

    If there's a good place to record the general ambitions (better strptime tests, more clear strptime errors), that would be good, but I don't think we use tickets for that.

    1fb19202-35e4-4e43-b523-9ebd42659bbd commented 3 years ago

    Thinking about this a little more...

    I suggest that slowing down execution shouldn't be a worry in this case, because trying to parse a garbage string into a date shouldn't be something code is doing zillions of times, and if it is, being slow isn't a terrible thing. It should be the exception (so to speak) and not the rule, and thus not something we should worry about optimizing for.

    That said, I can see not wanting to rely on throwing low-level errors that are outside Python's direct control. (In fact, the PR I wrote didn't account for the OS-dependent nature of those messages, so if we decide to go ahead after all I'll need to make my PR account for that.) If that is enough reason to reject the change, I'm content with that.

    One way or the other, I would love to get this out of "Open" status - if we want to just close it as-is, I'll be satisfied.