python / cpython

The Python programming language
https://www.python.org

Support parsing invalid date strings #110648

Closed yfa-vagelis closed 1 year ago

yfa-vagelis commented 1 year ago

I don't know if this has already been discussed. I want to parse some date strings that are not always valid (e.g. 30 Feb 2023).

When I use datetime.strptime I get an error. It is raised not because the string fails to parse, but because the parsed fields (i.e., day, month, year) cannot form a valid datetime object.

Is it possible to get the parsed fields anyway?
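
The behavior described above can be reproduced with a short sketch. The format string `%d %b %Y` and the helper name `try_strptime` are assumptions for illustration; the thread does not show the actual call.

```python
from datetime import datetime

# Assumed format for inputs like "30 Feb 2023".
# %b is locale-dependent; English abbreviations work in the default C locale.
FORMAT = "%d %b %Y"


def try_strptime(text):
    """Return the parsed datetime, or None if strptime rejects the input."""
    try:
        return datetime.strptime(text, FORMAT)
    except ValueError as exc:
        # ValueError is raised both for text that does not match the format
        # and for well-formed text whose fields are not a real calendar date.
        print(f"{text!r} rejected: {exc}")
        return None


print(try_strptime("28 Feb 2023"))
print(try_strptime("30 Feb 2023"))
```

Because both failure modes surface as the same exception type, the caller cannot easily tell "unparseable" from "parseable but not a real date", which is the gap the question is about.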

majid-vaghari commented 1 year ago

Python's datetime library is designed to work with valid dates in the Gregorian calendar system. Using it to parse dates from other calendar systems or invalid dates can lead to unexpected errors and it is not recommended.

If you're dealing with dates in a different calendar system, it's better to use a dedicated library for that system. There are many third-party libraries available for different calendar systems, such as Persian or Chinese calendars. Implementing a custom calendar system is usually not a good idea, as it can be complex and error-prone.

If you're receiving data from a source that doesn't validate dates, it's important to handle this appropriately. Converting invalid dates (like '30 Feb') to valid ones (like '2 Mar') might seem like a solution, but it can lead to more problems. For example, the result depends on the year: '30 Feb' rolls over to '2 Mar' in a common year but to '1 Mar' in a leap year.
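
To make the leap-year issue concrete, here is a minimal sketch of one possible rollover rule: let the surplus days spill into the following month. The helper `roll_over` is hypothetical, not something the datetime module provides.

```python
from datetime import date, timedelta


def roll_over(year, month, day):
    """Hypothetical helper: let an out-of-range day spill into the next month."""
    # Start from the first of the month and step forward day - 1 days.
    return date(year, month, 1) + timedelta(days=day - 1)


print(roll_over(2023, 2, 30))  # 2023-03-02
print(roll_over(2024, 2, 30))  # 2024-03-01 (leap year)
```

The same input string maps to different calendar days depending on the year, so silently "fixing" the date may hide a data-quality problem rather than solve it.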

If you need to parse strings that might contain invalid dates, regardless of the reason, a good approach would be to use regular expressions. This allows you to extract specific patterns from strings without considering whether they form a valid date or not.

Here's an example of how you can do this:

import re

date = '30 Feb 2023'
# This pattern matches two digits (day),
# one or more whitespace characters, three letters (month),
# one or more whitespace characters, and four digits (year)
pattern = r'(?P<d>\d{2})\s+(?P<b>[A-Za-z]{3})\s+(?P<Y>\d{4})'
match = re.match(pattern, date.strip())

if match:
  day = match.group('d')
  month = match.group('b')
  year = match.group('Y')

  print(f'day: {day} month: {month} year: {year}')
else:
  print('no match found')

Note that this code assumes that the date is always in the format 'DD MMM YYYY'. If the format can vary, you'll need to adjust the regular expression accordingly.

For more information on regular expressions, refer to the Python re module documentation.
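
Building on the regex idea, a sketch that separates extraction from validation might look like the following. The names `parse_fields` and `is_real_date` are hypothetical, and the month table assumes English three-letter abbreviations only.

```python
import calendar
import re

# Assumption: English three-letter month abbreviations only.
MONTHS = {
    name.lower(): number
    for number, name in enumerate(
        ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
        start=1,
    )
}

PATTERN = re.compile(
    r"(?P<day>\d{1,2})\s+(?P<month>[A-Za-z]{3})\s+(?P<year>\d{4})"
)


def parse_fields(text):
    """Return (year, month, day) as ints without checking the calendar.

    Returns None when the text does not have the expected shape.
    """
    match = PATTERN.fullmatch(text.strip())
    if match is None:
        return None
    month = MONTHS.get(match.group("month").lower())
    if month is None:
        return None
    return int(match.group("year")), month, int(match.group("day"))


def is_real_date(year, month, day):
    """Check the day against the actual length of that month."""
    return 1 <= day <= calendar.monthrange(year, month)[1]


fields = parse_fields("30 Feb 2023")
print(fields, is_real_date(*fields))
```

Keeping the two steps apart lets the caller decide what to do with a well-formed but impossible date: reject it, log it, or correct it.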

yfa-vagelis commented 1 year ago

@majid-vaghari Hello and thanks for your reply!

Upon further investigation, I discovered that the parsing is carried out within the _strptime._strptime function, which generates a datetime object to compute weekday and julian if needed. Consequently, I'm still encountering the ValueError: day is out of range for month.

So, my question is, why isn't there a function available that exclusively performs parsing and provides the parsed fields without performing validation? Such a function would enable users to utilize the results as needed.

Edit: I was thinking about making a custom date parser using regex, as you suggested, but I wanted to avoid it since there is already a solution out there.
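
One might hope that `time.strptime`, which returns a plain `struct_time` rather than a `datetime`, skips the check. In current CPython it appears to go through the same `_strptime` code path, so the sketch below tests both instead of assuming. The format string and the helper `rejects` are assumptions for illustration.

```python
import time
from datetime import datetime

# Assumed format; %b expects English month names in the default C locale.
FORMAT = "%d %b %Y"


def rejects(parser, text):
    """Return True if the given strptime-style callable raises ValueError."""
    try:
        parser(text, FORMAT)
    except ValueError:
        return True
    return False


for name, parser in [("datetime.strptime", datetime.strptime),
                     ("time.strptime", time.strptime)]:
    print(name, "rejects '30 Feb 2023':", rejects(parser, "30 Feb 2023"))
```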

majid-vaghari commented 1 year ago

I understand your perspective and concerns. Python's datetime module is designed to work with valid dates in the Gregorian calendar system, as seen in the source code. (The core types have a C implementation, while strptime parsing lives in the pure-Python _strptime module.) As such, it validates the parsed fields to ensure they form a valid date.

There are several reasons why the functionality you're suggesting is not part of Python's datetime module:

  1. Data Integrity: By enforcing validation, Python's datetime module prevents data corruption and inconsistencies that could result in serious issues in your application.
  2. Predictability: The validation ensures that if you have a date object, it represents a valid date.
  3. Scope of Responsibility: The datetime module is designed to handle valid dates and time-related functions. Handling invalid dates would expand the scope of this module, potentially making it less efficient and harder to maintain.
  4. Performance: The addition of parsing invalid dates would increase complexity and could potentially impact the performance of the module.

Should the datetime module not validate dates, users would have to handle validation every time they used it, leading to potential inconsistencies and errors.
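
The predictability point can be illustrated with the `date` constructor itself, which refuses impossible field combinations. The wrapper `make_date` is a hypothetical helper for the demonstration.

```python
from datetime import date


def make_date(year, month, day):
    """Return a date, or None when the fields are not a real calendar day."""
    try:
        return date(year, month, day)
    except ValueError:
        return None


print(make_date(2024, 2, 29))  # leap day, accepted
print(make_date(2023, 2, 30))  # None: the constructor refuses it
```

Any `date` object that exists has therefore already passed this check, so downstream code does not need to re-validate it.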

While it's technically possible to read the source code and use undocumented functions in Python and C to bypass the validation, doing so is not recommended. This approach is messy, prone to errors, and can lead to maintenance issues as these functions are not officially supported and could change in future versions of Python. Plus, it's not very practical and there's really no need to make such functionality available in Python's datetime.

While I understand that building a custom parser using regular expressions may not be the ideal solution, it may be the most viable one given your unique requirements. It offers the flexibility to parse the date fields without validation and handle invalid dates according to your specific needs.

Alternatively, you could explore third-party date parsing libraries that might better cater to your needs. Please bear in mind that dealing with invalid dates always requires careful consideration to avoid data inconsistencies and errors.

sunmy2019 commented 1 year ago

> So, my question is, why isn't there a function available that exclusively performs parsing and provides the parsed fields without performing validation? Such a function would enable users to utilize the results as needed.

There must be one out there somewhere. Customized needs can be met through third-party packages.

yfa-vagelis commented 1 year ago

@sunmy2019 Indeed, I'll have to choose a custom solution after all.

Thank you both anyway!