Closed antalgu closed 1 week ago
Hi,
Thanks for clearly outlining and describing the issue! I agree with your assessment: dateutil.parser seems like a much better option for handling date parsing. I attempted to add your suggestions in #94. Specifically, the BaseParser class and its _parse_date method now look like this:
class BaseParser:
    reg_expressions = {}
    date_keys = ()
    multiple_match_keys = ()
    # For handling special cases in TLD parser classes
    known_date_formats = []
    # Extra formats that dateutil might not figure out
    extra_date_formats = [
        "%Y-%m-%dT%H:%M:%SZ[%Z]",  # 2007-01-26T19:10:31Z[UTC]
        "%Y-%m-%dT%H:%M:%S.%fZ",  # 2018-12-01T16:17:30.568Z
        "%Y-%m-%dT%H:%M:%S%zZ",  # 1970-01-01T02:00:00+02:00Z
        "%Y-%m-%dt%H:%M:%S.%fz",  # 2007-01-26t19:10:31.00z
        "%Y-%m-%d %H:%M:%SZ",  # 2000-08-22 18:55:20Z
        "before %b-%Y",  # before aug-1996
    ]
    # Additional timezone info for dateutil
    timezone_info = {
        "KST": tz.gettz("Asia/Seoul"),  # Korea Standard Time UTC+9
        "JST": tz.gettz("Asia/Tokyo"),  # Japan Standard Time UTC+9
        "EEST": tz.gettz("Europe/Athens"),  # Eastern European Summertime UTC+3
    }

    ...

    def _parse_date(self, date_string: str) -> Union[datetime, str]:
        """
        Attempts to convert the given date string to a datetime.datetime object
        otherwise returns the input `date_string`
        :param date_string: a date string
        :return: a datetime.datetime object
        """

        def _datetime_or_none(dt_string: str, dt_format: str) -> Union[datetime, None]:
            try:
                return datetime.strptime(dt_string, dt_format)
            except ValueError:
                return None

        # first, try the known formats
        for date_format in self.known_date_formats:
            if date := _datetime_or_none(date_string, date_format):
                return date
        # next, try dateutil.parse
        try:
            clean_date_string = re.sub(r"\(([^)]+)\)", r"\1", date_string).strip()
            return parse(clean_date_string, tzinfos=self.timezone_info)
        except ParserError:
            pass
        # finally, try extra formats
        for date_format in self.extra_date_formats:
            if date := _datetime_or_none(date_string, date_format):
                return date
        # no luck parsing
        return date_string
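For reference, here is a minimal stdlib-only sketch of what the final fallback step does with a few of the extra formats. The `try_formats` helper and the sample strings are illustrative and not part of asyncwhois, and it assumes an English or C locale so that `%b` matches `aug`.

```python
from datetime import datetime
from typing import Optional, Sequence

# A subset of the extra formats quoted above.
EXTRA_DATE_FORMATS = [
    "%Y-%m-%dT%H:%M:%S.%fZ",  # 2018-12-01T16:17:30.568Z
    "%Y-%m-%dt%H:%M:%S.%fz",  # 2007-01-26t19:10:31.00z
    "%Y-%m-%d %H:%M:%SZ",     # 2000-08-22 18:55:20Z
    "before %b-%Y",           # before aug-1996
]


def try_formats(date_string: str, formats: Sequence[str]) -> Optional[datetime]:
    """Return the first successful strptime result, or None if no format matches."""
    for fmt in formats:
        try:
            return datetime.strptime(date_string, fmt)
        except ValueError:
            continue
    return None


for sample in (
    "2018-12-01T16:17:30.568Z",
    "2000-08-22 18:55:20Z",
    "before aug-1996",
    "not a date",
):
    print(f"{sample!r} -> {try_formats(sample, EXTRA_DATE_FORMATS)}")
```

Note that in these formats the trailing `Z` is matched as literal text, so the resulting datetimes are naive rather than timezone-aware.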
I also added a modified version of your example script as tests/test_dateparsers.py, which looks like:
from datetime import datetime

from asyncwhois.parse import BaseParser


def test_dateparsers():  # noqa
    date_and_time_examples = [
        "2010-07-04 04:18:23 +03:00",
        "2024-02-19 01:30:15.927683+11",
        "2008-08-31 04:14:06 KST",
        "2024/01/01 01:05:04 (JST)",
        "07 Aug 2024",
    ]
    date_and_time_examples += [
        "02-jan-2000",
        "11-February-2000",
        "20-10-2000",
        "2000-01-02",
        "2.1.2000",
        "2000.01.02",
        "2000/01/02",
        "2011/06/01 01:05:01",
        "2011/06/01 01:05:01 (+0900)",
        "20170209",
        "20110908 14:44:51",
        "02/01/2013",
        "2000. 01. 02.",
        "2014.03.08 10:28:24",
        "24-Jul-2009 13:20:03 UTC",
        "Tue Jun 21 23:59:59 GMT 2011",
        "2007-01-26T19:10:31",
        "2007-01-26T19:10:31Z",
        "2007-01-26T19:10:31Z[UTC]",  # extra
        "2018-12-01T16:17:30.568Z",  # extra
        "2011-09-08T14:44:51.622265+03:00",
        "2013-12-06T08:17:22-0800",
        "1970-01-01T02:00:00+02:00Z",  # extra
        "2011-09-08t14:44:51.622265",
        "2007-01-26T19:10:31",
        "2007-01-26T19:10:31Z",
        "2007-01-26t19:10:31.00z",  # extra
        "2011-03-30T19:36:27+0200",
        "2011-09-08T14:44:51.622265+03:00",
        "2000-08-22 18:55:20Z",  # extra
        "2000-08-22 18:55:20",
        "08 Apr 2013 05:44:00",
        "23/04/2015 12:00:07",
        "23/04/2015 12:00:07 EEST",
        "23/04/2015 12:00:07.619546 EEST",
        "2015-04-23 12:00:07.619546",
        "August 14 2017",
        "08.03.2014 10:28:24",
        "Tue Dec 12 2000",
        "before aug-1996",  # extra
        "2017-09-26 11:38:29 (GMT+00:00)",
    ]
    bp = BaseParser()
    for dt in date_and_time_examples:
        result = bp._parse_date(dt)
        assert isinstance(result, datetime), f"Failed to parse date string: {dt}"
Let me know if this looks OK or if there is anything I may have overlooked in your suggestion.
We've been using your library for a while and have hit quite a few validation errors from unhandled datetime formats. About two hundred of them come from these formats:
2010-07-04 04:18:23 +03:00
2024-02-19 01:30:15.927683+11
07 Aug 2024
2008-08-31 04:14:06 KST
2024/01/01 01:05:04 (JST)
dateutil.parser was able to handle the first three formats, and by adding a function that removes the "(" and ")" and passing tzinfos, I was able to convert all of them properly:
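import re

from dateutil import tz
from dateutil.parser import parse


def clean_timezone(date_str):
    # Drop the parentheses but keep what is inside them, e.g. "(JST)" -> "JST"
    return re.sub(r'\(([^)]+)\)', r'\1', date_str).strip()

... Existing code ...
tzinfos = {
    'KST': tz.gettz('Asia/Seoul'),  # Korea Standard Time UTC+9
    'JST': tz.gettz('Asia/Tokyo'),  # Japan Standard Time UTC+9
}
....
# Only parse values that are still raw strings
if isinstance(whois_info['created'], str):
    created = parse(clean_timezone(whois_info['created']), tzinfos=tzinfos)
else:
    created = whois_info['created']
...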
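To show what the parenthesis clean-up does on its own, here is a small stdlib-only sketch. The helper name `strip_parentheses` is made up for illustration, and the sample strings are taken from the formats discussed in this thread.

```python
import re


def strip_parentheses(date_str: str) -> str:
    """Drop parentheses but keep their contents, e.g. '(JST)' -> 'JST'."""
    return re.sub(r"\(([^)]+)\)", r"\1", date_str).strip()


samples = [
    "2024/01/01 01:05:04 (JST)",
    "2011/06/01 01:05:01 (+0900)",
    "07 Aug 2024",  # no parentheses, left unchanged
]
cleaned = [strip_parentheses(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(f"{before!r} -> {after!r}")
```

The cleaned strings are what then gets handed to the parser together with the tzinfos mapping.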
I noticed you're using a list of date_time formats and trying to parse the string with these formats: https://github.com/pogzyb/asyncwhois/blob/b9cdc92a57b9a56e64fbb23947e9223c77ab47c6/asyncwhois/parse.py#L337-L350
I think you would be better off using dateutil.parser with a few tzinfos, and keeping a reduced list of formats as a fallback for the cases it can't handle.
I've attached a Python script (as .txt) that tests all your formats plus these 5, and times them with dateutil.parser and with your function. There are 6 commented-out formats, which are the only ones you would need to handle in an except branch with your method instead of with dateutil.parser. The tests might not be 100% accurate: your list could be ordered by popularity, so running the same number of tests for each format may not be the best method. However, I think the difference in time is enough to indicate that the new method might be better.
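One way to see the ordering effect without timing anything is to count how many strptime attempts an ordered format list needs for each input. This is only a sketch: the three-entry `FORMATS` list and the `parse_counting_attempts` helper are made up for illustration and are not the library's actual list or API.

```python
from datetime import datetime
from typing import List, Optional, Tuple

# Hypothetical ordered format list (illustration only).
FORMATS = ["%Y-%m-%d", "%d-%b-%Y", "%Y/%m/%d %H:%M:%S"]


def parse_counting_attempts(
    date_string: str, formats: List[str]
) -> Tuple[Optional[datetime], int]:
    """Try formats in order; return (result, number of strptime calls made)."""
    attempts = 0
    for fmt in formats:
        attempts += 1
        try:
            return datetime.strptime(date_string, fmt), attempts
        except ValueError:
            continue
    return None, attempts


for s in ("2000-01-02", "02-jan-2000", "2011/06/01 01:05:01", "07 Aug 2024"):
    result, attempts = parse_counting_attempts(s, FORMATS)
    print(f"{s!r}: parsed={result is not None}, attempts={attempts}")
```

Inputs that match late in the list, or not at all, pay for every failed attempt before them, so a benchmark that weights every format equally can differ from what real WHOIS traffic would show.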
asynwhois_vs_custom_tests.txt