pogzyb / asyncwhois

Python WHOIS and RDAP utility for querying and parsing information about Domains, IPv4s, IPv6s, and AS numbers
MIT License

Uncontrolled datetime formats #93

Closed antalgu closed 1 week ago

antalgu commented 2 weeks ago

We've been using your library for a while and have hit quite a few validation errors caused by unhandled datetime formats. About two hundred of the affected date-time values use one of these formats:

    2010-07-04 04:18:23 +03:00
    2024-02-19 01:30:15.927683+11
    07 Aug 2024
    2008-08-31 04:14:06 KST
    2024/01/01 01:05:04 (JST)

The parser from dateutil (dateutil.parser.parse) handled the first three formats. By adding a function that removes the "(" and ")" and passing tzinfos, I was able to convert all of them properly:

def clean_timezone(date_str):
    # Drop the parentheses around a timezone, e.g. "(JST)" -> "JST"
    return re.sub(r'\(([^)]+)\)', r'\1', date_str).strip()

... Existing code ...
tzinfos = {
    'KST': tz.gettz('Asia/Seoul'),  # Korea Standard Time UTC+9
    'JST': tz.gettz('Asia/Tokyo'),  # Japan Standard Time UTC+9
}
....
if type(whois_info['created']) == str:
    created = parse(clean_timezone(whois_info['created']), tzinfos=tzinfos)
else:
    created = whois_info['created']
...
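For reference, the approach described above can be tried in isolation. This is a minimal sketch, assuming python-dateutil is installed; the `TZINFOS` and `samples` names are made up for illustration and are not part of asyncwhois.

```python
import re

from dateutil import tz
from dateutil.parser import parse

# Abbreviations passed to dateutil explicitly (illustrative mapping)
TZINFOS = {
    "KST": tz.gettz("Asia/Seoul"),  # Korea Standard Time, UTC+9
    "JST": tz.gettz("Asia/Tokyo"),  # Japan Standard Time, UTC+9
}


def clean_timezone(date_str: str) -> str:
    """Drop parentheses around a timezone, e.g. '(JST)' -> 'JST'."""
    return re.sub(r"\(([^)]+)\)", r"\1", date_str).strip()


# The five formats listed at the top of this issue
samples = [
    "2010-07-04 04:18:23 +03:00",
    "2024-02-19 01:30:15.927683+11",
    "07 Aug 2024",
    "2008-08-31 04:14:06 KST",
    "2024/01/01 01:05:04 (JST)",
]

# Map each raw string to the datetime dateutil produces for it
parsed = {s: parse(clean_timezone(s), tzinfos=TZINFOS) for s in samples}

for raw, value in parsed.items():
    print(f"{raw!r} -> {value.isoformat()}")
```

Strings without a timezone, such as "07 Aug 2024", come back as naive datetimes, while the others carry a UTC offset.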

I noticed you're using a list of date_time formats and trying to parse the string with these formats: https://github.com/pogzyb/asyncwhois/blob/b9cdc92a57b9a56e64fbb23947e9223c77ab47c6/asyncwhois/parse.py#L337-L350

I think you would be better off using dateutil.parser with a few tzinfos, and falling back to a reduced list of formats only when dateutil cannot parse the string.

Attached is a Python script (as a .txt file) that tests all of your formats plus these five, and times both dateutil.parser and your function. There are six commented formats, which are the only ones you would still need to handle with an except clause and your existing method instead of dateutil.parser. The timings might not be fully representative: your list could be ordered by popularity, in which case running the same number of tests for each format is not the fairest method. Still, I think the difference in time is large enough to suggest that the new approach is better.
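The attached script is not reproduced here, but a comparison of this kind could look roughly like the sketch below. It assumes python-dateutil is installed; `FORMATS` is a hypothetical subset of strptime formats chosen for illustration, not the list from asyncwhois, and the function names are made up.

```python
import timeit
from datetime import datetime

from dateutil import tz
from dateutil.parser import parse

# Hypothetical subset of strptime formats, tried in order
FORMATS = [
    "%d-%b-%Y",
    "%Y-%m-%d",
    "%Y/%m/%d %H:%M:%S",
    "%Y-%m-%dT%H:%M:%S",
    "%d %b %Y",
]
TZINFOS = {"KST": tz.gettz("Asia/Seoul"), "JST": tz.gettz("Asia/Tokyo")}


def parse_with_format_list(text):
    """Try each strptime format in turn; later formats cost more failed attempts."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None


def parse_with_dateutil(text):
    """Let dateutil infer the format."""
    return parse(text, tzinfos=TZINFOS)


# One sample per format, so every position in FORMATS is exercised equally
samples = [
    "02-jan-2000",
    "2000-01-02",
    "2011/06/01 01:05:01",
    "2007-01-26T19:10:31",
    "07 Aug 2024",
]

for name, func in (("format list", parse_with_format_list), ("dateutil", parse_with_dateutil)):
    seconds = timeit.timeit(lambda: [func(s) for s in samples], number=1000)
    print(f"{name}: {seconds:.3f}s for {1000 * len(samples)} parses")
```

Because each sample is parsed the same number of times, this setup has the same limitation mentioned above: if real inputs mostly match the first formats in the list, the format-list approach would look better in practice than it does here.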

tests asynwhois_vs_custom_tests.txt

pogzyb commented 1 week ago

Hi,

Thanks for clearly outlining and describing the issue! I agree with your assessment: dateutil.parser.parse seems like a much better option for handling date parsing. I attempted to add your suggestions in #94. Specifically, the BaseParser class and its _parse_date method now look like this:

class BaseParser:
    reg_expressions = {}

    date_keys = ()
    multiple_match_keys = ()

    # For handling special cases in TLD parser classes
    known_date_formats = []
    # Extra formats that dateutil might not figure out
    extra_date_formats = [
        "%Y-%m-%dT%H:%M:%SZ[%Z]",  # 2007-01-26T19:10:31Z[UTC]
        "%Y-%m-%dT%H:%M:%S.%fZ",  # 2018-12-01T16:17:30.568Z
        "%Y-%m-%dT%H:%M:%S%zZ",  # 1970-01-01T02:00:00+02:00Z
        "%Y-%m-%dt%H:%M:%S.%fz",  # 2007-01-26t19:10:31.00z
        "%Y-%m-%d %H:%M:%SZ",  # 2000-08-22 18:55:20Z
        "before %b-%Y",  # before aug-1996
    ]
    # Additional timezone info for dateutil
    timezone_info = {
        "KST": tz.gettz("Asia/Seoul"),  # Korea Standard Time UTC+9
        "JST": tz.gettz("Asia/Tokyo"),  # Japan Standard Time UTC+9
        "EEST": tz.gettz("Europe/Athens"),  # Eastern European Summertime UTC+3
    }

    ...

    def _parse_date(self, date_string: str) -> Union[datetime, str]:
        """
        Attempts to convert the given date string to a datetime.datetime object
        otherwise returns the input `date_string`
        :param date_string: a date string
        :return: a datetime.datetime object
        """

        def _datetime_or_none(dt_string: str, dt_format: str) -> Union[datetime, None]:
            try:
                return datetime.strptime(dt_string, dt_format)
            except ValueError:
                return None

        # first, try the known formats
        for date_format in self.known_date_formats:
            if date := _datetime_or_none(date_string, date_format):
                return date
        # next, try dateutil.parse
        try:
            clean_date_string = re.sub(r"\(([^)]+)\)", r"\1", date_string).strip()
            return parse(clean_date_string, tzinfos=self.timezone_info)
        except ParserError:
            pass
        # finally, try extra formats
        for date_format in self.extra_date_formats:
            if date := _datetime_or_none(date_string, date_format):
                return date
        # no luck parsing
        return date_string
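The last stage of `_parse_date` relies only on `datetime.strptime`, so it can be checked without dateutil. The sketch below pairs each entry of `extra_date_formats` with the example from its comment and parses it; the `EXTRA_FORMATS` dict and `parse_extra` helper are illustrative names, not part of the library.

```python
from datetime import datetime

# Format -> example string, copied from the comments in extra_date_formats
EXTRA_FORMATS = {
    "%Y-%m-%dT%H:%M:%SZ[%Z]": "2007-01-26T19:10:31Z[UTC]",
    "%Y-%m-%dT%H:%M:%S.%fZ": "2018-12-01T16:17:30.568Z",
    "%Y-%m-%dT%H:%M:%S%zZ": "1970-01-01T02:00:00+02:00Z",
    "%Y-%m-%dt%H:%M:%S.%fz": "2007-01-26t19:10:31.00z",
    "%Y-%m-%d %H:%M:%SZ": "2000-08-22 18:55:20Z",
    "before %b-%Y": "before aug-1996",
}


def parse_extra(text):
    """Return a datetime from the first extra format that fits, else None."""
    for fmt in EXTRA_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None


# Parse every example and keep the results keyed by the raw string
results = {example: parse_extra(example) for example in EXTRA_FORMATS.values()}

for raw, value in results.items():
    print(f"{raw!r} -> {value!r}")
```

Note that strptime matches literal characters case-insensitively, so an input such as "2007-01-26t19:10:31.00z" may already be accepted by the uppercase `%Y-%m-%dT%H:%M:%S.%fZ` entry before the lowercase variant is reached.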

I also added and modified your example script under tests/test_dateparsers.py, which looks like:

from datetime import datetime

from asyncwhois.parse import BaseParser

def test_dateparsers():  # noqa
    date_and_time_examples = [
        "2010-07-04 04:18:23 +03:00",
        "2024-02-19 01:30:15.927683+11",
        "2008-08-31 04:14:06 KST",
        "2024/01/01 01:05:04 (JST)",
        "07 Aug 2024",
    ]
    date_and_time_examples += [
        "02-jan-2000",
        "11-February-2000",
        "20-10-2000",
        "2000-01-02",
        "2.1.2000",
        "2000.01.02",
        "2000/01/02",
        "2011/06/01 01:05:01",
        "2011/06/01 01:05:01 (+0900)",
        "20170209",
        "20110908 14:44:51",
        "02/01/2013",
        "2000. 01. 02.",
        "2014.03.08 10:28:24",
        "24-Jul-2009 13:20:03 UTC",
        "Tue Jun 21 23:59:59 GMT 2011",
        "2007-01-26T19:10:31",
        "2007-01-26T19:10:31Z",
        "2007-01-26T19:10:31Z[UTC]",  # extra
        "2018-12-01T16:17:30.568Z",  # extra
        "2011-09-08T14:44:51.622265+03:00",
        "2013-12-06T08:17:22-0800",
        "1970-01-01T02:00:00+02:00Z",  # extra
        "2011-09-08t14:44:51.622265",
        "2007-01-26T19:10:31",
        "2007-01-26T19:10:31Z",
        "2007-01-26t19:10:31.00z",  # extra
        "2011-03-30T19:36:27+0200",
        "2011-09-08T14:44:51.622265+03:00",
        "2000-08-22 18:55:20Z",  # extra
        "2000-08-22 18:55:20",
        "08 Apr 2013 05:44:00",
        "23/04/2015 12:00:07",
        "23/04/2015 12:00:07 EEST",
        "23/04/2015 12:00:07.619546 EEST",
        "2015-04-23 12:00:07.619546",
        "August 14 2017",
        "08.03.2014 10:28:24",
        "Tue Dec 12 2000",
        "before aug-1996",  # extra
        "2017-09-26 11:38:29 (GMT+00:00)",
    ]

    bp = BaseParser()

    for dt in date_and_time_examples:
        result = bp._parse_date(dt)
        assert isinstance(result, datetime), f"Failed to parse date string: {dt}"

Let me know if this looks OK or if there is anything I may have overlooked in your suggestion.