openva / crump

A parser for the Virginia State Corporation Commission's business registration records.
https://vabusinesses.org/
MIT License
20 stars 3 forks source link

Try to fix stupid dates #96

Closed waldoj closed 9 years ago

waldoj commented 9 years ago

There are a lot of expiry dates that are invalid dates, which causes Elasticsearch to reject the entire record. In the field expiration_date we're seeing dates like 2025-00-00, 9999-00-99, 2054-12-34, and 2057-09-31, which are all really stupid in their own ways.

At a minimum, we need to perform a sanity check on dates and, if they're invalid replace them with null values. Better, try to adjust them to something rational. For instance, if the year is more than X years in the future, replace the date with a null value (thus eliminate dates in the year 9999). Or if the year has a valid month but an invalid day, then replace the day with some default value (e.g., 01). Ditto for the month.

I hope there's some kind of a Python library that solves this.

softwaredoug commented 9 years ago

@waldoj just from playing around strptime throws exceptions when the date format doesn't match something sane

from datetime import datetime
weirdDate = "2010-13-13"
datetime.strptime(weirdDate, "%Y-%m-%d")

throws:

ValueError: time data '2010-13-13' does not match format '%Y-%m-%d'

So wrapping that in a function:

def isValid(dateStr):
    from datetime import datetime
    try:
        datetime.strptime(dateStr, "%Y-%m-%d")
    except ValueError as e:
        return False
    return True

I've tried it out with your dates, it seems to work well. Also works on valid dates. Don't know if that entirely solves your problem.

>>> isValid("2025-00-00")
False
>>> isValid("2013-2-2")
True
waldoj commented 9 years ago

Well, it's a start. :) Being able to detect invalid dates is a fine start. It's the correction of invalid dates that I'm particularly looking to accomplish here, but one step at a time! :) Thanks, @softwaredoug.

waldoj commented 9 years ago

Here's the distribution of stupid dates:

number date
1 2020-06-31
4 2025-00-00
1 2032-21-31
1 2034-06-31
1 2037-01-00
1 2037-16-54
1 2046-02-29
1 2047-04-31
1 2049-00-00
1 2050-02-29
1 2050-18-01
1 2053-00-00
1 2054-12-00
1 2054-12-34
1 2055-12-32
1 2057-09-31
1 2075-13-21
5354 9900-00-00
4 9999-00-99

So, that proved to be enlightening. There are basically 5,354 placeholders, and some noise that can be ignored.

slott56 commented 9 years ago

For conversion to "something rational", try this. It works for small values of rational.

import time
import datetime
import re

date_pat= re.compile(r"(?P<y>\d+)-(?P<m>\d+)-(?P<d>\d+)")
def convert_date(text):
    match= date_pat.match(text)
    y, m, d = map(int, (match.group('y'), match.group('m'), match.group('d')))
    try:
        dt= datetime.date(y, m, d)
    except ValueError as e:
        struct_t = (y, m, d, 0, 0, 0, 0, 0, 0)
        ts = time.mktime(struct_t)
        dt= datetime.datetime.fromtimestamp(ts).date()
    return dt

The time.mktime() fudges the invalid dates based on some kinds of possibly helpful assumptions.

You wind up with this kind of mapping from source to datetime.date object.

input output
2020-06-31 2020-07-01
2025-00-00 2024-11-30
2032-21-31 2033-10-01
2034-06-31 2034-07-01
2037-01-00 2036-12-31
2037-16-54 2038-05-24
2046-02-29 2046-03-01
2047-04-31 2047-05-01
2049-00-00 2048-11-30
2050-02-29 2050-03-01
2050-18-01 2051-06-01
2053-00-00 2052-11-30
2054-12-00 2054-11-30
2054-12-34 2055-01-03
2055-12-32 2056-01-01
2057-09-31 2057-10-01
2075-13-21 2076-01-21
9900-00-00 9899-11-30
9999-00-99 9999-03-09

The 9900-00-00 and 9999-00-99 dates should probably become some kind of N/A -- they look more like domain specific nulls than "errors".

The yyyy-00-dd should perhaps be special-cased, also, since the month of zero isn't really being handled well by mktime. Using m if m != 0 else 1 might be better. Hard to say without more context on what the dates are purported to mean.

Sidebar: I worked once with an app that produce 29 Feb, 30 Feb and 31 Feb frequently. Very frequently. It was a renewal date computed from your enrollment. If you enrolled on 31 Aug then your renewal was 31st of every subsequent month. The "date" wasn't really a date. It was a year, month, and day, but not a proper date and required some interpretation. And the software had IF statements all over the place to handle that nonsense.

slott56 commented 9 years ago

This is considerably more complex, but might produce "rational" dates for a much higher value of rational.

import warnings
import datetime
import re
import calendar
from functools import lru_cache

gregorian= calendar.Calendar()

@lru_cache(1200)
def last_day( year, month ):
    return max(gregorian.itermonthdays(year, month))

date_pat= re.compile(r"(?P<y>\d+)-(?P<m>\d+)-(?P<d>\d+)")
def convert_date(text):
    try:
        dt = datetime.datetime.strptime(text, "%Y-%m-%d").date()
        return dt
    except ValueError as e1:
        match= date_pat.match(text)
        if not match: 
            warnings.warn("Unparseable date: {0!r}".format(text))
            return None # Unparseable date case.
        y, m, d = map(int, (match.group('y'), match.group('m'), match.group('d')))
        if y >= 9900:
            return None # Domain null case.
        try:
            dt= datetime.date( y, m, d )
            return dt
        except ValueError as e2:
            pass
        # We know m and d don't make sense together.
        # We can coerce m into 1 .. 12 range. Which might be sufficient.
        m=  1 if m < 1 else m if m <= 12 else 12
        # m= max(1, min(m, 12))
        try:
            dt= datetime.date( y, m, d )
            return dt
        except ValueError as e3:
            pass
        # Sigh.
        # Coercing d is tougher because of the days per month issue.
        d_last= last_day(y, m)
        d= 1 if d < 1 else d if d <= d_last else d_last
        # d= max(1, min(d, d_last))
        try:
            dt= datetime.date( y, m, d )
            return dt
        except ValueError as e4:
            pass
        # Any other bright ideas?
        warnings.warn("Uncoerceable date: {0!r}".format(text))
        return None # Uncoerceable data case.

The comparison looks like this:

input output
2020-06-31 2020-06-30
2025-00-00 2025-01-01
2032-21-31 2032-12-31
2034-06-31 2034-06-30
2037-01-00 2037-01-01
2037-16-54 2037-12-31
2046-02-29 2046-02-28
2047-04-31 2047-04-30
2049-00-00 2049-01-01
2050-02-29 2050-02-28
2050-18-01 2050-12-01
2053-00-00 2053-01-01
2054-12-00 2054-12-01
2054-12-34 2054-12-31
2055-12-32 2055-12-31
2057-09-31 2057-09-30
2075-13-21 2075-12-21
9900-00-00 None
9999-00-99 None

These dates seem to be more usable. While this uses a lot of built-in Python machinery, it may not meet the threshold of being a "python library."

waldoj commented 9 years ago

@slott56, this is really clever. I'm so glad that you posted your solutions here, so that others can benefit from them, as I have no doubt that other folks will be seeking help for the same problem. I'm going to add this into Crump ASAP, and that's going to get ~5,400 records included that are currently being rejected (rightly) by Elasticsearch. Thanks so very much!

waldoj commented 9 years ago

@slott56, I just want to check: will you release your above code under the MIT License (or a compatible license) for inclusion in this software or elsewhere?

brianwc commented 9 years ago

I also need something like this, but don't like changing 2050-18-01 to 2050-12-01. I think another epicycle should be added where if m > 12 then we check to see if d <= 12, and if so, we assume m and d got mixed up and swap em.

slott56 commented 9 years ago

Permission given to release under any license that seems to make sense. Release away.

slott56 commented 9 years ago

Not sure I agree with swapping month and day. There's only seems to be one example of that in this dataset. It seems unlikely for Americans to make this mistake unless some international dealings are going on.

The randomness of the errors is hard to fathom. That one can't be a nearby character swap. It's right awkward to rationalize. The 00 entries seem to be a kind of "don't care" for which substituting 01 is sensible.

I'd be more willing to guess than 18 is really 08 entered on the keypad instead of the number row.

waldoj commented 9 years ago

Permission given to release under any license that seems to make sense.

Thank you!

The 00 entries seem to be a kind of "don't care" for which substituting 01 is sensible.

That's my guess, too—that 00 means either "don't know" or "don't care," perhaps an effort to say "I only know the month for this."

Note that there are also thousands of 9999-99-99 of dates not included in the above tally. I've long set those to None, since that's surely what it's supposed to mean, so I didn't have these figures in the data that was output. The distinction between 9900-00-00 and 9999-99-99 is lost on me. My best guess is that, at some point, some sysadmin noticed the 9999-99-99 dates and prohibited it, leading to a new practice among the data entry folks of typing in 9900-00-99.

slott56 commented 9 years ago

The other possibility is that these are domain-specific nulls with different semantics. One is a "not applicable". The other could be "exempt according some some legislation". Or maybe "not applicable due to some judicial decision". Or "no longer relevant due to some administrative policy change." The core problem with "NULL" is that the semantics are lost.