Closed waldoj closed 9 years ago
@waldoj just from playing around strptime
throws exceptions when the date format doesn't match something sane
from datetime import datetime
weirdDate = "2010-13-13"
datetime.strptime(weirdDate, "%Y-%m-%d")
throws:
ValueError: time data '2010-13-13' does not match format '%Y-%m-%d'
So wrapping that in a function:
def isValid(dateStr):
from datetime import datetime
try:
datetime.strptime(dateStr, "%Y-%m-%d")
except ValueError as e:
return False
return True
I've tried it out with your dates, it seems to work well. Also works on valid dates. Don't know if that entirely solves your problem.
>>> isValid("2025-00-00") False >>> isValid("2013-2-2") True
Well, it's a start. :) Being able to detect invalid dates is a fine start. It's the correction of invalid dates that I'm particularly looking to accomplish here, but one step at a time! :) Thanks, @softwaredoug.
Here's the distribution of stupid dates:
number | date |
---|---|
1 | 2020-06-31 |
4 | 2025-00-00 |
1 | 2032-21-31 |
1 | 2034-06-31 |
1 | 2037-01-00 |
1 | 2037-16-54 |
1 | 2046-02-29 |
1 | 2047-04-31 |
1 | 2049-00-00 |
1 | 2050-02-29 |
1 | 2050-18-01 |
1 | 2053-00-00 |
1 | 2054-12-00 |
1 | 2054-12-34 |
1 | 2055-12-32 |
1 | 2057-09-31 |
1 | 2075-13-21 |
5354 | 9900-00-00 |
4 | 9999-00-99 |
So, that proved to be enlightening. There are basically 5,354 placeholders, and some noise that can be ignored.
For conversion to "something rational", try this. It works for small values of rational.
import time
import datetime
import re
date_pat= re.compile(r"(?P<y>\d+)-(?P<m>\d+)-(?P<d>\d+)")
def convert_date(text):
match= date_pat.match(text)
y, m, d = map(int, (match.group('y'), match.group('m'), match.group('d')))
try:
dt= datetime.date(y, m, d)
except ValueError as e:
struct_t = (y, m, d, 0, 0, 0, 0, 0, 0)
ts = time.mktime(struct_t)
dt= datetime.datetime.fromtimestamp(ts).date()
return dt
The time.mktime()
fudges the invalid dates based on some kinds of possibly helpful assumptions.
You wind up with this kind of mapping from source to datetime.date
object.
input | output |
---|---|
2020-06-31 | 2020-07-01 |
2025-00-00 | 2024-11-30 |
2032-21-31 | 2033-10-01 |
2034-06-31 | 2034-07-01 |
2037-01-00 | 2036-12-31 |
2037-16-54 | 2038-05-24 |
2046-02-29 | 2046-03-01 |
2047-04-31 | 2047-05-01 |
2049-00-00 | 2048-11-30 |
2050-02-29 | 2050-03-01 |
2050-18-01 | 2051-06-01 |
2053-00-00 | 2052-11-30 |
2054-12-00 | 2054-11-30 |
2054-12-34 | 2055-01-03 |
2055-12-32 | 2056-01-01 |
2057-09-31 | 2057-10-01 |
2075-13-21 | 2076-01-21 |
9900-00-00 | 9899-11-30 |
9999-00-99 | 9999-03-09 |
The 9900-00-00
and 9999-00-99
dates should probably become some kind of N/A -- they look more like domain specific nulls than "errors".
The yyyy-00-dd
should perhaps be special-cased, also, since the month of zero isn't really being handled well by mktime
. Using m if m != 0 else 1
might be better. Hard to say without more context on what the dates are purported to mean.
Sidebar: I worked once with an app that produce 29 Feb, 30 Feb and 31 Feb frequently. Very frequently. It was a renewal date computed from your enrollment. If you enrolled on 31 Aug then your renewal was 31st of every subsequent month. The "date" wasn't really a date. It was a year, month, and day, but not a proper date and required some interpretation. And the software had IF statements all over the place to handle that nonsense.
This is considerably more complex, but might produce "rational" dates for a much higher value of rational.
import warnings
import datetime
import re
import calendar
from functools import lru_cache
gregorian= calendar.Calendar()
@lru_cache(1200)
def last_day( year, month ):
return max(gregorian.itermonthdays(year, month))
date_pat= re.compile(r"(?P<y>\d+)-(?P<m>\d+)-(?P<d>\d+)")
def convert_date(text):
try:
dt = datetime.datetime.strptime(text, "%Y-%m-%d").date()
return dt
except ValueError as e1:
match= date_pat.match(text)
if not match:
warnings.warn("Unparseable date: {0!r}".format(text))
return None # Unparseable date case.
y, m, d = map(int, (match.group('y'), match.group('m'), match.group('d')))
if y >= 9900:
return None # Domain null case.
try:
dt= datetime.date( y, m, d )
return dt
except ValueError as e2:
pass
# We know m and d don't make sense together.
# We can coerce m into 1 .. 12 range. Which might be sufficient.
m= 1 if m < 1 else m if m <= 12 else 12
# m= max(1, min(m, 12))
try:
dt= datetime.date( y, m, d )
return dt
except ValueError as e3:
pass
# Sigh.
# Coercing d is tougher because of the days per month issue.
d_last= last_day(y, m)
d= 1 if d < 1 else d if d <= d_last else d_last
# d= max(1, min(d, d_last))
try:
dt= datetime.date( y, m, d )
return dt
except ValueError as e4:
pass
# Any other bright ideas?
warnings.warn("Uncoerceable date: {0!r}".format(text))
return None # Uncoerceable data case.
The comparison looks like this:
input | output |
---|---|
2020-06-31 | 2020-06-30 |
2025-00-00 | 2025-01-01 |
2032-21-31 | 2032-12-31 |
2034-06-31 | 2034-06-30 |
2037-01-00 | 2037-01-01 |
2037-16-54 | 2037-12-31 |
2046-02-29 | 2046-02-28 |
2047-04-31 | 2047-04-30 |
2049-00-00 | 2049-01-01 |
2050-02-29 | 2050-02-28 |
2050-18-01 | 2050-12-01 |
2053-00-00 | 2053-01-01 |
2054-12-00 | 2054-12-01 |
2054-12-34 | 2054-12-31 |
2055-12-32 | 2055-12-31 |
2057-09-31 | 2057-09-30 |
2075-13-21 | 2075-12-21 |
9900-00-00 | None |
9999-00-99 | None |
These dates seem to be more usable. While this uses a lot of built-in Python machinery, it may not meet the threshold of being a "python library."
@slott56, this is really clever. I'm so glad that you posted your solutions here, so that others can benefit from them, as I have no doubt that other folks will be seeking help for the same problem. I'm going to add this into Crump ASAP, and that's going to get ~5,400 records included that are currently being rejected (rightly) by Elasticsearch. Thanks so very much!
@slott56, I just want to check: will you release your above code under the MIT License (or a compatible license) for inclusion in this software or elsewhere?
I also need something like this, but don't like changing 2050-18-01 to 2050-12-01. I think another epicycle should be added where if m > 12 then we check to see if d <= 12, and if so, we assume m and d got mixed up and swap em.
Permission given to release under any license that seems to make sense. Release away.
Not sure I agree with swapping month and day. There's only seems to be one example of that in this dataset. It seems unlikely for Americans to make this mistake unless some international dealings are going on.
The randomness of the errors is hard to fathom. That one can't be a nearby character swap. It's right awkward to rationalize. The 00 entries seem to be a kind of "don't care" for which substituting 01 is sensible.
I'd be more willing to guess than 18 is really 08 entered on the keypad instead of the number row.
Permission given to release under any license that seems to make sense.
Thank you!
The 00 entries seem to be a kind of "don't care" for which substituting 01 is sensible.
That's my guess, too—that 00
means either "don't know" or "don't care," perhaps an effort to say "I only know the month for this."
Note that there are also thousands of 9999-99-99
of dates not included in the above tally. I've long set those to None
, since that's surely what it's supposed to mean, so I didn't have these figures in the data that was output. The distinction between 9900-00-00
and 9999-99-99
is lost on me. My best guess is that, at some point, some sysadmin noticed the 9999-99-99
dates and prohibited it, leading to a new practice among the data entry folks of typing in 9900-00-99
.
The other possibility is that these are domain-specific nulls with different semantics. One is a "not applicable". The other could be "exempt according some some legislation". Or maybe "not applicable due to some judicial decision". Or "no longer relevant due to some administrative policy change." The core problem with "NULL" is that the semantics are lost.
There are a lot of expiry dates that are invalid dates, which causes Elasticsearch to reject the entire record. In the field
expiration_date
we're seeing dates like2025-00-00
,9999-00-99
,2054-12-34
, and2057-09-31
, which are all really stupid in their own ways.At a minimum, we need to perform a sanity check on dates and, if they're invalid replace them with null values. Better, try to adjust them to something rational. For instance, if the year is more than X years in the future, replace the date with a null value (thus eliminate dates in the year 9999). Or if the year has a valid month but an invalid day, then replace the day with some default value (e.g.,
01
). Ditto for the month.I hope there's some kind of a Python library that solves this.