On the other hand, it looks like this happens when trying to run a simple loop over a query of a model with a UTCDateTimeAttribute. So the issue appears to be in pynamodb, not the application code. Sorry for the flip-flopping.
Thanks, it does look like a bug. It does make sense to either call strptime when len(date_string) != 31, or perhaps pad with zeros?
length = len(date_string)
if 27 < length < 31:
    date_string = '0' * (31 - length) + date_string
@l0b0 Can you provide a sample of the data you're deserializing and perhaps where it's coming from?
btw, your created_at default appears to be stuck at load time:
created_at = UTCDateTimeAttribute(null=False, default=datetime.now(timezone.utc))
you probably meant to do:
created_at = UTCDateTimeAttribute(null=False, default=lambda: datetime.now(timezone.utc))
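To illustrate the difference, a minimal sketch (the Dataset model and its fields here are assumptions for illustration, not the project's actual code); the plain default is evaluated once when the class body runs, while the lambda is called for each new item:
from datetime import datetime, timezone
from pynamodb.attributes import UnicodeAttribute, UTCDateTimeAttribute
from pynamodb.models import Model
class Dataset(Model):  # hypothetical model, for illustration only
    class Meta:
        table_name = 'dataset'
    id = UnicodeAttribute(hash_key=True)
    # Evaluated once at class-definition (module load) time, so every item
    # would share the same timestamp:
    # created_at = UTCDateTimeAttribute(null=False, default=datetime.now(timezone.utc))
    # Evaluated per item, so each new Dataset gets a fresh timestamp:
    created_at = UTCDateTimeAttribute(null=False, default=lambda: datetime.now(timezone.utc))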
@ikonst I don't have specific data, because I'm generating random datetimes for tests in the Dataset class in datalake/backend/tests/utils.py. Basically the datetime values are any datetime, at second resolution, between approximately year 98 and 2000, so there's almost a 50% chance of each datetime being before year 1000. Since there are two datetimes in each Dataset, the chance of a failure in any given Dataset is ~75%.
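For reference, a rough sketch of what such a generator could look like (the real any_past_datetime lives in datalake/backend/tests/utils.py; the exact bounds and implementation here are assumptions):
import random
from datetime import datetime, timedelta, timezone
def any_past_datetime() -> datetime:
    # Hypothetical reconstruction: pick a random second between ~year 98 and year 2000.
    start = datetime(98, 1, 1, tzinfo=timezone.utc)
    end = datetime(2000, 1, 1, tzinfo=timezone.utc)
    offset = random.randint(0, int((end - start).total_seconds()))
    return start + timedelta(seconds=offset)
With that range, roughly 47% of generated datetimes fall before year 1000, so with two independent datetimes per Dataset the chance that at least one hits the bug is about 1 - 0.53^2 ≈ 72%, consistent with the ~75% estimate above.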
Yeah. It's just that I tried it now myself, and UTCDateTimeAttribute seems to encode pre-1000 years zero-padded to four digits, so it should encode and decode fine.
If you want to reproduce yourself, you could try with https://github.com/linz/geospatial-data-lake/pull/64, but it's not exactly simple:
Can you explain a little bit more what it does? Does it write to DynamoDB through some other means? Does it deal with existing data?
For padding, you could use value.zfill(31).
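For instance (assuming value is the serialized string), left-padding the whole string restores the four-digit year:
>>> '539-02-20T08:36:49.000000+0000'.zfill(31)
'0539-02-20T08:36:49.000000+0000'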
I got everything working by holding back pynamodb to version 4.3.3. I'll try upgrading pynamodb to 5.0.3 and report back.
Padding isn't the challenge. The problem is that I could not understand how the issue can be reproduced. I'd like to add a regression test.
From the above it looks like UTCDateTimeAttribute._fast_parse_utc_date_string("539-02-20T08:36:49.000000+0000")
would be a useful test - it currently throws an exception.
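A sketch of what such a regression test might look like (my own suggestion, not existing project code; it simply round-trips pre-1000 years through the attribute's public serialize/deserialize):
from datetime import datetime, timezone
import pytest
from pynamodb.attributes import UTCDateTimeAttribute
@pytest.mark.parametrize('year', [1, 98, 539, 999, 2021])
def test_utc_datetime_attribute_round_trip(year):
    # On affected versions/platforms, serialize may emit an unpadded year
    # that deserialize then rejects with ValueError.
    attr = UTCDateTimeAttribute()
    value = datetime(year, 2, 20, 8, 36, 49, tzinfo=timezone.utc)
    assert attr.deserialize(attr.serialize(value)) == value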
@l0b0 it's well understood why it fails to parse that string, but keep in mind it's just an internal method.
The rationale for that method not having to support 3-digit years is because we always serialize years as zero-padded 4 digits. We can "fix" that function to parse 3-digit years (it's not a difficult change to make), but it means we have a lurking bug somewhere else that causes 3-digit years to be written in the first place.
Do you have any idea what in our code or your code could be writing 3-digit years? Are you ever treating this same attribute as a UnicodeAttribute and persisting dates that you serialize "by hand"?
I'm going to have to see if I have time to look into this; it's a work project. I've worked around it by changing any_past_datetime
to return only datetimes back to year 1049. Unfortunately GitHub seems to throw away Actions output, so I can't even see the original test output as linked above.
The way UTCDateTimeAttribute serializes datetimes with years of fewer than 4 digits doesn't zero-pad them, as far as I can tell.
Example copy/pasting the logic:
In [78]: test_date
Out[78]: datetime.datetime(1, 1, 1, 0, 0, 0, 1, tzinfo=datetime.timezone.utc)
In [79]: test_date.astimezone(tzutc()).strftime('%Y-%m-%dT%H:%M:%S.%f%z')
Out[79]: '1-01-01T00:00:00.000001+0000'
If I set an item with a datetime attribute to a datetime with a < 4 digit year, it serializes (and saves) like this without error:
In [81]: obj.serialize()
Out[81]:
{'id': {'S': 'b1900ca1-896f-445a-98b6-f2eaf7be8e17'},
'timestamp': {'S': '1-01-01T00:00:00.000000+0000'},
'name': {'S': 'test'},
'version': {'N': '34'}}
In [82]: obj.save()
Out[82]: {'ConsumedCapacity': {'CapacityUnits': 1.0, 'TableName': 'test-table'}}
But then if I attempt to refresh the item:
In [83]: obj.refresh()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~/<redacted>/lib/python3.6/site-packages/pynamodb/attributes.py in _fast_parse_utc_date_string(date_string)
701 or date_string[19] != '.' or date_string[26:31] != '+0000'):
--> 702 raise ValueError("Datetime string '{}' does not match format '{}'".format(date_string, DATETIME_FORMAT))
703 return datetime(
ValueError: Datetime string '1-01-01T00:00:00.000000+0000' does not match format '%Y-%m-%dT%H:%M:%S.%f%z'
Python strftime documentation indicates %Y
doesn't include padding for years prior to 1000.
It looks like UTCDateTimeAttribute.serialize has not been padding zeros since April 2017, when Delorean was replaced.
In [8]: test_date
Out[8]: datetime.datetime(1, 1, 1, 0, 0, 0, 1, tzinfo=datetime.timezone.utc)
In [9]: Delorean(test_date, timezone="UTC").datetime.strftime("%Y-%m-%dT%H:%M:%S.%f%z")
Out[9]: '0001-01-01T00:00:00.000001+0000'
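If one wanted the old zero-padded output without Delorean, a possible workaround (just a sketch of the idea, not what pynamodb actually does) is to format the year separately, since %Y may not zero-pad small years on every platform:
from datetime import datetime, timezone
def format_padded(value: datetime) -> str:
    # Format the 4-digit year explicitly, then strftime the rest.
    return f'{value.year:04d}' + value.strftime('-%m-%dT%H:%M:%S.%f%z')
format_padded(datetime(1, 1, 1, 0, 0, 0, 1, tzinfo=timezone.utc))
# -> '0001-01-01T00:00:00.000001+0000'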
Yikes, you're right. This is apparently platform dependent too: https://bugs.python.org/issue13305
@l0b0 Apologies for my lack of faith. It took me getting a Docker image of a random Python version on a random Linux to witness this myself.
No worries @ikonst; I would be really sceptical too. It's kinda unbelievable that that bug has survived ten years already.
Python datetime supports dates back to year 1, but _fast_parse_utc_date_string forces datetimes to have four-digit years. This resulted in a ValueError on deserialization. _fast_parse_utc_date_string should fall back to a library call if it fails.
Edit: My bad, looks like datetime.datetime.strptime('539-02-20T08:36:49.000000+0000', '%Y-%m-%dT%H:%M:%S.%f%z') also fails, meaning isoformat() doesn't produce parseable datetimes. :/
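Since plain strptime also rejects unpadded years, a fall-back would have to restore the zero-padding first. A minimal sketch of that idea (my own illustration, not pynamodb's actual implementation):
from datetime import datetime
DATETIME_FORMAT = '%Y-%m-%dT%H:%M:%S.%f%z'
def parse_utc_date_string(date_string: str) -> datetime:
    try:
        return datetime.strptime(date_string, DATETIME_FORMAT)
    except ValueError:
        # Strings written with an unpadded year (e.g. '539-...') are shorter than
        # the expected 31 characters; left-pad with zeros and retry.
        return datetime.strptime(date_string.zfill(31), DATETIME_FORMAT)
parse_utc_date_string('539-02-20T08:36:49.000000+0000')
# -> datetime.datetime(539, 2, 20, 8, 36, 49, tzinfo=datetime.timezone.utc)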