pynamodb / PynamoDB

A pythonic interface to Amazon's DynamoDB
http://pynamodb.readthedocs.io
MIT License

_fast_parse_utc_date_string fails on pre year 1000 #899

Closed l0b0 closed 3 years ago

l0b0 commented 3 years ago

Python datetime supports years back to 1 (datetime.MINYEAR), but _fast_parse_utc_date_string requires four-digit years. This resulted in the following error:

ValueError: Datetime string '539-02-20T08:36:49.000000+0000' does not match format '%Y-%m-%dT%H:%M:%S.%f%z'

_fast_parse_utc_date_string should fall back to a library call if it fails.

Edit:

My bad: it looks like datetime.datetime.strptime('539-02-20T08:36:49.000000+0000', '%Y-%m-%dT%H:%M:%S.%f%z') also fails, meaning the serialized string doesn't round-trip through strptime. :/
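
The parse failure is reproducible with the standard library alone: strptime's %Y directive matches exactly four digits, so a three-digit year never matches, regardless of how the string was produced.

```python
from datetime import datetime

# strptime's %Y matches exactly four digits, so a three-digit
# year is rejected outright.
try:
    datetime.strptime('539-02-20T08:36:49.000000+0000',
                      '%Y-%m-%dT%H:%M:%S.%f%z')
except ValueError as exc:
    print(exc)

# The zero-padded form parses fine:
print(datetime.strptime('0539-02-20T08:36:49.000000+0000',
                        '%Y-%m-%dT%H:%M:%S.%f%z').year)  # 539
```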

l0b0 commented 3 years ago

On the other hand, it looks like this happens when trying to run a simple loop over a query of a model with a UTCDateTimeAttribute. So it looks like the issue is with pynamodb, not the application code. Sorry for the flip-flopping.

ikonst commented 3 years ago

Thanks, it does look like a bug. It would make sense to either fall back to strptime when len(date_string) != 31, or perhaps pad with zeros:

length = len(date_string)
if 27 < length < 31:
  date_string = '0' * (31 - length) + date_string
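
A runnable sketch of that padding idea (illustrative names, not pynamodb's actual code; the index checks mirror the fast parser shown in the traceback later in this thread):

```python
from datetime import datetime, timezone

DATETIME_FORMAT = '%Y-%m-%dT%H:%M:%S.%f%z'

def parse_utc_date_string(date_string: str) -> datetime:
    # Illustrative sketch: left-pad short (pre-year-1000) strings to
    # the canonical 31-character width before the fixed-offset parse.
    date_string = date_string.zfill(31)
    if (len(date_string) != 31 or date_string[10] != 'T'
            or date_string[19] != '.' or date_string[26:31] != '+0000'):
        raise ValueError("Datetime string '{}' does not match format '{}'"
                         .format(date_string, DATETIME_FORMAT))
    return datetime(
        int(date_string[0:4]), int(date_string[5:7]), int(date_string[8:10]),
        int(date_string[11:13]), int(date_string[14:16]), int(date_string[17:19]),
        int(date_string[20:26]), timezone.utc,
    )

print(parse_utc_date_string('539-02-20T08:36:49.000000+0000'))
```
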
ikonst commented 3 years ago

@l0b0 Can you provide a sample of the data you're deserializing and perhaps where it's coming from?

btw, your created_at appears to be stuck at load time:

created_at = UTCDateTimeAttribute(null=False, default=datetime.now(timezone.utc))

you probably meant to do:

created_at = UTCDateTimeAttribute(null=False, default=lambda: datetime.now(timezone.utc))
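
The difference is when the expression gets evaluated; a small standalone illustration of the same pitfall (not using pynamodb):

```python
from datetime import datetime, timezone
import time

# A plain value is computed once, at definition time, and then reused
# for every item; a callable is re-evaluated each time it is invoked.
frozen = datetime.now(timezone.utc)
fresh = lambda: datetime.now(timezone.utc)

time.sleep(0.01)
assert fresh() > frozen  # later calls produce later timestamps
```
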
l0b0 commented 3 years ago

@ikonst I don't have specific data, because I'm generating random datetimes for tests in the Dataset class in datalake/backend/tests/utils.py. Basically the datetime values are any datetime, at second resolution, between approximately year 98 and 2000. So there's almost 50% chance of each datetime being before year 1000. Since there are two datetimes in each Dataset the chance of a failure in any given Dataset is ~75%.

ikonst commented 3 years ago

Yeah. It's just that I tried it now myself and UTCDateTimeAttribute seems to encode pre-1000 years as YYYY padded with zeroes, so it should encode and decode well.

l0b0 commented 3 years ago

If you want to reproduce yourself, you could try with https://github.com/linz/geospatial-data-lake/pull/64, but it's not exactly simple:

  1. Check out that branch
  2. Build and deploy the AWS resources (see README.md)
  3. Run the tests (see .github/workflows/test.yml). Each run has at least a 75% chance of failing, but you might have to try with more than one random seed.
ikonst commented 3 years ago

Can you explain a little bit more what it does? Does it write to DynamoDB through some other means? Does it deal with existing data?

l0b0 commented 3 years ago

For padding, you could use value.zfill(31).
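
str.zfill left-pads with zeros to the requested width, so it restores the canonical 31-character form and leaves already-valid strings untouched:

```python
value = '539-02-20T08:36:49.000000+0000'  # 30 characters, 3-digit year
print(value.zfill(31))  # 0539-02-20T08:36:49.000000+0000
```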

l0b0 commented 3 years ago

I got everything working by holding back pynamodb to version 4.3.3. I'll try upgrading pynamodb to 5.0.3 and report back.

ikonst commented 3 years ago

Padding isn't the challenge. The problem is that I could not understand how the issue can be reproduced. I'd like to add a regression test.

l0b0 commented 3 years ago

From the above it looks like UTCDateTimeAttribute._fast_parse_utc_date_string("539-02-20T08:36:49.000000+0000") would be a useful test - it currently throws an exception.

ikonst commented 3 years ago

@l0b0 it's well understood why it fails to parse that string, but keep in mind it's just an internal method.

The rationale for that method not supporting 3-digit years is that we always serialize years as zero-padded 4 digits. We can "fix" that function to parse 3-digit years (it's not a difficult change), but it would mean we have a lurking bug somewhere else that causes 3-digit years to be written in the first place.

Do you have any idea what in our code or your code could be writing 3-digit years? Are you ever treating this same attribute as a UnicodeAttribute and persisting dates that you serialize "by hand"?

l0b0 commented 3 years ago

I'm going to have to see if I have time to look into this; it's a work project. I've worked around it by changing any_past_datetime to return only datetimes back to year 1049. Unfortunately GitHub seems to throw away Actions output, so I can't even see the original test output as linked above.

stoppablemurph commented 3 years ago

The way UTCDateTimeAttribute serializes datetimes with < 4-digit years doesn't pad with zeros, as far as I can tell.

Example copy/pasting the logic:

In [78]: test_date
Out[78]: datetime.datetime(1, 1, 1, 0, 0, 0, 1, tzinfo=datetime.timezone.utc)

In [79]: test_date.astimezone(tzutc()).strftime('%Y-%m-%dT%H:%M:%S.%f%z')                                                                                                                   
Out[79]: '1-01-01T00:00:00.000001+0000'

If I set an item with a datetime attribute to a datetime with a < 4 digit year, it serializes (and saves) like this without error:

In [81]: obj.serialize()                                                                                                                                                                    
Out[81]: 
{'id': {'S': 'b1900ca1-896f-445a-98b6-f2eaf7be8e17'},
 'timestamp': {'S': '1-01-01T00:00:00.000000+0000'},
 'name': {'S': 'test'},
 'version': {'N': '34'}}

In [82]: obj.save()                                                                                                                                                                         
Out[82]: {'ConsumedCapacity': {'CapacityUnits': 1.0, 'TableName': 'test-table'}}

But then if I attempt to refresh the item:

In [83]: obj.refresh()                                                                                                                                                                      
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/<redacted>/lib/python3.6/site-packages/pynamodb/attributes.py in _fast_parse_utc_date_string(date_string)
    701                     or date_string[19] != '.' or date_string[26:31] != '+0000'):
--> 702                 raise ValueError("Datetime string '{}' does not match format '{}'".format(date_string, DATETIME_FORMAT))
    703             return datetime(

ValueError: Datetime string '1-01-01T00:00:00.000000+0000' does not match format '%Y-%m-%dT%H:%M:%S.%f%z'

The Python strftime documentation indicates that %Y output is platform-dependent and may not be zero-padded for years prior to 1000.

It looks like UTCDateTimeAttribute.serialize hasn't been padding zeros since April 2017, when Delorean was replaced:

In [8]: test_date
Out[8]: datetime.datetime(1, 1, 1, 0, 0, 0, 1, tzinfo=datetime.timezone.utc)

In [9]: Delorean(test_date, timezone="UTC").datetime.strftime("%Y-%m-%dT%H:%M:%S.%f%z")
Out[9]: '0001-01-01T00:00:00.000001+0000'
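
One way to sidestep the platform-dependent %Y behaviour without Delorean is to format each field explicitly instead of relying on strftime; a hedged sketch (serialize_utc is an illustrative name, not pynamodb's actual fix):

```python
from datetime import datetime, timezone

def serialize_utc(dt: datetime) -> str:
    # Format every field by hand so the year is always zero-padded to
    # four digits, regardless of the platform's strftime behaviour.
    dt = dt.astimezone(timezone.utc)
    return '{:04d}-{:02d}-{:02d}T{:02d}:{:02d}:{:02d}.{:06d}+0000'.format(
        dt.year, dt.month, dt.day, dt.hour, dt.minute, dt.second,
        dt.microsecond)

print(serialize_utc(datetime(1, 1, 1, 0, 0, 0, 1, tzinfo=timezone.utc)))
# 0001-01-01T00:00:00.000001+0000
```
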
ikonst commented 3 years ago

Yikes, you're right. This is apparently platform dependent too: https://bugs.python.org/issue13305

ikonst commented 3 years ago

@l0b0 Apologies for my lack of faith. It took getting a Docker image of a random Python version on a random Linux to witness this myself.

l0b0 commented 3 years ago

No worries @ikonst; I would be really sceptical too. It's kind of unbelievable that this bug has survived ten years already.