man-group / ArcticDB

ArcticDB is a high performance, serverless DataFrame database built for the Python Data Science ecosystem.
http://arcticdb.io
Other
1.52k stars 93 forks source link

Document that as_of interprets argument as UTC and document how point-in-time works #483

Open jamesmunro opened 1 year ago

jamesmunro commented 1 year ago

Describe the bug

Everywhere as_of can be used we should document what format of timestamp it supports and that it assumes it to be UTC (rather than the local time). Also, what happens for timestamps before the first write and between first and second writes in example below?

Related documentation: https://docs.arcticdb.io/api/library#arcticdb.version_store.library.Library.read

Steps/Code to Reproduce

from time import sleep
from datetime import datetime
import pandas as pd
from arcticdb import Arctic

arctic = Arctic('lmdb://test')
try:
    arctic.create_library('test')
except:
    pass
lib = arctic.get_library('test')

t1 = datetime.now()
sleep(1)

df1 = pd.DataFrame({'a': 1}, index=[t1])
lib.write('data', df1)

sleep(1)
t2 = datetime.now()
sleep(1)

df2 = pd.DataFrame({'b': 2}, index=[t2])
lib.write('data', df2)

sleep(1)
t3 = datetime.now()
sleep(1)

print('t1', lib.read('data', as_of=t1).data)
print('t2', lib.read('data', as_of=t2).data)
print('t3', lib.read('data', as_of=t3).data)

Expected Results

I expect:

t1                            NoDataFoundException
t2                             a
2023-06-13 15:10:23.777745  1
t3                             b
2023-06-13 15:10:25.803166  2

Instead I get:

t1                             b
2023-06-13 15:10:25.803166  2
t2                             b
2023-06-13 15:10:25.803166  2
t3                             b
2023-06-13 15:10:25.803166  2

Whilst if I replace with 'utcnow' I get:

t1                            NoDataFoundException
t2                            NoDataFoundException
t3                             b
2023-06-13 14:18:46.728255  2
  1. Why does it interpret a local/naïve datetime as UTC?
  2. Why does it not find an item for t2 when I use utcnow?
  3. How can I find out how to use as_of?

OS, Python Version and ArcticDB Version

Python: 3.10.10 | packaged by conda-forge | (main, Mar 24 2023, 20:08:06) [GCC 11.3.0] OS: Linux-4.4.0-19041-Microsoft-x86_64-with-glibc2.31 ArcticDB: 1.2.1

Backend storage used

No response

Additional Context

No response

jamesmunro commented 1 year ago

According to Python docs, the preferred way to interpret naive datetimes is as local time, and 'UTC' times should be passed with 'UTC' timezone. https://docs.python.org/3/library/datetime.html#datetime.datetime.utcnow

mehertz commented 1 year ago

Hi @jamesmunro

This is because each write deletes the prior versions. To change this, set prune_previous_versions=False on the write.

I'm sympathetic to making this the default if you feel strongly about it. Though this would be a breaking issue for existing users which will need to be taken into account.

jamesmunro commented 1 year ago

Thanks that does fix the case for when using utcnow.

I think the other questions are still valid.

  1. Why does it interpret a local/naïve datetime as UTC?
  2. Why does it not find an item for t2 when I use utcnow?
  3. How can I find out how to use as_of?
mehertz commented 1 year ago

Why does it interpret a local/naïve datetime as UTC?

How can I find out how to use as_of?

Okay, those are definitely valid issues! Will keep the ticket open then.

To be clear, are you suggesting that we should change the behaviour such that the read method, if a naive datetime is given, it is assumed to be in the users local timezone and localised to utc? In the case of your snippet, that would mean utcnow isn't required.

jamesmunro commented 1 year ago

To be clear, are you suggesting that we should change the behaviour such that the read method, if a naive datetime is given, it is assumed to be in the users local timezone and localised to utc? In the case of your snippet, that would mean utcnow isn't required.

Yes. In my view that would have been more obvious. Others may see differently. I expect the versioning implementation should still be done in UTC, but it's the interpretation of the argument I'm questioning.