vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] Unable to manage timestamps 'easily' when deriving them from `int` stored in arrow file. #1704

Closed (yohplala closed this issue 3 years ago)

yohplala commented 3 years ago

Description

Timestamps seem to be quite troublesome. I checked the other issues related to timestamps.

None of them seem to report the behavior I am witnessing here.

The trouble I would like to report is that the datetime64 type is not preserved when reading integers from an arrow file and then converting them to timestamps. In this case, the resulting type is pyarrow.lib.TimestampScalar.

It is datetime64, however, when the same integers come from memory.

My next trouble is that pyarrow.lib.TimestampScalar does not appear to be easy to work with. I am forced to convert the values to strings before I can convert them back to datetime64. Rather awkward...

Here is the trouble illustrated.

import numpy as np
import vaex as vx

# Case ok: when timestamps are derived from int stored in memory.
ts = [1580515230897, 1627875788076, 1627875788076]
vdf = vx.from_arrays(ts = ts)
vdf['ts'] = vdf['ts'].astype('datetime64[ms]')

first_ts = vdf['ts'][:1].values[0]
first_ts
Out[1]: numpy.datetime64('2020-02-01T00:00:30.897')
type(first_ts)
Out[2]: numpy.datetime64
# it is ok: these timestamps can be compared to np.datetime64.

if np.datetime64('2021-01-01') > first_ts:
    print('earlier')
Out[3]: earlier
# no error message, good.

Now export the same integers to an arrow file, read them back (memory mapped) and convert them again.

import numpy as np
import vaex as vx

# Case not ok: when timestamps are derived from int stored in arrow file.
ts = [1580515230897, 1627875788076, 1627875788076]
vdf = vx.from_arrays(ts = ts)
file = '/home/yoh/Documents/code/data/vaex_ts.arrow'
vdf.export_arrow(file)
vdf = vx.open(file)
vdf['ts'] = vdf['ts'].astype('datetime64[ms]')

first_ts = vdf['ts'][:1].values[0]
type(first_ts)
Out[3]: pyarrow.lib.TimestampScalar
# pyarrow.lib.TimestampScalar?   oh my, what is this beast? Can it be compared to np.datetime64?

if np.datetime64('2021-01-01') > first_ts:
    print('earlier')
Traceback (most recent call last):

  File "<ipython-input-4-d8b2abdbd133>", line 1, in <module>
    if np.datetime64('2021-01-01') > first_ts:

TypeError: '>' not supported between instances of 'datetime.date' and 'pyarrow.lib.TimestampScalar'

# hm, let's try to convert 1st
import pandas as pd
pd.to_datetime(first_ts)
TypeError: <class 'pyarrow.lib.TimestampScalar'> is not convertible to datetime

# 2nd try
pd.to_datetime(first_ts.value)
Out[8]: Timestamp('1970-01-01 00:26:20.515230897')
# error, not the right conversion.... my timestamp is '2020-02-01 00:00:30.897000'

# 3rd try
pd.to_datetime(str(first_ts))
Out[9]: Timestamp('2020-02-01 00:00:30.897000')
# yes, that is it!
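The wrong date in the 2nd try looks like a unit mismatch rather than a conversion bug: the stored integer counts milliseconds since the epoch, while a bare integer passed to pd.to_datetime is read as nanoseconds by default. The following minimal sketch reproduces both readings with NumPy only. It assumes the first value of `ts` from the examples above; the names `as_ns` and `as_ms` are made up for illustration. With pandas, passing `unit='ms'` to `pd.to_datetime` should have the same effect.

```python
import numpy as np

# First value of `ts` from the examples above: an epoch offset in milliseconds.
ms = 1580515230897

# Read as nanoseconds (what happens to a bare integer by default in the 2nd try),
# the value lands a few minutes after the epoch.
as_ns = np.datetime64(ms, "ns")

# Read with its actual unit, it gives the expected timestamp.
as_ms = np.datetime64(ms, "ms")

print(as_ns)  # 1970-01-01T00:26:20.515230897
print(as_ms)  # 2020-02-01T00:00:30.897

# The comparison from the report now works without a string round trip.
print(np.datetime64("2021-01-01") > as_ms)  # True
```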

Please,

Thanks for your help. Bests,

Software information

JovanVeljanoski commented 3 years ago

Hi,

So there are a few points worth addressing here. In no particular order:

import vaex

ts = [1580515230897, 1627875788076, 1627875788076]
x = [1, 2, 3]
df = vaex.from_arrays(ts = ts, x=x)
df['ts'] = df['ts'].astype('datetime64[ms]')
df.export('tmp.arrow')
df = vaex.open('tmp.arrow')
df.x.values # This gets the "raw" data outside of vaex

I hope this clears up any confusion!

maartenbreddels commented 3 years ago

To add to what Jovan said, df.x.tolist() can also help, if you want to do some work in 'Python land'. Let us know if this answered your questions, if so, feel free to close this.

yohplala commented 3 years ago

Hi @maartenbreddels, hi @JovanVeljanoski, thanks a lot. Yes, this answers my questions. Bests,