[BUG-REPORT] Unable to manage timestamps 'easily' when deriving them from `int` stored in arrow file.

Description

Timestamps seem to be quite troublesome. I checked the other issues related to timestamps.

Two report trouble reading from or writing to parquet (#646, #587).
One is about an incorrect conversion to/from a pyarrow table (#888).

None of them seem to report the behavior I am witnessing here.

The trouble I would like to report is that the datetime64 type is not preserved when reading ints from arrow files and then converting them to timestamps. In this case, the type is pyarrow.lib.TimestampScalar.

But it is datetime64 when these ints come from memory.

My next trouble is that pyarrow.lib.TimestampScalar does not appear that easy to manage. I am forced to convert them to strings to be able to then convert them back to datetime64. Rather awkward...

Here is the trouble illustrated.

import numpy as np
import vaex as vx

# Case ok: when timestamps are derived from int stored in memory.
ts = [1580515230897, 1627875788076, 1627875788076]
vdf = vx.from_arrays(ts = ts)
vdf['ts'] = vdf['ts'].astype('datetime64[ms]')

first_ts = vdf['ts'][:1].values[0]
first_ts
Out[1]: numpy.datetime64('2020-02-01T00:00:30.897')
type(first_ts)
Out[2]: numpy.datetime64
# it is ok: these timestamps can be compared to np.datetime64.

if np.datetime64('2021-01-01') > first_ts:
    print('earlier')
Out[3]: earlier
# no error message, good.

Now exporting the same ints to arrow, reading them back (memory mapping) and converting again.

import numpy as np
import vaex as vx

# Case not ok: when timestamps are derived from int stored in arrow file.
ts = [1580515230897, 1627875788076, 1627875788076]
vdf = vx.from_arrays(ts = ts)
file = '/home/yoh/Documents/code/data/vaex_ts.arrow'
vdf.export_arrow(file)
vdf = vx.open(file)
vdf['ts'] = vdf['ts'].astype('datetime64[ms]')

first_ts = vdf['ts'][:1].values[0]
type(first_ts)
Out[3]: pyarrow.lib.TimestampScalar
# pyarrow.lib.TimestampScalar?   oh my, what is this beast? Can it be compared to np.datetime64?

if np.datetime64('2021-01-01') > first_ts:
    print('earlier')
Traceback (most recent call last):

  File "<ipython-input-4-d8b2abdbd133>", line 1, in <module>
    if np.datetime64('2021-01-01') > first_ts:

TypeError: '>' not supported between instances of 'datetime.date' and 'pyarrow.lib.TimestampScalar'

# hm, let's try to convert 1st
import pandas as pd
pd.to_datetime(first_ts)
TypeError: <class 'pyarrow.lib.TimestampScalar'> is not convertible to datetime

# 2nd try
pd.to_datetime(first_ts.value)
Out[8]: Timestamp('1970-01-01 00:26:20.515230897')
# error, not the right conversion.... my timestamp is '2020-02-01 00:00:30.897000'

# 3rd try
pd.to_datetime(str(first_ts))
Out[9]: Timestamp('2020-02-01 00:00:30.897000')
# yes, that is it!

Please,

Is this behavior normal? (The timestamp type differs depending on whether the ints come from an arrow file or from a list.)
What would be the right way of 'converting' these pyarrow.lib.TimestampScalar so as to be able to compare them to numpy.datetime64?

Thanks for your help. Bests,

Software information

Vaex version: {'vaex': '4.5.0', 'vaex-core': '4.5.1', 'vaex-viz': '0.5.0', 'vaex-hdf5': '0.10.0', 'vaex-server': '0.6.1', 'vaex-astro': '0.9.0', 'vaex-jupyter': '0.6.0', 'vaex-ml': '0.14.0'}
Vaex was installed via: conda-forge
OS: Ubuntu 20.04
pyarrow: '6.0.0'

Hi,

So there are a few points worth addressing here. In no particular order:

First of all - there is nothing wrong on the vaex end of things. Everything works as expected.
In your example, you are taking data outside of vaex (this is what .values does). Once you do that... well, it is out of our hands, and what you do with it is on you. In this case you are comparing numpy and pyarrow types, and I guess they do not play well with each other. But that is up to numpy and arrow to handle, not vaex.
If you do the comparison within vaex, for example something like df.ts > np.datetime64('2021-01-01'), this will work no matter if the underlying data is in numpy or arrow format. This is within vaex, and vaex will handle it.
Now, what is going wrong for you: you create a dataframe in memory, and you use numpy to do it. Thus all the data is in numpy format. Then you decide to save the data into an .arrow file, and that implies the arrow format as well (see the next point on this). So when you read that arrow file, all of the data (the raw data) will be in arrow format. You can check this via this simple example:

import vaex

ts = [1580515230897, 1627875788076, 1627875788076]
x = [1, 2, 3]
df = vaex.from_arrays(ts = ts, x=x)
df['ts'] = df['ts'].astype('datetime64[ms]')
df.export('tmp.arrow')
df = vaex.open('tmp.arrow')
df.x.values # This gets the "raw" data outside of vaex

Vaex supports two backends for how data is handled behind the scenes: one is numpy, the other is arrow. Within vaex, you should not feel the difference, or even need to know anything about it. An int is an int, a float is a float, etc., meaning you do not need to know or care if the raw data is in numpy or arrow format. Once the data leaves vaex... then it is on you!
Some subtleties: there is no real string support in numpy, so all strings are handled by arrow internally in vaex. For datetime: while arrow has some support for it, all datetime operations that vaex supports are done by numpy. This is an area I'd like vaex to improve in the future, if an opportunity arises.
It might be a good idea for you (or anyone that reads this) to get a passing familiarity with pyarrow, at least so that you are not surprised if you see anything "strange", i.e. not numpy. Arrow is not just a file format, it is much more than that.
So df.x.values gets you the raw data. If you want to enforce that the data extracted from vaex is in, say, numpy, you should do df.x.to_numpy(). There is also .to_arrow() if you want to get data in arrow format.

I hope this clears up any confusion!

To add to what Jovan said, df.x.tolist() can also help, if you want to do some work in 'Python land'. Let us know if this answered your questions, if so, feel free to close this.
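
A side note on the "2nd try" in the report, where pd.to_datetime(first_ts.value) landed in 1970: the integer held by the scalar counts milliseconds, while pandas reads a bare integer as nanoseconds unless told otherwise, so passing unit='ms' to pd.to_datetime should give the expected date. The sketch below reproduces both readings with numpy only; raw_value is a hypothetical name for the integer that first_ts.value returned.

```python
import numpy as np

# Hypothetical name: the integer returned by first_ts.value in the report.
raw_value = 1580515230897

# Read as nanoseconds since the epoch (what happens when no unit is given).
as_ns = np.datetime64(raw_value, "ns")
print(as_ns)   # 1970-01-01T00:26:20.515230897

# Read as milliseconds since the epoch (the unit the column was cast to).
as_ms = np.datetime64(raw_value, "ms")
print(as_ms)   # 2020-02-01T00:00:30.897

# With the right unit, the comparison from the report works without a string round-trip.
print(np.datetime64("2021-01-01") > as_ms)  # True
```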

Hi @maartenbreddels, hi @JovanVeljanoski, thanks a lot, yes this answers my questions. Thanks a lot! Bests,

vaexio / vaex

[BUG-REPORT] Unable to manage timestamps 'easily' when deriving them from `int` stored in arrow file. #1704