vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License
8.29k stars 589 forks

max() on datetime type returns 'NaT' #1340

Open heyuqi1970 opened 3 years ago

heyuqi1970 commented 3 years ago

When I call max() on a datetime column, it returns 'NaT'.

df.max(df.Invoice_Date)
Out[5]: array('NaT', dtype='datetime64[ns]')
df.minmax(df.Invoice_Date)
Out[6]: array([ 'NaT', '2017-01-04T00:00:26.230259712'], dtype='datetime64[ns]')

kmcentush commented 3 years ago

Can you create a reproducible test case that I could run?

heyuqi1970 commented 3 years ago

Hi @kmcentush

Please refer to the following:

import vaex as vx
import numpy as np
x = np.array(['2016-01-04T00:00:00.000000000', '2016-01-04T00:00:00.000000000','2016-01-04T00:00:00.000000000','2017-01-04T00:00:00.000000000', '2017-01-04T00:00:00.000000000','NaT'], dtype='datetime64[ns]')
df = vx.from_arrays(x=x)
df.x.max()
Out[6]: array('NaT', dtype='datetime64[ns]')
df.x.minmax()
Out[7]: 
array([                          'NaT', '2017-01-04T00:00:26.230259712'],
      dtype='datetime64[ns]')
df.x.min()
Out[8]: array('2016-01-04T00:00:00.000000000', dtype='datetime64[ns]')

Alon-Alexander commented 3 years ago

Hi @heyuqi1970, I tried your example on the plain NumPy array (just x, not df.x) and the behaviour is the same. @maartenbreddels should probably decide whether this behaviour should be kept or changed.

As a temporary workaround, you can filter the data down to the existing (non-NaT) values first and only then perform the max operation.
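
A minimal sketch of that workaround on the NumPy side, using the array from the example above. It assumes the filtering is done with `np.isnat` before the data is handed to vaex; the names `valid`, `latest` and `earliest` are only illustrative, and this is not necessarily how vaex itself should handle NaT.

```python
import numpy as np

# Same data as in the report: five valid timestamps and one NaT.
x = np.array(
    [
        '2016-01-04T00:00:00.000000000',
        '2016-01-04T00:00:00.000000000',
        '2016-01-04T00:00:00.000000000',
        '2017-01-04T00:00:00.000000000',
        '2017-01-04T00:00:00.000000000',
        'NaT',
    ],
    dtype='datetime64[ns]',
)

# Keep only the entries that are not NaT, then reduce.
valid = x[~np.isnat(x)]
latest = valid.max()
earliest = valid.min()

print(latest)
print(earliest)
```

The filtered array could then be passed to `vx.from_arrays(x=valid)` as in the original example, although that drops the NaT rows from the DataFrame, which may or may not be acceptable for your data.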