Description
There appears to be an issue reloading a dataframe saved in arrow/feather format after converting from a pandas dataframe containing a datetime64 column/index. The problem does not occur when saving as hdf5. The general workflow to reproduce the error is:
1. Create a pandas dataframe with a datetime64 column/index
2. Convert using `from_pandas`
3. Export using either `export_feather` or `export_arrow`
4. Use `open` on the file and perform any operation on the datetime64 column
A short script to reproduce the error is:
import pandas as pd
import vaex as vx
df = pd.DataFrame({'float': [1.0],
                   'int': [1],
                   'datetime': [pd.Timestamp('20180310')],
                   'string': ['foo']})
new_df = vx.from_pandas(df)
new_df.export_feather('test.arrow')
new_df.export_hdf5('test.hdf5')
hdf5_df = vx.open('test.hdf5')
feather_df = vx.open('test.arrow')
hdf5_df.datetime.max() # this works fine
feather_df.datetime.max() # this is where the error is thrown
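Comparing the column types of the two reloaded dataframes should show where they diverge. A quick check (this assumes the .dtype property on vaex expressions; the expected values in the comments are my reading of the traceback below, not verified output):
hdf5_df = vx.open('test.hdf5')
feather_df = vx.open('test.arrow')
print(hdf5_df.datetime.dtype)     # expected: numpy datetime64[ns]
print(feather_df.datetime.dtype)  # expected: Arrow timestamp[ns], which numpy cannot parse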
The stack trace is:
Traceback (most recent call last):
  File "/mnt/c/Users/user/Documents/Python Scripts/finance/vaex_text.py", line 29, in <module>
    print(feather_df.datetime.max())
  File "/home/user/anaconda3/envs/py39/lib/python3.9/site-packages/vaex/expression.py", line 677, in max
    return self.ds.max(**kwargs)
  File "/home/user/anaconda3/envs/py39/lib/python3.9/site-packages/vaex/dataframe.py", line 1362, in max
    return self._compute_agg('max', expression, binby, limits, shape, selection, delay, edges, progress, array_type=array_type)
  File "/home/user/anaconda3/envs/py39/lib/python3.9/site-packages/vaex/dataframe.py", line 773, in _compute_agg
    return self._delay(delay, var)
  File "/home/user/anaconda3/envs/py39/lib/python3.9/site-packages/vaex/dataframe.py", line 1537, in _delay
    self.execute()
  File "/home/user/anaconda3/envs/py39/lib/python3.9/site-packages/vaex/dataframe.py", line 375, in execute
    just_run(self.execute_async())
  File "/home/user/anaconda3/envs/py39/lib/python3.9/site-packages/vaex/asyncio.py", line 35, in just_run
    return loop.run_until_complete(coro)
  File "/home/user/anaconda3/envs/py39/lib/python3.9/site-packages/nest_asyncio.py", line 70, in run_until_complete
    return f.result()
  File "/home/user/anaconda3/envs/py39/lib/python3.9/asyncio/futures.py", line 201, in result
    raise self._exception
  File "/home/user/anaconda3/envs/py39/lib/python3.9/asyncio/tasks.py", line 256, in __step
    result = coro.send(None)
  File "/home/user/anaconda3/envs/py39/lib/python3.9/site-packages/vaex/dataframe.py", line 380, in execute_async
    await self.executor.execute_async()
  File "/home/user/anaconda3/envs/py39/lib/python3.9/site-packages/vaex/execution.py", line 176, in execute_async
    task._parts = [encoding.decode('task-part-cpu', spec, df=run.df) for i in range(self.thread_pool.nthreads)]
  File "/home/user/anaconda3/envs/py39/lib/python3.9/site-packages/vaex/execution.py", line 176, in <listcomp>
    task._parts = [encoding.decode('task-part-cpu', spec, df=run.df) for i in range(self.thread_pool.nthreads)]
  File "/home/user/anaconda3/envs/py39/lib/python3.9/site-packages/vaex/encoding.py", line 449, in decode
    decoded = self.registry[typename].decode(self, value, **kwargs)
  File "/home/user/anaconda3/envs/py39/lib/python3.9/site-packages/vaex/cpu.py", line 38, in decode
    return cls.decode(encoding, spec, df)
  File "/home/user/anaconda3/envs/py39/lib/python3.9/site-packages/vaex/cpu.py", line 551, in decode
    dtypes = encoding.decode_dict('dtype', spec['dtypes'])
  File "/home/user/anaconda3/envs/py39/lib/python3.9/site-packages/vaex/encoding.py", line 469, in decode_dict
    decoded = {key: self.registry[typename].decode(self, value, **kwargs) for key, value in values.items()}
  File "/home/user/anaconda3/envs/py39/lib/python3.9/site-packages/vaex/encoding.py", line 469, in <dictcomp>
    decoded = {key: self.registry[typename].decode(self, value, **kwargs) for key, value in values.items()}
  File "/home/user/anaconda3/envs/py39/lib/python3.9/site-packages/vaex/encoding.py", line 247, in decode
    return DataType(np.dtype(type_spec))
TypeError: data type 'timestamp[ns]' not understood
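The last frame suggests what is going on: the encoding layer passes the stored type string straight to np.dtype, but 'timestamp[ns]' is the string form of the pyarrow type pa.timestamp('ns'), not a numpy dtype. A minimal sketch (my own diagnosis, independent of the vaex codebase) that reproduces the same TypeError:
import numpy as np
import pyarrow as pa

print(pa.timestamp('ns'))          # prints 'timestamp[ns]', the Arrow type string from the traceback
print(np.dtype('datetime64[ns]'))  # fine: the equivalent numpy dtype
np.dtype('timestamp[ns]')          # raises TypeError: data type 'timestamp[ns]' not understood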
Software information
Vaex and pandas were installed with pip, using Python 3.9.2. I am running on a Windows machine under WSL. I can reproduce this on multiple operating systems if that would be helpful.
Ubuntu 20.04.1 LTS (GNU/Linux 4.19.128-microsoft-standard x86_64)
Python 3.9.2
Name: vaex
Version: 4.3.0
Summary: Out-of-Core DataFrames to visualize and explore big tabular datasets
Home-page: https://www.github.com/vaexio/vaex
Author: Maarten A. Breddels
Author-email: maartenbreddels@gmail.com
License: MIT
Location: /home/user/anaconda3/envs/py39/lib/python3.9/site-packages
Requires: vaex-core, vaex-server, vaex-hdf5, vaex-ml, vaex-astro, vaex-jupyter, vaex-viz
Required-by:
Name: pandas
Version: 1.3.1
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: https://pandas.pydata.org
Author: The Pandas Development Team
Author-email: pandas-dev@python.org
License: BSD-3-Clause
Location: /home/user/anaconda3/envs/py39/lib/python3.9/site-packages
Requires: python-dateutil, numpy, pytz
Required-by: xarray, vaex-core, bqplot, alpaca-trade-api
Additional information
I tried a few things, such as making a deep copy of the data and saving it as hdf5, and opening and re-saving as arrow/feather. I have not found a workaround other than just using hdf5.
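One possible mitigation, which I have not verified (and which loads everything into memory, defeating the point of the arrow format on large data), would be to read the file with pyarrow directly and round-trip through pandas so the datetime column comes back as a numpy datetime64 instead of an Arrow timestamp:
import pyarrow.feather as feather
import vaex as vx

# Read the feather/arrow file with pyarrow, convert to pandas (which yields
# a numpy datetime64[ns] column), then hand it back to vaex.
table = feather.read_table('test.arrow')
feather_df = vx.from_pandas(table.to_pandas())
print(feather_df.datetime.max())  # should no longer hit the decode error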