vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] datetime64 objects save in feather/arrow format incorrectly when converted from pandas #1475

Closed Nicholas-Schaub closed 3 years ago

Nicholas-Schaub commented 3 years ago

Description: Reloading a dataframe saved in arrow/feather format fails after the dataframe was converted from a pandas dataframe containing a datetime64 column/index. This does not occur when saving as hdf5. The general workflow to reproduce the error is:

  1. Create a pandas dataframe with a datetime64 column/index
  2. Convert using from_pandas
  3. Export using either export_feather or export_arrow
  4. Use open on the file and perform any operation on the datetime64 column

A short script to reproduce the error is:

import pandas as pd
import vaex as vx

df = pd.DataFrame({'float': [1.0],
                   'int': [1],
                   'datetime': [pd.Timestamp('20180310')],
                   'string': ['foo']})
new_df = vx.from_pandas(df)

new_df.export_feather('test.arrow')
new_df.export_hdf5('test.hdf5')

hdf5_df = vx.open('test.hdf5')
feather_df = vx.open('test.arrow')

hdf5_df.datetime.max()    # this works fine
feather_df.datetime.max() # this is where the error is thrown

The stack trace is:

Traceback (most recent call last):
  File "/mnt/c/Users/user/Documents/Python Scripts/finance/vaex_text.py", line 29, in <module>
    print(feather_df.datetime.max())
  File "/home/user/anaconda3/envs/py39/lib/python3.9/site-packages/vaex/expression.py", line 677, in max
    return self.ds.max(**kwargs)
  File "/home/user/anaconda3/envs/py39/lib/python3.9/site-packages/vaex/dataframe.py", line 1362, in max
    return self._compute_agg('max', expression, binby, limits, shape, selection, delay, edges, progress, array_type=array_type)
  File "/home/user/anaconda3/envs/py39/lib/python3.9/site-packages/vaex/dataframe.py", line 773, in _compute_agg
    return self._delay(delay, var)
  File "/home/user/anaconda3/envs/py39/lib/python3.9/site-packages/vaex/dataframe.py", line 1537, in _delay
    self.execute()
  File "/home/user/anaconda3/envs/py39/lib/python3.9/site-packages/vaex/dataframe.py", line 375, in execute
    just_run(self.execute_async())
  File "/home/user/anaconda3/envs/py39/lib/python3.9/site-packages/vaex/asyncio.py", line 35, in just_run
    return loop.run_until_complete(coro)
  File "/home/user/anaconda3/envs/py39/lib/python3.9/site-packages/nest_asyncio.py", line 70, in run_until_complete
    return f.result()
  File "/home/user/anaconda3/envs/py39/lib/python3.9/asyncio/futures.py", line 201, in result
    raise self._exception
  File "/home/user/anaconda3/envs/py39/lib/python3.9/asyncio/tasks.py", line 256, in __step
    result = coro.send(None)
  File "/home/user/anaconda3/envs/py39/lib/python3.9/site-packages/vaex/dataframe.py", line 380, in execute_async
    await self.executor.execute_async()
  File "/home/user/anaconda3/envs/py39/lib/python3.9/site-packages/vaex/execution.py", line 176, in execute_async
    task._parts = [encoding.decode('task-part-cpu', spec, df=run.df) for i in range(self.thread_pool.nthreads)]
  File "/home/user/anaconda3/envs/py39/lib/python3.9/site-packages/vaex/execution.py", line 176, in <listcomp>
    task._parts = [encoding.decode('task-part-cpu', spec, df=run.df) for i in range(self.thread_pool.nthreads)]
  File "/home/user/anaconda3/envs/py39/lib/python3.9/site-packages/vaex/encoding.py", line 449, in decode
    decoded = self.registry[typename].decode(self, value, **kwargs)
  File "/home/user/anaconda3/envs/py39/lib/python3.9/site-packages/vaex/cpu.py", line 38, in decode
    return cls.decode(encoding, spec, df)
  File "/home/user/anaconda3/envs/py39/lib/python3.9/site-packages/vaex/cpu.py", line 551, in decode
    dtypes = encoding.decode_dict('dtype', spec['dtypes'])
  File "/home/user/anaconda3/envs/py39/lib/python3.9/site-packages/vaex/encoding.py", line 469, in decode_dict
    decoded = {key: self.registry[typename].decode(self, value, **kwargs) for key, value in values.items()}
  File "/home/user/anaconda3/envs/py39/lib/python3.9/site-packages/vaex/encoding.py", line 469, in <dictcomp>
    decoded = {key: self.registry[typename].decode(self, value, **kwargs) for key, value in values.items()}
  File "/home/user/anaconda3/envs/py39/lib/python3.9/site-packages/vaex/encoding.py", line 247, in decode
    return DataType(np.dtype(type_spec))
TypeError: data type 'timestamp[ns]' not understood
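
The last frame of the trace can be reproduced with NumPy alone, which may help narrow down the cause. This is a minimal sketch, assuming the string 'timestamp[ns]' in the error is the Arrow-style type name being handed to np.dtype; the helper name parses_as_numpy_dtype is hypothetical and not part of vaex.

```python
import numpy as np


def parses_as_numpy_dtype(type_spec):
    """Return True if NumPy accepts type_spec as a dtype string."""
    try:
        np.dtype(type_spec)
    except TypeError:
        # NumPy raises "data type '...' not understood" for unknown names.
        return False
    return True


# Arrow-style spelling, as it appears in the traceback above.
arrow_style = parses_as_numpy_dtype("timestamp[ns]")
# NumPy's own spelling for nanosecond timestamps.
numpy_style = parses_as_numpy_dtype("datetime64[ns]")

print(arrow_style, numpy_style)
```

If the assumption holds, the failure is in decoding the dtype name rather than in the data written to the file.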

Software information: Vaex and pandas were installed with pip under Python 3.9.2. I am running on a Windows machine under WSL. I can reproduce on multiple operating systems if that would be helpful.

Ubuntu 20.04.1 LTS (GNU/Linux 4.19.128-microsoft-standard x86_64)
Python 3.9.2

Name: vaex
Version: 4.3.0
Summary: Out-of-Core DataFrames to visualize and explore big tabular datasets
Home-page: https://www.github.com/vaexio/vaex
Author: Maarten A. Breddels
Author-email: maartenbreddels@gmail.com
License: MIT
Location: /home/user/anaconda3/envs/py39/lib/python3.9/site-packages
Requires: vaex-core, vaex-server, vaex-hdf5, vaex-ml, vaex-astro, vaex-jupyter, vaex-viz
Required-by:

Name: pandas
Version: 1.3.1
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: https://pandas.pydata.org
Author: The Pandas Development Team
Author-email: pandas-dev@python.org
License: BSD-3-Clause
Location: /home/user/anaconda3/envs/py39/lib/python3.9/site-packages
Requires: python-dateutil, numpy, pytz
Required-by: xarray, vaex-core, bqplot, alpaca-trade-api

Additional information: I tried a few things, such as making a deep copy of the data, and saving as hdf5, then opening and re-saving as arrow/feather. I have not found a workaround other than to use hdf5.
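
One other avenue, offered only as an untested idea and not verified against vaex 4.3.0, would be to store the timestamps as int64 nanoseconds since the epoch before exporting and to cast them back after loading. The sketch below shows only the NumPy round trip; whether it sidesteps the arrow/feather problem is an assumption.

```python
import numpy as np

# Same date as in the reproduction script, held as a NumPy datetime64 array.
stamps = np.array(["2018-03-10"], dtype="datetime64[ns]")

# Cast to plain integers (nanoseconds since the Unix epoch).
as_int = stamps.astype("int64")

# Cast back to timestamps after reloading.
restored = as_int.astype("datetime64[ns]")

print(as_int.dtype, restored.dtype, bool((restored == stamps).all()))
```

The integer column carries no timestamp dtype, so the cast back has to be done explicitly each time the file is opened.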

maartenbreddels commented 3 years ago

Thanks for the report. This is fixed in master (just checked), probably as a result of merging #1300.