vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] to_arrays(array_type='numpy') fails on Date32Array column type #1668

Closed by vadikmironov 3 years ago

vadikmironov commented 3 years ago

Description: When working with a dataset that has a date column, I ran into an issue related to the Arrow Date32Array column type (describe fails with a strange error about a type conversion not being implemented, which is probably in pyarrow and is fair enough). However, when experimenting with the array_type='numpy' parameter, I hit another issue that seems similar to https://github.com/vaexio/vaex/issues/1045, although that issue was reported as closed in an early 4.0 alpha.

I've been able to narrow this down to the following small snippet that demonstrates the problem:

import numpy as np
import vaex

if __name__ == '__main__':
    print(vaex.__version__)

    arrays_numpy = {'label': np.array(['date1', 'date2']),
                    'date': np.array([np.datetime64('2021-10-01'), np.datetime64('2021-10-02')])}
    test_df_numpy = vaex.from_arrays(**arrays_numpy)
    arrow_arrays = test_df_numpy.to_arrays(array_type='arrow')
    test_df_arrow = vaex.from_arrays(label=arrow_arrays[0], date=arrow_arrays[1])
    numpy_arrays = test_df_arrow.to_arrays(array_type='numpy')

which fails with the following:

File "D:\dev\projects\python_scratchpad/test_python.py", line 12, in <module>
    numpy_arrays = test_df_arrow.to_arrays(array_type='numpy')
  File "d:\dev\projects\.venv\lib\site-packages\vaex\dataframe.py", line 3024, in to_arrays
    return [array_types.convert(chunk, array_type) for chunk in self.evaluate(column_names, selection=selection, parallel=parallel)]
  File "d:\dev\projects\.venv\lib\site-packages\vaex\dataframe.py", line 3024, in <listcomp>
    return [array_types.convert(chunk, array_type) for chunk in self.evaluate(column_names, selection=selection, parallel=parallel)]
  File "d:\dev\projects\.venv\lib\site-packages\vaex\array_types.py", line 157, in convert
    return to_numpy(x, strict=True)
  File "d:\dev\projects\.venv\lib\site-packages\vaex\array_types.py", line 131, in to_numpy
    x = vaex.arrow.convert.column_from_arrow_array(x)
  File "d:\dev\projects\.venv\lib\site-packages\vaex\arrow\convert.py", line 84, in column_from_arrow_array
    return numpy_array_from_arrow_array(arrow_array)
  File "d:\dev\projects\.venv\lib\site-packages\vaex\arrow\convert.py", line 130, in numpy_array_from_arrow_array
    array = np.frombuffer(data_buffer, dtype, len(arrow_array) + offset)[offset:]
ValueError: buffer is smaller than requested size
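
The thread does not state the root cause of this ValueError. One plausible reading, offered only as an assumption, is that Arrow's date32 stores each value as a 32-bit count of days, so two values occupy 8 bytes, while the buffer is being read with an 8-byte numpy dtype that needs 16 bytes. The numpy-only sketch below reproduces the same message under that assumption; the variable names are illustrative and nothing here is taken from the vaex source.

```python
import numpy as np

# Assumption: mimic an Arrow date32 data buffer holding two values.
# date32 is a 32-bit day count since 1970-01-01; 18901 is 2021-10-01.
date32_buffer = np.array([18901, 18902], dtype='int32').tobytes()  # 8 bytes

# Asking for two 8-byte datetime64 values needs 16 bytes, so numpy refuses.
try:
    np.frombuffer(date32_buffer, dtype='datetime64[D]', count=2)
    message = None
except ValueError as exc:
    message = str(exc)

print(message)  # buffer is smaller than requested size

# Reading the same bytes as int32 first, then widening, recovers the dates.
days = (np.frombuffer(date32_buffer, dtype='int32', count=2)
        .astype('int64')
        .astype('datetime64[D]'))
print(days)
```

If this reading is right, any fix has to interpret the buffer as 32-bit integers before converting to datetime64, which is what the last step of the sketch does.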

JovanVeljanoski commented 3 years ago

Hi,

Thank you for reporting this. Actually, the issue you reported is fixed in the latest version on master. I believe the fix is out in the latest alpha (not sure, though).

However, running your exact example raises another error:

~/vaex/packages/vaex-core/vaex/array_types.py in numpy_dtype_from_arrow_type(arrow_type, strict)
    287         return map_arrow_to_numpy[arrow_type]
    288     except KeyError:
--> 289         raise NotImplementedError(f'Cannot convert {arrow_type}')
    290 
    291 

NotImplementedError: Cannot convert date32[day]

This is because numpy has only datetime64, i.e. there is no such thing as datetime32 in numpy. A way around this is to force numpy to operate at the nanosecond level instead of the day level used in your example. This requires explicitly stating the dtype, like this (following your example):

arrays_numpy = {'label': np.array(['date1', 'date2']),
                'date': np.array([np.datetime64('2021-10-01'), np.datetime64('2021-10-02')], dtype='datetime64[ns]')}

Now, when converting to Arrow, a 64-bit type is used (pyarrow infers timestamp[ns] for datetime64[ns] input, rather than date32[day]), which can be converted back to numpy.datetime64.

I hope this helps! J.

vadikmironov commented 3 years ago

Thanks a lot. I'll close the issue now and check once the version is cut and published.

JovanVeljanoski commented 3 years ago

Actually, you may try the "fix" I described earlier (specifying the ns dtype); it might already work for your version.