vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] Cannot call `to_numpy` on list but numpy can convert directly #2116

Open Ben-Epstein opened 2 years ago

Ben-Epstein commented 2 years ago

Software information

Additional information

import vaex
import numpy as np
import pyarrow as pa

x = vaex.from_arrays(
    vals = pa.array([[1,2,3,4], [4,3,6,2,], [5,3,2,1]])
)
x.vals.to_numpy()  # TypeError: type unsupported: ListType(list<item: int64>)

np.array(x.vals.values)  # fine

JovanVeljanoski commented 2 years ago

Hi @Ben-Epstein

I am not yet sure if this is a bug. I think we've discussed something similar to this in the past, but I can't find the thread at the moment.

The column you are creating has an Arrow type, which is a list of lists. NumPy does not have an equivalent, I believe, which is why the to_numpy() export does not work.

What your workaround does is create a NumPy array of NumPy arrays (which makes the dtype object), and we do not support that.

maartenbreddels commented 2 years ago

Yeah, I think vaex tries to protect you from doing this; it doesn't want to create dtype=object. Should we allow this under a particular flag maybe?

JovanVeljanoski commented 2 years ago

I think we should just not support objects. It is bad practice, and it can lead us down a dark path (like in the past...).

Ben-Epstein commented 2 years ago

I see. In this particular example all of the lists are of the same length, so you end up with a numpy array that is not dtype object, because numpy can build a regular 2D array from them.

Is that a special case to support? Or maybe it's too hard to check for?

JovanVeljanoski commented 2 years ago

As you have written it, it is not a special case: a NumPy array expects all elements inside to be primitives and of the same type.

What typically happens if you do something like

import numpy as np
l = [1, 2, 3]
x = np.array([l, l])

NumPy will see that the dimensions align and will convert this from a list of lists to a 2D ndarray.

But if you do

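l1 = [1, 2, 3]
l2 = [1, 2]
# Ragged input: older NumPy warned and fell back to dtype=object;
# NumPy 1.24+ raises ValueError unless dtype=object is passed explicitly.
x = np.array([l1, l2], dtype=object)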

Such a conversion is not possible, and you get an array of lists, which by definition has dtype object.
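To make the difference between the two cases concrete, here is a minimal sketch using only NumPy. The helper name `to_ndarray` is hypothetical (it is not part of NumPy or vaex); it just checks the row lengths up front and asks for dtype=object explicitly when they differ, which works across NumPy versions.

```python
import numpy as np

def to_ndarray(rows):
    """Hypothetical helper: build an ndarray from a list of lists.

    Rows of equal length become a regular 2D array; ragged rows become
    a 1D object array whose elements are the original Python lists.
    """
    lengths = {len(row) for row in rows}
    if len(lengths) <= 1:
        return np.array(rows)  # rectangular input: regular 2D array
    return np.array(rows, dtype=object)  # ragged input: array of lists

rect = to_ndarray([[1, 2, 3], [1, 2, 3]])
ragged = to_ndarray([[1, 2, 3], [1, 2]])

print(rect.shape, rect.dtype)      # 2D shape, integer dtype
print(ragged.shape, ragged.dtype)  # 1D shape, dtype object
```

The rectangular result supports vectorized arithmetic directly, while the object array is just a container of Python lists.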

Actually, going back to your original example:

vals = pa.array([[1,2,3,4], [4,3,6,2], [5,3,2,1]])

# but now
vals.to_numpy()  # gives an error for the very same reason discussed above

# but this
vals.tolist()  # will work, as it will return a list of lists

In vaex you can have arbitrarily large data, so evaluations and computations are done in chunks. You cannot know ahead of time whether all lists will have the same number of elements, which NumPy requires (I think) to cast a list of lists to a 2D ndarray. So when you export, the data needs to go to a type that supports the list structure directly (if this makes sense).
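A rough sketch of why chunking makes this hard, using plain Python lists and NumPy rather than vaex itself. The function `stack_chunks` and the chunk layout are assumptions made for illustration only, not how vaex evaluates expressions: the point is that a ragged row can show up in any chunk, so rectangularity is only known once every chunk has been seen.

```python
import numpy as np

def stack_chunks(chunks):
    """Hypothetical helper: stack chunks of list rows into one 2D array.

    Raises ValueError as soon as a row has a different length from the
    first row seen, which may only happen in a late chunk.
    """
    width = None
    parts = []
    for chunk in chunks:
        for row in chunk:
            if width is None:
                width = len(row)
            elif len(row) != width:
                raise ValueError(
                    f"ragged rows: expected length {width}, got {len(row)}"
                )
        if chunk:  # skip empty chunks so concatenate sees only 2D parts
            parts.append(np.asarray(chunk))
    if not parts:
        return np.empty((0, 0))
    return np.concatenate(parts)

# All rows have length 2, so the chunks stack into a (3, 2) array.
stacked = stack_chunks([[[1, 2], [3, 4]], [[5, 6]]])
print(stacked.shape)
```

If the second chunk contained a row such as `[5]`, the failure would surface only after the first chunk had already been converted, which is the situation an export to a list-aware type avoids.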