vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

function .isin() raises exception: pyarrow.lib.ArrowInvalid: Cannot append scalar of type string to builder for type large_string #2444

Open myloe00 opened 2 weeks ago

myloe00 commented 2 weeks ago

While debugging, I found that this exception is raised when vaex.expression.Expression.to_arrow returns a ChunkedArray with more than one chunk, like this:

<pyarrow.lib.ChunkedArray object at 0x000002109B68A270>
[
  [
    "B-XXXXX"
  ],
  [
    "C1-XXXXX"
  ]
]

So I tried changing the function vaex.expression.Expression.__arrow_array__ to:

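# Module-level imports this method relies on
import pyarrow as pa
from pyarrow.lib import ArrowInvalid

def __arrow_array__(self, type=None):
    values = self.to_arrow()
    try:
        res = pa.array(values, type=type)
    except ArrowInvalid:
        # The conversion failed on the multi-chunk result;
        # merge the chunks into one contiguous array and retry.
        values = values.combine_chunks()
        res = pa.array(values, type=type)
    return res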

With this change, the exception no longer occurs.

So, is it OK to fix it like this, or is there a better way to resolve this problem?

ddelange commented 2 weeks ago

hi @myloe00 :wave:

please fill out the bug template including a minimal reproducible example and a full stack trace