Description
Joining a vaex dataframe to a memory-mapped .arrow file whose string column holds more than 2GB of data fails with pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays. The steps to reproduce are below.
Software information
Vaex version (import vaex; vaex.__version__): 4.16.1
Vaex was installed via: pip
Part 1
We create a dataframe where the text column is larger than the maximum size for an Arrow string, then write it to an arrow file, and that works fine. Note that if we wrote this to an hdf5 file without converting to a large string, we would get an error - see #2334.
import vaex
import numpy as np
# from https://issues.apache.org/jira/browse/ARROW-17828
# the Arrow string type uses 32-bit offsets, so it maxes out at 2GB; beyond that you need large_string
maxsize_string = 2e9
df = vaex.example()
df["text"] = vaex.vconstant("OHYEA"*2000, len(df))
df["id"] = np.array(list(range(len(df))))
assert df.text.str.len().sum() > maxsize_string
df[["x", "y", "z", "text", "id"]].export("part1.arrow")
Part 2
If we join this dataframe with another, we see that there are no issues:
# Works fine
newdf = vaex.example()[["id"]]
newdf["col1"] = vaex.vconstant("foo", len(newdf))
join_df = newdf.join(df, on="id")
join_df
Part 3 (issue)
BUT, if we instead read that file back from disk, where it is memory-mapped, joining it to another dataframe causes an issue:
# Reading from disk will break the join
newdf2 = vaex.example()[["id"]]
newdf3 = vaex.open("part1.arrow")
newdf2["col1"] = vaex.vconstant("foo", len(newdf))
join_df2 = newdf2.join(newdf3, on="id")
join_df2
This throws the error:
    return call_function('take', [data, indices], options, memory_pool)
  File "pyarrow/_compute.pyx", line 560, in pyarrow._compute.call_function
  File "pyarrow/_compute.pyx", line 355, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
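For context, here is a minimal pyarrow-only sketch of what I believe is happening (my reading of ARROW-9773, not verified against vaex internals): take() on a chunked string array first concatenates the chunks into a single array, and once the combined data exceeds 2GB the 32-bit string offsets overflow.
import pyarrow as pa
import pyarrow.compute as pc

# one chunk holds ~1.2GB of string data, safely under the 2GB limit;
# referencing it twice pushes the combined size past 2^31 bytes
chunk = pa.array(["x" * 1024] * 1_200_000)
chunked = pa.chunked_array([chunk, chunk])

# in pyarrow versions where take() concatenates the chunks first,
# this raises: ArrowInvalid: offset overflow while concatenating arrays
pc.take(chunked, pa.array([0]))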
Additional information
This likely has to do with https://issues.apache.org/jira/browse/ARROW-17828 and is probably related to https://github.com/vaexio/vaex/issues/2334.
Related issues
This is, I think, the root cause: https://issues.apache.org/jira/browse/ARROW-9773
This is certainly related: https://issues.apache.org/jira/browse/ARROW-17828
More usefully, I think the core of the issue is actually the .take used here. The reason I think that is this very related issue from the huggingface datasets repo, https://github.com/huggingface/datasets/issues/615, and their fix, which simply removes the use of take: https://github.com/huggingface/datasets/pull/645
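A possible workaround sketch along the same lines (my assumption, not a tested fix for the vaex join itself): casting the chunked data to large_string gives it 64-bit offsets, so the concatenation inside take() no longer overflows.
import pyarrow as pa
import pyarrow.compute as pc

chunk = pa.array(["x" * 1024] * 1_200_000)
chunked = pa.chunked_array([chunk, chunk])

# large_string uses 64-bit offsets, so concatenating >2GB of data is fine
big = chunked.cast(pa.large_string())
pc.take(big, pa.array([0]))  # no overflow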