vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] Offset Overflow when joining dataframes with large strings #2335

Open · Ben-Epstein opened this issue 1 year ago

Ben-Epstein commented 1 year ago


Additional information

This likely stems from https://issues.apache.org/jira/browse/ARROW-17828 and is probably related to https://github.com/vaexio/vaex/issues/2334.
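For context, here is a minimal illustration (nothing vaex-specific) of the two Arrow string types involved: a regular string array addresses its character data with 32-bit offsets, which caps a single array at roughly 2GB, while large_string uses 64-bit offsets.

import pyarrow as pa

# a regular string array stores its character data behind int32 offsets,
# so a single array tops out near 2**31 bytes (~2GB);
# large_string uses int64 offsets and lifts that cap
small = pa.array(["abc"], type=pa.string())
big = pa.array(["abc"], type=pa.large_string())
print(small.type, big.type)  # -> string large_string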

Part 1

We create a dataframe whose text column holds more total string data than the 2GB maximum of a regular Arrow string array. We write it to an arrow file, and that works fine. Note that if we wrote this to an hdf5 file without converting to a large string, we'd get an error; see #2334.

import vaex
import numpy as np

# from https://issues.apache.org/jira/browse/ARROW-17828
# Arrow's string type maxes out at 2GB of data; past that you need large_string
maxsize_string = 2e9

df = vaex.example()
df["text"] = vaex.vconstant("OHYEA" * 2000, len(df))
df["id"] = np.arange(len(df))

# the column's total string data exceeds the 2GB limit
assert df.text.str.len().sum() > maxsize_string

df[["x", "y", "z", "text", "id"]].export("part1.arrow")

Part 2

If we join this in-memory dataframe with another, there are no issues:

# Works fine
newdf = vaex.example()[["id"]]
newdf["col1"] = vaex.vconstant("foo", len(newdf))

join_df = newdf.join(df, on="id")
join_df
(screenshot of the joined dataframe omitted)
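In place of the screenshot, a quick sanity check (a sketch, assuming vaex's default left join) shows the in-memory join materializes without error:

# stands in for the screenshot: the join keeps every row of newdf and
# pulls in the wide text column without raising
assert len(join_df) == len(newdf)
print(join_df[["id", "col1", "text"]].head(3))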

Part 3 (issue)

BUT, if we instead read that same file back from disk, where it is memory mapped, joining it to another dataframe causes an issue:

# Reading from disk will break the join
newdf2 = vaex.example()[["id"]]
newdf3 = vaex.open("part1.arrow")  # memory-mapped rather than in-memory

newdf2["col1"] = vaex.vconstant("foo", len(newdf2))

join_df2 = newdf2.join(newdf3, on="id")
join_df2

This throws the following error:

    return call_function('take', [data, indices], options, memory_pool)
  File "pyarrow/_compute.pyx", line 560, in pyarrow._compute.call_function
  File "pyarrow/_compute.pyx", line 355, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
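For intuition, the same failure mode can be reproduced with pyarrow alone (a sketch; it assumes a pyarrow version affected by ARROW-9773, where take on a ChunkedArray concatenates the chunks first, and it allocates a few GB of memory):

import pyarrow as pa
import pyarrow.compute as pc

# two chunks of ~1.2GB of string data each: each chunk fits within int32
# offsets, but their concatenation (~2.4GB) does not
chunk = pa.array(["x" * 1200] * 1_000_000)   # ~1.2e9 bytes of characters
chunked = pa.chunked_array([chunk, chunk])   # ~2.4e9 bytes in total

# on affected versions this raises:
# pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
pc.take(chunked, pa.array(range(len(chunked))))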

Related issues

This is, I think, the root cause: https://issues.apache.org/jira/browse/ARROW-9773. This one is certainly related: https://issues.apache.org/jira/browse/ARROW-17828.

More usefully, I think the core of the issue is actually the .take call used here:

  File ".venv/lib/python3.9/site-packages/pyarrow/compute.py", line 470, in take
    return call_function('take', [data, indices], options, memory_pool)

And the reason I think that is this closely related issue from the huggingface datasets repo, https://github.com/huggingface/datasets/issues/615, whose fix simply removes the use of take: https://github.com/huggingface/datasets/pull/645
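In the same spirit, one possible workaround sketch (safe_take is a hypothetical helper, not vaex's or huggingface's actual fix): upcasting regular string chunks to large_string before the take gives the concatenation 64-bit offsets, so it cannot overflow.

import pyarrow as pa
import pyarrow.compute as pc

def safe_take(chunked: pa.ChunkedArray, indices):
    # hypothetical helper: upcast 32-bit-offset strings to large_string
    # so that concatenating chunks inside take cannot overflow
    if pa.types.is_string(chunked.type):
        chunked = chunked.cast(pa.large_string())
    return pc.take(chunked, indices)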