vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] dataframes with columns of large strings cannot be concatenated #2216

Open Ben-Epstein opened 1 year ago

Ben-Epstein commented 1 year ago


Description: Vaex DataFrames with very long strings seem to have issues with I/O. I'm sure it's related to pyarrow in some way.

Additional information

import numpy as np
import vaex

# One very long string (~47,000 characters): a stringified list of 10,240 random ints
x = str(np.random.randint(low=0, high=255, size=(32, 32, 10)).flatten().tolist())
df = vaex.from_arrays(id=list(range(125000)), y=np.random.randint(low=0, high=1000, size=125000))
# Repeat that long string on every one of the 125,000 rows
df["text"] = vaex.vconstant(x, len(df))

df.export("data0.arrow", progress="rich")
df.export("data1.arrow", progress="rich")
df.export("data2.arrow", progress="rich")
df.export("data3.arrow", progress="rich")

# Validate all files are good
for i in range(4):
    assert vaex.open(f"data{i}.arrow")

# Concatenate the four files and convert the result into one file -- this fails
vaex.open("data*.arrow", convert="all_data.arrow", progress="rich")
"""
~.venv/lib/python3.7/site-packages/vaex/hdf5/writer.py in write(self, df, chunk_size, parallel, progress, column_count, export_threads)
    128                         list(pool.map(write, enumerate(column_names_subgroup)))
    129                     else:
--> 130                         list(map(write, enumerate(column_names_subgroup)))
    131 
    132 

~.venv/lib/python3.7/site-packages/vaex/hdf5/writer.py in write(arg)
    123                     def write(arg):
    124                         i, name = arg
--> 125                         self.column_writers[name].write(values[i])
    126                         progressbar_columns[name](self.column_writers[name].progress)
    127                     if export_threads:

~.venv/lib/python3.7/site-packages/vaex/hdf5/writer.py in write(self, values)
    304         if no_values:
    305             # to_column = to_array
--> 306             from_sequence = _to_string_sequence(values)
    307             to_sequence = self.to_array.string_sequence.slice(self.to_offset, self.to_offset+no_values, self.string_byte_offset)
    308             self.string_byte_offset += to_sequence.fill_from(from_sequence)

~.venv/lib/python3.7/site-packages/vaex/column.py in _to_string_sequence(x, force)
    598             x = pa.array([], type=column.type)
    599         else:
--> 600             assert column.num_chunks == 1
    601             x = column.chunk(0)
    602 

AssertionError: 
"""

Ben-Epstein commented 1 year ago

Maybe I need to open a different issue for this, but it looks like Vaex cannot export this as an HDF5 file at all:

import numpy as np
import vaex

x = str(np.random.randint(low=0, high=255, size=(32, 32, 10)).flatten().tolist())
df = vaex.from_arrays(id=list(range(125000)), y=np.random.randint(low=0, high=1000, size=125000))
df["text"] = vaex.vconstant(x, len(df))
df.export("test.hdf5")

This fails with the same error.

Ben-Epstein commented 1 year ago

It seems to happen at about 50,000 rows:

import vaex
import numpy as np

n = 50_000

x = str(np.random.randint(low=0, high=255, size=(32, 32, 10)).flatten().tolist())
df = vaex.from_arrays(id=list(range(n)), y=np.random.randint(low=0, high=1000, size=n))
df["text"] = vaex.vconstant(x, len(df))
df.export("test.hdf5")

Ben-Epstein commented 1 year ago

Digging deeper, I found that just the literal call to combine_chunks is failing in Arrow. I assume this is an Arrow bug, then?

@maartenbreddels any ideas for a potential workaround?

import vaex
import numpy as np

n = 50_000

x = str(np.random.randint(low=0, high=255, size=(32, 32, 10)).flatten().tolist())
df = vaex.from_arrays(id=list(range(n)), y=np.random.randint(low=0, high=1000, size=n))
df["text"] = vaex.vconstant(x, len(df))

t = df.text.values.combine_chunks()  # fails inside pyarrow
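
For isolation, here is a sketch that reproduces the overflow with pure pyarrow and no Vaex at all (it needs a few GB of free RAM, and the exact error message varies by pyarrow version):

import pyarrow as pa

# Each chunk is well under 2GB, but together they exceed the 2**31-byte
# capacity of 32-bit string offsets, so they cannot be merged into one array
chunk = pa.array(["x" * 1_000_000] * 1100)  # ~1.1GB of string data
chunked = pa.chunked_array([chunk, chunk])  # ~2.2GB total
chunked.combine_chunks()                    # raises an offset-overflow error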

Ben-Epstein commented 1 year ago

I filed this error with Arrow and got a reply: https://issues.apache.org/jira/browse/ARROW-17828

Ben-Epstein commented 1 year ago

Wanted to update here with a working solution in case anyone finds themselves in a similar situation. PyArrow string arrays have a 2GB size limit (they use 32-bit offsets), so you can upcast to large_string (64-bit offsets) to avoid the issue! Vaex is actually much faster at handling this than native pyarrow, so doing it all in Vaex is easy:

import pyarrow as pa
import vaex
import numpy as np
from vaex.dataframe import DataFrame

n = 50_000
x = str(np.random.randint(low=0, high=1000, size=(30_000,)).tolist())
# Create a df with a string column whose total size is too large
df = vaex.from_arrays(
    id=list(range(n)),
    y=np.random.randint(low=0, high=1000, size=n)
)
df["text"] = vaex.vconstant(x, len(df))

# Byte limit for Arrow `string` arrays (32-bit offsets, true cap 2**31 ≈ 2.147e9).
# Assuming 1 character = 1 byte (ASCII), the total number of characters in the
# column in question must be less than size_limit.
size_limit = 2 * 1e9

def validate_str_cols(df: DataFrame) -> DataFrame:
    for col, dtype in zip(df.get_column_names(), df.dtypes):
        if dtype == str and df[col].str.len().sum() >= size_limit:
            # Upcast the column to large_string (64-bit offsets)
            df[col] = df[col].to_arrow().cast(pa.large_string())
    return df

# text is type string
print(df.dtypes)
df = validate_str_cols(df)
# text is type large_string
print(df.dtypes)

y = df.text.values.combine_chunks()  # works!
df.export("file.hdf5", progress="rich")  # works!