Ben-Epstein opened this issue 1 year ago
Maybe I need to open a different issue for this, but it looks like Vaex cannot export this as an HDF5 file at all:
```python
import vaex
import numpy as np

x = str(np.random.randint(low=0, high=255, size=(32, 32, 10)).flatten().tolist())
df = vaex.from_arrays(id=list(range(125_000)), y=np.random.randint(low=0, high=1000, size=125_000))
df["text"] = vaex.vconstant(x, len(df))
df.export("test.hdf5")
```
This fails with the same error. It seems to happen at around 50k rows:
```python
import vaex
import numpy as np

n = 50_000
x = str(np.random.randint(low=0, high=255, size=(32, 32, 10)).flatten().tolist())
df = vaex.from_arrays(id=list(range(n)), y=np.random.randint(low=0, high=1000, size=n))
df["text"] = vaex.vconstant(x, len(df))
df.export("test.hdf5")
```
Digging deeper, I found that just the literal call to `combine_chunks` is failing in Arrow. I assume this is an Arrow bug then? @maartenbreddels any ideas for a potential workaround?
```python
import vaex
import numpy as np

n = 50_000
x = str(np.random.randint(low=0, high=255, size=(32, 32, 10)).flatten().tolist())
df = vaex.from_arrays(id=list(range(n)), y=np.random.randint(low=0, high=1000, size=n))
df["text"] = vaex.vconstant(x, len(df))
t = df.text.values.combine_chunks()
```
I filed this error with Arrow and got a reply: https://issues.apache.org/jira/browse/ARROW-17828
Wanted to update here with a working solution in case anyone finds themselves in a similar situation. PyArrow strings have a 2GB size limit, so you can upcast to `large_string` to avoid the issue! Vaex is actually much faster at handling this than native PyArrow, so doing it all in Vaex is easy:
```python
import pyarrow as pa
import vaex
import numpy as np
from vaex.dataframe import DataFrame

n = 50_000
x = str(np.random.randint(low=0, high=1000, size=(30_000,)).tolist())

# Create a df with a string column that is too large in total
df = vaex.from_arrays(
    id=list(range(n)),
    y=np.random.randint(low=0, high=1000, size=n),
)
df["text"] = vaex.vconstant(x, len(df))

# Byte limit for arrow strings.
# Because 1 character = 1 byte, the total number of characters in the
# column in question must be less than the size_limit.
size_limit = 2 * 1e9

def validate_str_cols(df: DataFrame) -> DataFrame:
    for col, dtype in zip(df.get_column_names(), df.dtypes):
        if dtype == str and df[col].str.len().sum() >= size_limit:
            df[col] = df[col].to_arrow().cast(pa.large_string())
    return df

# text is type string
print(df.dtypes)
df = validate_str_cols(df)
# text is now type large_string
print(df.dtypes)

y = df.text.values.combine_chunks()  # works!
df.export("file.hdf5", progress="rich")  # works!
```
Description
Vaex dfs with really long strings seem to have issues with I/O. I'm sure it's related to PyArrow in some way.
Software information
```python
import vaex; vaex.__version__
# {'vaex-core': '4.12.0', 'vaex-hdf5': '0.12.3'}
```
Additional information