vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] Null vconstant dtypes converted to null in export_many() #2314

Open NickCrews opened 1 year ago

NickCrews commented 1 year ago
import shutil
import vaex
from pathlib import Path

df = vaex.from_arrays(normal=[1, 2, 3])
df["str"] = vaex.vconstant("hi", length=len(df), dtype="str")
df["str_null"] = vaex.vconstant(None, length=len(df), dtype="str")
df["int"] = vaex.vconstant(5, length=len(df), dtype="int8")
df["int_null"] = vaex.vconstant(None, length=len(df), dtype="int8")

d = Path("export_test/")
shutil.rmtree(d, ignore_errors=True)
d.mkdir(exist_ok=True)

# dtype is preserved fine when writing to one file
one = d / "single.parquet"
df.export(one)
df_one = vaex.open(one)

# bug: dtype is converted to null when writing to multiple files
write_many = d / "chunk-{i}.parquet"
read_many = d / "chunk-*.parquet"
df.export_many(write_many, chunk_size=1)
df_many = vaex.open(read_many)

print(df.dtypes)
print(df_one.dtypes)
print(df_many.dtypes)

This prints the following (original DataFrame, single-file export, multi-file export, in that order):

normal       int64
str         string
str_null    string
int           int8
int_null      int8
dtype: object
normal       int64
str         string
str_null    string
int           int8
int_null      int8
dtype: object
normal       int64
str         string
str_null      null
int          int64
int_null      null
dtype: object
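
To make the difference between the printouts easier to spot, the dtypes can be compared column by column. This is only a minimal sketch: the two dictionaries are copied by hand from the output above, and `changed_dtypes` is a hypothetical helper written for this illustration, not part of vaex.

```python
# dtypes copied by hand from the printed output above
original = {
    "normal": "int64",
    "str": "string",
    "str_null": "string",
    "int": "int8",
    "int_null": "int8",
}
after_export_many = {
    "normal": "int64",
    "str": "string",
    "str_null": "null",
    "int": "int64",
    "int_null": "null",
}


def changed_dtypes(before, after):
    """Return {column: (dtype_before, dtype_after)} for columns whose dtype differs."""
    return {
        name: (before[name], after[name])
        for name in before
        if before[name] != after[name]
    }


# Print one line per column whose dtype did not survive the round trip.
for name, (old, new) in changed_dtypes(original, after_export_many).items():
    print(f"{name}: {old} -> {new}")
```

Going by the output above, this flags "str_null" and "int_null" (now null) and also the non-null "int" column, which comes back as int64 instead of int8.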

I debugged a little, and I think the problem is at https://github.com/vaexio/vaex/blob/652937db59ef099a42ad650cdb19567dcbe1905a/packages/vaex-core/vaex/dataframe.py#L6445, where DataFrame.data_type() incorrectly returns null for the "str_null" and "int_null" columns. I might be wrong, though. At that point I gave up and worked around the problem by avoiding vaex.vconstant() altogether.

Software information