vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] `vaex.agg.list` combined with `col.str.join(',')` is corrupting the dataframe #2111

Closed Ben-Epstein closed 2 years ago

Ben-Epstein commented 2 years ago


import numpy as np
import vaex

text = vaex.from_arrays(
    id=list(range(10_000)),
    text=[f"sentence number {i}" for i in range(10_000)],
)
df = vaex.from_arrays(
    id=list(range(25_000)),
    sample_id=[i%10_000 for i in range(25_000)],
    score=np.random.rand(25_000)
)

text.export("text_data.hdf5")
df.export("data.hdf5")

text_df = vaex.open("text_data.hdf5")
data_df = vaex.open("data.hdf5")

df = data_df.join(text_df, left_on="sample_id", right_on="id", rsuffix="_R").drop("id_R")

df["result"] = '{"input":' + '"' + df.text + '"' + ', "score":' + df.score.astype('str') + "}"
df = df.groupby("sample_id", agg={"result": vaex.agg.list})  # THIS IS THE LINE THAT BREAKS THINGS

df["result"] = "[" + df.result.str.join(",") + "]"
display(df[df["sample_id"]==8668].result.tolist())

df.export("test.csv") 

newdf = vaex.open('test.csv')
display(newdf[newdf["sample_id"]==8668].result.tolist())

If you swap out the `str.join(",")` call and use a custom registered function to get the desired result:

import pyarrow as pa
@vaex.register_function()
def format_json(arr):
    return [",".join(a.as_py()) for a in arr]

and call `df["result"] = "[" + df.result.format_json() + "]"` instead, everything works.
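For checking which output is the corrupted one, it can help to have a reference that does not go through vaex at all. The following is a minimal pure-Python sketch of what each grouped `result` value should look like; the helper name `expected_results` and the small sample data are hypothetical and only mirror the shape of the reproduction above.

```python
import json
from collections import defaultdict


def expected_results(sample_ids, texts, scores):
    """Build the expected JSON-array string for every sample_id.

    Mirrors the reproduction: one JSON fragment per row, fragments grouped
    by sample_id, joined with "," and wrapped in square brackets.
    """
    grouped = defaultdict(list)
    for sid, text, score in zip(sample_ids, texts, scores):
        fragment = '{"input":"' + text + '", "score":' + str(score) + "}"
        grouped[sid].append(fragment)
    return {sid: "[" + ",".join(parts) + "]" for sid, parts in grouped.items()}


# Hypothetical miniature of the data in the report: 10 rows, 4 distinct samples.
sample_ids = [i % 4 for i in range(10)]
texts = [f"sentence number {sid}" for sid in sample_ids]
scores = [round(0.1 * i, 1) for i in range(10)]

expected = expected_results(sample_ids, texts, scores)

# Each value should parse as valid JSON; a corrupted join typically will not.
parsed = json.loads(expected[1])
```

Comparing `json.loads` of a row from the vaex dataframe against this reference is a quick way to tell whether a given row was affected.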

JovanVeljanoski commented 2 years ago

Thanks for this, @Ben-Epstein! It was a bit tricky to reproduce, but I got it.

The issue is not in the groupby or the export. It is caused by the `str.join` method, and it seems to happen when that method operates on an expression of (Arrow) type large_list(large_string).

I hope we can fix it soon!