vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

Unexpected Discrepancy Between Printed Values and Returned Results in vaex.apply() Function #2420

Open johannesmphaka opened 6 months ago

johannesmphaka commented 6 months ago

Hey there. I'm currently experimenting with vaex for processing large datasets in Python. I ran into unexpected behavior when applying a custom function with vaex.apply: printing the result inside the function shows the correct value, but the values returned by apply appear to be incorrect. Here's a simplified version of my code:

import numpy as np
import pandas as pd
import vaex
from scipy.stats import gamma

Creating a DataFrame

d = {'A':[i for i in range(1000000)]}
df = pd.DataFrame(data=d)
a, b = 0.09717545806463647, 407034.13749400195

Setting up random seed

np.random.seed(1234)

Defining a custom function

def my_func(A):
    f = np.random.poisson(lam=100)
    sim = np.random.uniform(low=0, high=1, size=f)
    lossx1 = np.sum(gamma.ppf(sim, a, scale=b))
    print(lossx1)  # Printing the loss value for debugging
    return np.array(lossx1)

Converting DataFrame to vaex DataFrame

df_vaex = vaex.from_pandas(df)

Applying the function using vaex

df_result = df_vaex.apply(my_func, arguments=[df_vaex["A"]], vectorize=True, multiprocessing=False).values
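A possible explanation, offered as an assumption and not a confirmed diagnosis: the vaex documentation describes vectorize=True as calling the function with arrays instead of scalars, so my_func may run once per chunk and return a single scalar for all rows of that chunk. The sketch below uses only NumPy (no vaex) to show an array-aware variant that returns one simulated loss per input element. The names simulate_loss and my_func_per_chunk are hypothetical, and sampling the gamma distribution directly with rng.gamma is swapped in for the gamma.ppf(uniform) inverse-transform step in the snippet above.

```python
import numpy as np

# Gamma shape and scale taken from the snippet above.
a, b = 0.09717545806463647, 407034.13749400195


def simulate_loss(rng):
    """One aggregate loss: a Poisson number of gamma-distributed claims, summed."""
    n_claims = rng.poisson(lam=100)
    # Assumption: direct gamma sampling stands in for gamma.ppf(uniform, a, scale=b).
    return rng.gamma(shape=a, scale=b, size=n_claims).sum()


def my_func_per_chunk(A, rng):
    """Hypothetical array-aware variant: one independent loss per element of A."""
    A = np.atleast_1d(A)
    return np.array([simulate_loss(rng) for _ in range(A.shape[0])])


rng = np.random.default_rng(1234)
chunk = np.arange(5)  # stands in for one chunk of column "A"
losses = my_func_per_chunk(chunk, rng)
print(losses.shape)  # same length as the input chunk
```

If the assumption about chunked calls holds, returning an array with the same length as the input is what keeps the result aligned with the rows. Whether this matches what vaex does internally in this case would still need to be checked against the library.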

Software information