vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] Filling missing/NaN values in dataframe in a loop modifies only the last column #2159

Closed: Ydv-aakash closed this issue 2 years ago

Ydv-aakash commented 2 years ago
from scipy import stats as st
import numpy as np
import vaex as vx

a=[1,2,3,np.NaN,5,6]
b=['a','b','b',None,None,'u']
c=['michael','dwight','jim','pam',None,'stanley']
df=vx.from_arrays(x=a, y=b,z=c)

for col in df.column_names:
    if df[col].dtype=='string':
        temp_df = df.fillna(value='dontknow', column_names=[col])
    else:
        temp_df = df.fillna(value=st.mode(df[col].values)[0][0], column_names=[col])

I am using the above code to replace missing/NaN values: numerical features get the column's mode, and categorical (string) features get a fixed string value. However, after running this code, only the last column is modified.

JovanVeljanoski commented 2 years ago

That happens because fillna does not modify df in place; it returns a new dataframe. On each pass through the loop you build temp_df from the original df again, so the fill from the previous pass is discarded and only the last column's fill survives.
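
To make the difference concrete without needing vaex, here is a minimal pure-Python sketch. The `fill` function and the dict-based "dataframe" are hypothetical stand-ins, not vaex API; the only property they share with `fillna` is that they return a new object and leave the input untouched.

```python
# Hypothetical stand-in for a dataframe: fill() returns a NEW dict and never
# mutates its input, mirroring how fillna is used in this thread.
def fill(table, column, value):
    filled = dict(table)
    filled[column] = [value if v is None else v for v in table[column]]
    return filled


df = {"y": ["a", None], "z": [None, "jim"]}

# Buggy pattern: every pass starts again from the untouched df,
# so only the fill from the final pass is kept.
for col in df:
    temp_df = fill(df, col, "dontknow")
buggy = temp_df

# Fixed pattern: every pass builds on the result of the previous pass.
temp_df = df
for col in df:
    temp_df = fill(temp_df, col, "dontknow")
fixed = temp_df
```

In `buggy` the `y` column still contains `None`, while in `fixed` both columns are filled. The original `df` is unchanged in both cases.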

I modified your code a bit to make it work:


import vaex  # do not import vaex as vx please
import numpy as np

a = [1, 2, 3, np.nan, 5, 6]
b = ['a', 'b', 'b', None, None, 'u']
c = ['michael', 'dwight', 'jim', 'pam', None, 'stanley']
df = vaex.from_arrays(x=a, y=b, z=c)

temp_df = df.copy() # This is a shallow copy, no memory is used

for col in temp_df.get_column_names():
    if temp_df[col].is_string():
        temp_df = temp_df.fillna(value='dontknow', column_names=[col])
    else:
        mode = df[col].value_counts(dropna=True).index[0]  # This will run out of core, but you need to check for ties yourself
        temp_df = temp_df.fillna(value=mode, column_names=[col])
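
On the tie check: taking the first entry of the counts silently picks one winner when several values share the highest count. A small sketch of one way to detect that follows. The helper name `top_values` is made up for illustration, and it assumes the counts have already been turned into a plain dict of value to count (for example via `to_dict()`, on the assumption that `value_counts` hands back a pandas Series).

```python
def top_values(counts):
    """Return every value that shares the highest count, in sorted order."""
    if not counts:
        return []
    best = max(counts.values())
    # Keep all values whose count equals the maximum, not just the first one.
    return sorted(value for value, n in counts.items() if n == best)


# Counts for the non-missing values of the example columns in this thread.
x_counts = {1.0: 1, 2.0: 1, 3.0: 1, 5.0: 1, 6.0: 1}  # every value appears once
y_counts = {"a": 1, "b": 2, "u": 1}                   # 'b' is the only mode

x_modes = top_values(x_counts)
y_modes = top_values(y_counts)

# More than one entry means the mode is ambiguous and needs a decision.
x_is_tied = len(x_modes) > 1
y_is_tied = len(y_modes) > 1
```

For the example data, column x has five tied values, so whatever value ends up first in the counts is an arbitrary choice there; column y has a single mode. How to break a tie (smallest value, first seen, or something else) is up to you.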

I hope this helps!