[BUG-REPORT] Filling missing/NaN values in dataframe in a loop modifies only the last column

vaexio / vaex

from scipy import stats as st
import numpy as np
import vaex as vx

a=[1,2,3,np.NaN,5,6]
b=['a','b','b',None,None,'u']
c=['michael','dwight','jim','pam',None,'stanley']
df=vx.from_arrays(x=a, y=b,z=c)

for col in df.column_names:
    if df[col].dtype=='string':
        temp_df = df.fillna(value='dontknow', column_names=[col])
    else:
        temp_df = df.fillna(value=st.mode(df[col].values)[0][0], column_names=[col])

That happens because on each pass of the loop you create a new temp_df from the original df, which throws away the fill you applied in the previous pass. Only the fill from the last pass (the last column) survives.
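
To make the pitfall concrete, here is a minimal pure-Python sketch of the same pattern. It does not use vaex: the "dataframe" is just a dict of lists, and `fill_column` is a hypothetical helper written for this illustration that, like `fillna`, returns a new object instead of modifying its input.

```python
# Pure-Python stand-in for a dataframe: a dict mapping column name -> list.
# fill_column is a hypothetical helper for illustration, not a vaex function.
def fill_column(table, column, value):
    """Return a NEW table with None in `column` replaced by `value`."""
    filled = dict(table)  # shallow copy; the input table is left untouched
    filled[column] = [value if v is None else v for v in table[column]]
    return filled


original = {"y": ["a", None, "b"], "z": [None, "jim", "pam"]}

# Pitfall: every pass starts again from `original`, so earlier fills are lost.
for col in original:
    wrong = fill_column(original, col, "dontknow")

# Fix: feed the previous result back in so the fills accumulate.
right = original
for col in original:
    right = fill_column(right, col, "dontknow")

print(wrong)  # only the last column ("z") is filled
print(right)  # both columns are filled
```

The only difference between the two loops is which object is passed back into the helper on each pass.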

I modified your code a bit to make it work:


import vaex  # do not import vaex as vx please
import numpy as np

a = [1, 2, 3, np.nan, 5, 6]
b = ['a', 'b', 'b', None, None, 'u']
c = ['michael', 'dwight', 'jim', 'pam', None, 'stanley']
df = vaex.from_arrays(x=a, y=b, z=c)

temp_df = df.copy() # This is a shallow copy, no memory is used

for col in temp_df.get_column_names():
    if temp_df[col].is_string():
        temp_df = temp_df.fillna(value='dontknow', column_names=[col])
    else:
        mode = df[col].value_counts(dropna=True).index[0]  # This will run out of core, but you need to check for ties yourself
        temp_df = temp_df.fillna(value=mode, column_names=[col])
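
On checking for ties: one possible sketch is below. It assumes you have turned the `value_counts` result into a plain value-to-count mapping first (for example with `.to_dict()`, if the result is a pandas Series). `modes_with_ties` is a hypothetical helper name, and picking the smallest tied value is just one arbitrary but deterministic tie-break.

```python
def modes_with_ties(counts):
    """Return every value that shares the highest count, in sorted order."""
    if not counts:
        return []
    top = max(counts.values())
    return sorted(value for value, n in counts.items() if n == top)


# Example mapping of value -> number of occurrences (made-up numbers).
counts = {1.0: 2, 5.0: 2, 3.0: 1}

candidates = modes_with_ties(counts)
if len(candidates) > 1:
    print("tie between:", candidates)

# Arbitrary tie-break for this sketch: take the smallest tied value.
fill_value = candidates[0]
```

If there is a tie you may prefer a different rule, such as falling back to the median for numeric columns.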

I hope this helps!

[BUG-REPORT] Filling missing/NaN values in dataframe in a loop modifies only the last column #2159