vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] MinMaxScaler not working after filtering #2411

Open fwitter opened 5 months ago


I have a dataset stored in parquet format. I want to filter the dataset by a categorical column and then scale the numerical columns, which make up the vast majority of the columns.

import vaex
from vaex.ml import MinMaxScaler

data = vaex.open('my_parquet_file_dir')
data = data[data.filter_col == "A"]
scaler = MinMaxScaler(features=feat_cols, prefix='prep_')
scaler.fit_transform(data)

When running the above code, I encounter the following error:

IndexError                                Traceback (most recent call last)
      3 scaler = MinMaxScaler(features=feat_cols, prefix='prep_')
----> 4 scaler.fit_transform(data)

File ~/conda/lib/python3.9/site-packages/vaex/ml/transformations.py:46, in Transformer.fit_transform(self, df)
     39 '''Fit and apply the transformer to the supplied DataFrame.
     40 
     41 :param df: A vaex DataFrame.
     42 
     43 :returns copy: A shallow copy of the DataFrame that includes the transformations.
     44 '''
     45 self.fit(df=df)
---> 46 return self.transform(df=df)

File ~/conda/lib/python3.9/site-packages/vaex/ml/transformations.py:719, in MinMaxScaler.transform(self, df)
    717     b = self.feature_range[1]
    718     expr = copy[feature]
--> 719     expr = (b-a)*(expr-self.fmin_[i])/(self.fmax_[i]-self.fmin_[i]) + a
    720     copy[name] = expr
    721 return copy

IndexError: list index out of range

The reason for this error is that fmin_ and fmax_ are empty after fit is called. Normally, they contain the minimum and maximum of each column to be scaled.
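For reference, the transform step at line 719 of transformations.py applies the standard min-max formula per feature. A minimal NumPy sketch (with a hypothetical helper, not vaex's actual implementation) shows both the formula and why empty fmin_/fmax_ lists produce exactly this IndexError:

```python
import numpy as np

def minmax_transform(X, fmin, fmax, feature_range=(0.0, 1.0)):
    """Apply the same formula as in MinMaxScaler.transform:
    (b - a) * (x - fmin[i]) / (fmax[i] - fmin[i]) + a
    """
    a, b = feature_range
    out = np.empty_like(X, dtype=float)
    for i in range(X.shape[1]):
        # If fit left fmin/fmax empty, fmin[0] already raises IndexError
        out[:, i] = (b - a) * (X[:, i] - fmin[i]) / (fmax[i] - fmin[i]) + a
    return out

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
scaled = minmax_transform(X, fmin=[1.0, 10.0], fmax=[3.0, 30.0])
print(scaled)  # each column mapped to [0, 1]

# Reproduce the failure mode: empty statistics, as after the buggy fit
try:
    minmax_transform(X, fmin=[], fmax=[])
except IndexError as e:
    print(e)  # list index out of range
```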

However, when I remove the filter step, MinMaxScaler works as expected.

data = vaex.open('my_parquet_file_dir')
# data = data[data.filter_col == "A"]
scaler = MinMaxScaler(features=feat_cols, prefix='prep_')
scaler.fit_transform(data)

Software information

Additional information The dataset is distributed across 100 parquet files. The shape of the data is around 3M rows and 120 columns.

I tried to create a minimal dataset to reproduce the error but failed. Even when I create a dataset with similar properties, like the one below, filtering and MinMaxScaler still work as expected.

import os

import numpy as np
import pandas as pd

os.makedirs('my_parquet_file_dir', exist_ok=True)
for i in range(100):
    # 120 numerical columns plus one categorical filter column per file
    data_dict = {f'col{c}': np.linspace(c, 100 * (c + 1), 30000) for c in range(120)}
    data_dict['filter_col'] = np.random.choice(['A', 'B'], 30000)
    data_pd = pd.DataFrame(data_dict)
    data_pd.to_parquet(f'my_parquet_file_dir/test{i}.parquet')