stefan-grafberger / mlinspect

Inspect ML Pipelines in Python in the form of a DAG
Apache License 2.0

Best-effort column tracking #50

Open stefan-grafberger opened 3 years ago

stefan-grafberger commented 3 years ago

Description

There are rare cases where it's hard or impossible to trace column names through a pipeline, especially when certain sklearn feature selection transformers are used. That's why we currently fall back to the placeholder column name array whenever we can't guarantee to track the column names correctly. But we should still attempt column-level tracking on a best-effort basis.

For transformers like the OneHotEncoder, which can consume a pandas.DataFrame with multiple columns and output a single numpy.ndarray, we need to pass the NumPy arrays to the inspections in a way that lets them know which parts of the array correspond to which logical columns.
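A minimal sketch of what such a mapping could look like; the helper name and return structure are hypothetical, not existing mlinspect API:

from typing import Dict, Tuple

import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

def one_hot_with_column_slices(df: pd.DataFrame) -> Tuple[np.ndarray, Dict[str, slice]]:
    # Hypothetical helper: fit a OneHotEncoder and record which slice of
    # the dense output corresponds to which logical input column
    encoder = OneHotEncoder()
    encoded = encoder.fit_transform(df).toarray()
    column_slices = {}
    start = 0
    for column, categories in zip(df.columns, encoder.categories_):
        column_slices[column] = slice(start, start + len(categories))
        start += len(categories)
    return encoded, column_slices

encoded, slices = one_hot_with_column_slices(
    pd.DataFrame({'smoker': ['yes', 'no'], 'county': ['a', 'b']}))
print(slices)  # {'smoker': slice(0, 2, None), 'county': slice(2, 4, None)}
# encoded[:, slices['smoker']] is the one-hot block for the logical column 'smoker'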

There are some performance considerations when splitting the NumPy arrays into multiple columns from an inspection perspective: we either need to extend the schema information, or add logic to the InspectionInputRow classes and use functions that return partial NumPy views of the original array. The second solution is likely preferable, but we would need to measure its performance overhead.
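A sketch of the core idea behind the second solution, with illustrative names. Basic NumPy slicing returns views into the original buffer, so no data is copied, but each access still pays some Python-level overhead, which is what we'd have to measure:

import numpy as np

# One row as it would arrive at an inspection: two one-hot blocks and one
# numeric column, concatenated into a single array
row = np.array([0.0, 1.0, 1.0, 0.0, 42.0])
column_slices = {'smoker': slice(0, 2), 'county': slice(2, 4), 'age': slice(4, 5)}

def column_view(row, name):
    # Basic slicing returns a view, not a copy
    return row[column_slices[name]]

view = column_view(row, 'smoker')
assert view.base is row  # shares memory with the original array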

Having the correct column names is definitely useful. There are rare cases where it's almost impossible to do this tracking properly through different transformers, e.g., in this example:

from sklearn.feature_selection import VarianceThreshold
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

X = pd.DataFrame.from_dict({
    'A': [0.87, -1.34, 0.31, 1.92],
    'B': [-1.34, -0.48, -2.55, 0.65],
    'C': [-1.34, -0.48, -2.55, 0.65],
    'D': [0, 0, 0, 0],
})

# Drop low-variance features, then standardize whatever columns survive
f_select = VarianceThreshold(threshold=(.8 * (1 - .8)))
standard_scaler = StandardScaler()
pipeline = Pipeline([("f_select", f_select), ("scaler", standard_scaler)])
X = ColumnTransformer([
    ("obscure_example", pipeline, ['B', 'C', 'D'])
]).fit_transform(X)
print(X)

With feature selection or dimensionality reduction transformers, tracking the column names becomes very difficult: in the snippet above, the printed array has two columns, and nothing in the sklearn output tells us they correspond to 'B' and 'C'. So I don't think we can always guarantee to provide the correct column names through different transformers.
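For pure feature selectors, a best-effort mapping is still recoverable: VarianceThreshold, for instance, exposes get_support(). A self-contained sketch with the data from above (for dimensionality reduction like PCA, where every output mixes all inputs, no such mapping exists):

import pandas as pd
from sklearn.feature_selection import VarianceThreshold

X = pd.DataFrame({'B': [-1.34, -0.48, -2.55, 0.65],
                  'C': [-1.34, -0.48, -2.55, 0.65],
                  'D': [0, 0, 0, 0]})
f_select = VarianceThreshold(threshold=(.8 * (1 - .8)))
f_select.fit(X)
# get_support maps output positions back to input columns
surviving = [column for column, kept in zip(X.columns, f_select.get_support()) if kept]
print(surviving)  # ['B', 'C'] -- 'D' is dropped because its variance is 0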

For transformers like the OneHotEncoder, it's possible to track which values get transformed to which one-hot vector, see, e.g., this part from before the rework. But there might be transformers or other operations where we lose this column-level tracking, e.g., if we want to support apply/map operations with user-defined functions. If a user-defined function returns a NumPy array, we need some fallback like the current array placeholder.
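A sketch of that value-to-vector tracking with plain sklearn; this is illustrative, not the removed mlinspect code:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

values = np.array([['yes'], ['no'], ['yes']])
encoder = OneHotEncoder()
one_hot = encoder.fit_transform(values).toarray()
# Each input value can be associated with the one-hot vector it became
value_to_vector = {value: vector for value, vector in zip(values.ravel(), one_hot)}
print(value_to_vector['yes'])  # [0. 1.]

For a user-defined function that returns a plain ndarray, there is nothing comparable to categories_ to derive names from, hence the need for the fallback.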

stefan-grafberger commented 3 years ago

Because of code snippets like the one above, we also updated the DAG a bit. We want to avoid the DAG looking different when only the data flowing through a pipeline changes, but not the code. That's why we no longer duplicate transformers used as arguments of a ColumnTransformer in the DAG like we did previously: back then, we created a copy of each transformer per column, with each copy only seeing the data of that particular column. The old DAG can be found here, the new one here.

As a result, column-level tracking becomes a bit more important. Before, the MaterializeFirstOutputRows inspection was able to capture this information in the healthcare example pipeline:

[Image: Healthcare example before control flow rework]

Now, after these changes, it looks like this:

[Image: Healthcare example after control flow rework]

This is why we should work on column-level tracking as one of the next topics.
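For reference, a sketch of how such an inspection is run; this assumes the PipelineInspector API as shown in the project README, and the pipeline path is illustrative:

from mlinspect import PipelineInspector
from mlinspect.inspections import MaterializeFirstOutputRows

inspector_result = PipelineInspector \
    .on_pipeline_from_py_file("example_pipelines/healthcare/healthcare.py") \
    .add_required_inspection(MaterializeFirstOutputRows(5)) \
    .execute()
# Inspection results are annotated per DAG node
materialized_rows = inspector_result.dag_node_to_inspection_results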