ludwig-ai / ludwig

Low-code framework for building custom LLMs, neural networks, and other AI models
http://ludwig.ai
Apache License 2.0

Feature preprocessing support - remove outliers and collinear features #3078

Open Jeffwan opened 1 year ago

Jeffwan commented 1 year ago

Is your feature request related to a problem? Please describe. I checked the number features preprocessing page https://ludwig.ai/latest/configuration/features/number_features/ and could not find the following features.

Describe the use case Removing outliers can improve the accuracy of the model. It is also a common feature engineering step and should be supported. Removing collinear features improves model explainability, since each feature's contribution is then attributed correctly.

Describe the solution you'd like I expect these options could be exposed in the configuration so users can enable and tune them.

Describe alternatives you've considered N/A

Additional context N/A

tgaddair commented 1 year ago

Hey @Jeffwan, thanks for these suggestions! I agree with both of them.

For outlier removal, are you imagining that we would essentially treat them as a "missing value" that gets replaced with the mean of the dataset, etc.? Or would you want to drop the entire row from the training data if it has outliers?

w4nderlust commented 1 year ago

I think this could be implemented as a preprocessing strategy, like we do for missing values, so both replacing with mean or dropping could be valid options for the user to choose.

An open question is whether one wants to do it at the individual feature level or not. For instance, if a dataset has 100 features and only one of these features has values that are considered outliers (for instance, outside 99% of the probability mass), would we want to drop the datapoint? Or would a certain percentage of the features need to contain outliers before dropping the datapoint?

w4nderlust commented 1 year ago

@jimthompson5802 another potential option as per our previous conversation about what to work on next

tgaddair commented 1 year ago

@Jeffwan for the collinearity part, I was imagining this could be done as part of the AutoML / type inference part of Ludwig. Is that what you were thinking as well?

One way we might try doing this would be something like:

  1. Calculate VIF score for every feature in the dataset.
  2. Remove feature with highest VIF > 10
  3. Repeat (1) with remaining features until there are no features with VIF > 10

We can also show the VIF computed as part of the result returned in the DatasetInfo.

Does that sound reasonable to you?
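
For reference, a minimal sketch of step (1) using statsmodels (the helper name compute_vif_scores and the candidate filter are illustrative; the full iterative prototype appears further down in this thread):

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def compute_vif_scores(df: pd.DataFrame) -> dict:
    # one VIF score per column of a numeric-only DataFrame
    X = df.to_numpy()
    return {col: variance_inflation_factor(X, i) for i, col in enumerate(df.columns)}

# hypothetical usage on a DataFrame `numeric_df` containing only number features:
# scores = compute_vif_scores(numeric_df)
# candidates = [col for col, vif in scores.items() if vif > 10]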

tgaddair commented 1 year ago

For outliers, I'm imagining the config API could look something like:

preprocessing:
  # defaults to null which means "use missing value strategy", can override with any missing value strategy
  outlier_strategy: null  

  # defaults to 3 standard deviations from the mean, can be set to null which means don't replace outliers
  outlier_threshold: 3.0 

Open question whether we'd want to enable outlier removal by default or put it on the user to enable it.
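
As a rough illustration of the proposed outlier_threshold semantics (a z-score cutoff; this is only a sketch, not the eventual Ludwig implementation, and the column name is a placeholder), flagging values more than 3 standard deviations from the column mean could look like:

import numpy as np
import pandas as pd

def outlier_mask(column: pd.Series, threshold: float = 3.0) -> pd.Series:
    # True where a value lies more than `threshold` standard deviations from the mean
    z_scores = (column - column.mean()) / column.std()
    return z_scores.abs() > threshold

# hypothetical usage: mark outliers as missing so a missing value strategy handles them
# df["price"] = df["price"].mask(outlier_mask(df["price"]), np.nan)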

tgaddair commented 1 year ago

Put together #3080 for outlier replacement, will test it out in a bit.

tgaddair commented 1 year ago

Hey @Jeffwan, I spent some time testing #3080 and it seems to be working as expected. Please take a look and let me know if it addresses the outlier handling scenario as you expect.

jimthompson5802 commented 1 year ago

@w4nderlust @tgaddair I started looking at the collinear aspect of this request.

One way we might try doing this would be something like:

  1. Calculate VIF score for every feature in the dataset.
  2. Remove feature with highest VIF > 10
  3. Repeat (1) with remaining features until there are no features with VIF > 10

We can also show the VIF computed as part of the result returned in the DatasetInfo.

Given what you did for outliers, I'm assuming collinearity should follow a similar approach. I'm thinking something like this:

preprocessing:
  collinear_elimination_strategy: False | True | <float_value>

where

  • False: no collinear analysis (Default)
  • True: perform collinear analysis using a default threshold
  • <float_value>: perform collinear analysis and disable any numeric input feature that exceeds the user-specified value

An alternative specification could be

preprocessing:
  collinear_elimination_strategy: None | <float_value>

where

  • None: no collinear analysis (Default)
  • <float_value>: perform collinear analysis and disable any numeric input feature that exceeds the user-specified value

Looks like we can use the statsmodels.stats.outliers_influence.variance_inflation_factor() function to do the VIF computation.

Did you want me to take a stab at this?

Jeffwan commented 1 year ago

Thanks everyone for the quick reply!

@tgaddair

For outlier removal, are you imagining that we would essentially treat them as a "missing value" that gets replaced with the mean of the dataset, etc.? Or would you want to drop the entire row from the training data if it has outliers?

I think we can start with the easier actions and expand the strategies later. For our cases, dropping the row or replacing with the mean/average would both work.

for the colinearity part, I was imagining this could be done as part of the AutoML / type inference part of Ludwig. Is that what you were thinking as well?

Yes, that makes sense.

Jeffwan commented 1 year ago

Hey @Jeffwan, I spent some time testing #3080 and it seems to be working as expected. Please take a look and let me know if it addresses the outlier handling scenario as you expect.

Great. I will check it out.

Jeffwan commented 1 year ago

@jimthompson5802 I personally think this interface makes more sense. The first one mixes a boolean and a numeric threshold together, which may add complexity on the configuration side; for example, true implicitly applies the default threshold.

An alternative specification could be

preprocessing:
  collinear_elimination_strategy: None | <float_value>

where

  • None: no collinear analysis (Default)
  • <float_value>: perform collinear analysis and disable any numeric input feature that exceeds the user-specified value

Jeffwan commented 1 year ago

@w4nderlust

An open question is whether one wants to do it at the individual feature level or not. For instance, if a dataset has 100 features and only one of these features has values that are considered outliers (for instance, outside 99% of the probability mass), would we want to drop the datapoint? Or would a certain percentage of the features need to contain outliers before dropping the datapoint?

Yeah, this does happen sometimes. I think that's also why lots of people don't like the drop action. At a minimum, multiple strategies could be offered so users can make the decision based on the characteristics of their datasets.

Jeffwan commented 1 year ago

One thing I'd also like to check: preprocessing is currently configured at the feature level, and that granularity makes sense. If a user wants to apply the missing_value_strategy to all numeric features, is there a way to do that? If not, do you think this is a valid request? The current way works for us; I am just curious whether other community users have raised a similar question.

w4nderlust commented 1 year ago

One thing I'd also like to check: preprocessing is currently configured at the feature level, and that granularity makes sense. If a user wants to apply the missing_value_strategy to all numeric features, is there a way to do that? If not, do you think this is a valid request? The current way works for us; I am just curious whether other community users have raised a similar question.

That's a great question. You actually can work at the type level, specifying a behavior that is applied to all the features of a certain type.

So you can specify for instance:

defaults:
  number:
    preprocessing:
      missing_value_strategy: drop

You also have the flexibility to override it. So if you have 10 number features and you want to drop missing values on 9 of them and fill with a constant for the remaining one, you can specify the fill-with-constant behavior for that feature in addition to the defaults section; the feature-specific setting overrides the default one.
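
For example, a sketch of that override (the feature names age and income are made up, and the exact strategy names, such as drop_row and fill_with_const, should be checked against the linked docs):

defaults:
  number:
    preprocessing:
      missing_value_strategy: drop_row        # applied to every number feature

input_features:
  - name: age
    type: number                              # inherits drop_row from defaults
  - name: income
    type: number
    preprocessing:
      missing_value_strategy: fill_with_const # feature-level override
      fill_value: 0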

Here are more detailed docs: https://ludwig.ai/latest/configuration/defaults/

@arnavgarg1 can give you more details.

jimthompson5802 commented 1 year ago

@w4nderlust @tgaddair I have a working prototype based on a pandas dataframe for removing collinear numeric features. Steps in the prototype are

  1. create pandas data frame with sklearn make_regression()
  2. create collinear features that are linear combinations of features created in step 1
    # create collinear features using linear combinations of original columns
    df_X.loc[:, "col_6"] = -3.0 * df_X.loc[:, "col_1"] + np.random.normal(0, 1, 10000)
    df_X.loc[:, "col_7"] = -4.0 * df_X.loc[:, "col_5"] + 5 * df_X.loc[:, "col_6"] + np.random.normal(0, 1, 10000)
    df_X.loc[:, "col_8"] = 10.0 * df_X.loc[:, "col_2"] + 3 * df_X.loc[:, "col_3"] + np.random.normal(0, 1, 10000)
    df_X.loc[:, "col_9"] = 5.0 * df_X.loc[:, "col_4"] + np.random.normal(0, 1, 10000)
  3. perform the steps @tgaddair described in this post
  4. Show dataframe with collinear features removed.

Right now I'm thinking the best place to add this logic is after all preprocessing has occurred, i.e., missing values have been addressed.

If the following example seems reasonable, I'll start a PR.

Here is a sample run of the prototype. The added collinear features are col_6, col_7, col_8 and col_9. With the exception of col_6, the added columns were removed. In the case of col_6, its related column col_1 was removed. I believe this is acceptable because the columns remaining are not collinear.

HEAD OF DATAFRAME ORIGINAL FEATURES:
      col_0     col_1     col_2     col_3     col_4     col_5
0  1.492266  0.380105 -1.485707 -0.381965  1.198048  0.431193
1 -0.360056 -0.236285  1.122365 -0.036220 -1.483654  0.532064
2  0.457844 -1.576987  1.782573 -0.058940 -1.065687  1.312847
3 -0.430079 -0.751190 -0.866446 -0.697851 -0.240321 -0.775289
4  0.407790  1.280922  1.485749 -1.194632  1.343890  0.264181

HEAD OF DATAFRAME WITH COLLINEAR FEATURES:
      col_0     col_1     col_2     col_3     col_4     col_5     col_6      col_7      col_8     col_9
0  1.492266  0.380105 -1.485707 -0.381965  1.198048  0.431193 -1.347910  -8.551905 -17.442840  6.367780
1 -0.360056 -0.236285  1.122365 -0.036220 -1.483654  0.532064  0.277738  -1.563001  10.580640 -7.541804
2  0.457844 -1.576987  1.782573 -0.058940 -1.065687  1.312847  5.972945  25.947229  16.510529 -5.329216
3 -0.430079 -0.751190 -0.866446 -0.697851 -0.240321 -0.775289  1.475014   9.736575 -10.712485 -2.418681
4  0.407790  1.280922  1.485749 -1.194632  1.343890  0.264181 -3.440817 -17.826868  11.057215  4.461193

Dropping collinear features...

vif: [1.0012512202968118, 10.198363498400512, 103.11155362341391, 10.093692217851554, 25.61925150270402, 16.933229184145596, 256.0398401734716, 261.930481103284, 112.56927143735385, 25.619454781343062]
dropping column col_7
remaining columns: ['col_0', 'col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6', 'col_8', 'col_9']

vif: [1.0009083395523444, 10.198130137591416, 103.0908484767733, 10.092532269457786, 25.616064153244235, 1.000366387550446, 10.195992927849973, 112.54859397937996, 25.616414667154395]
dropping column col_8
remaining columns: ['col_0', 'col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6', 'col_9']

vif: [1.0007214642714424, 10.197063171734436, 1.000445087205531, 1.0006454130211486, 25.61434967823435, 1.0003556305369556, 10.195872064183774, 25.61577622504558]
dropping column col_9
remaining columns: ['col_0', 'col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6']

vif: [1.0005634361551237, 10.196539888313412, 1.000408864114451, 1.000515191160002, 1.000115316297773, 1.0003200845971914, 10.195208707123264]
dropping column col_1
remaining columns: ['col_0', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6']

vif: [1.0005488081388667, 1.0003561581146232, 1.0004327414316316, 1.0001143090802804, 1.0003154489825534, 1.0006916945916395]
Remaining variables: ['col_0', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6']
completed dropping collinear features...
non-collinear features: ['col_0', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6']

HEAD OF DATAFRAME WITH COLLINEAR FEATURES REMOVED:
      col_0     col_2     col_3     col_4     col_5     col_6
0  1.492266 -1.485707 -0.381965  1.198048  0.431193 -1.347910
1 -0.360056  1.122365 -0.036220 -1.483654  0.532064  0.277738
2  0.457844  1.782573 -0.058940 -1.065687  1.312847  5.972945
3 -0.430079 -0.866446 -0.697851 -0.240321 -0.775289  1.475014
4  0.407790  1.485749 -1.194632  1.343890  0.264181 -3.440817

Here is the code for the prototype if interested.

import numpy as np
import pandas as pd

from sklearn.datasets import make_regression
from statsmodels.stats.outliers_influence import variance_inflation_factor

# function to drop collinear features
# dataframe column names are proxy for input feature names
def drop_collinear_features(X, thresh=5.0):
    # setup array to perform VIF calculation
    columns = X.columns.to_list()
    X_arr = X.loc[:, columns].values
    array_indices = list(range(X_arr.shape[1]))

    # loop until all collinear features are removed
    found_all_collinear = False
    while not found_all_collinear:
        # compute VIF for each feature
        vif = [variance_inflation_factor(X_arr, ix) for ix in array_indices]
        print(f"\nvif: {vif}")

        # if VIF score for a feature is above threshold, drop feature
        if max(vif) > thresh:
            # get index of feature with highest VIF score to drop
            maxloc = vif.index(max(vif))
            print(f'dropping column {columns[maxloc]}')

            # drop feature from array and column list
            del columns[maxloc]

            print(f"remaining columns: {columns}")
            # update array and indices
            X_arr = X.loc[:, columns].values
            array_indices = list(range(X_arr.shape[1]))
        else:
            # no collinear features found, exit loop
            found_all_collinear = True

    # all done, return non-collinear features
    print(f'Remaining variables: {columns}')
    return columns

# set pandas display options: width 120 and show all columns
pd.set_option('display.width', 120)
pd.set_option('display.max_columns', 100)

if __name__ == "__main__":
    # define synthetic regression dataset with 6 input features and 1 output feature
    X, y = make_regression(
        n_samples=10000, n_features=6, n_informative=6, n_targets=1,
        random_state=1
    )
    df_X = pd.DataFrame(X)
    df_X.columns = [f"col_{i}" for i in range(6)]

    # show head of dataframe
    print(f"\nHEAD OF DATAFRAME ORIGINAL FEATURES:\n{df_X.head()}")

    # create collinear features using linear combinations of original columns
    df_X.loc[:, "col_6"] = -3.0 * df_X.loc[:, "col_1"] + np.random.normal(0, 1, 10000)
    df_X.loc[:, "col_7"] = -4.0 * df_X.loc[:, "col_5"] + 5 * df_X.loc[:, "col_6"] + np.random.normal(0, 1, 10000)
    df_X.loc[:, "col_8"] = 10.0 * df_X.loc[:, "col_2"] + 3 * df_X.loc[:, "col_3"] + np.random.normal(0, 1, 10000)
    df_X.loc[:, "col_9"] = 5.0 * df_X.loc[:, "col_4"] + np.random.normal(0, 1, 10000)

    # show head of dataframe
    print(f"\nHEAD OF DATAFRAME WITH COLLINEAR FEATURES:\n{df_X.head()}")

    # drop collinear features
    print("\nDropping collinear features...")
    non_collinear_features = drop_collinear_features(df_X)
    print("completed dropping collinear features...")
    print(f"non-collinear features: {non_collinear_features}")

    # show head of dataframe
    print(f"\nHEAD OF DATAFRAME WITH COLLINEAR FEATURES REMOVED:\n{df_X[non_collinear_features].head()}")

jimthompson5802 commented 1 year ago

@Jeffwan Do you have a non-proprietary data set that you can share? If possible, I'd like to test out the prototype on some real data. Something with a few thousand rows.

Since I posted the prototype code and it only requires the pandas and statsmodels packages, feel free to try it on one of your data sets if you can't share data outside of your org.
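
If you want to try it on your own data, a minimal usage sketch (assumes the prototype above is available in the same script; my_dataset.csv and the threshold of 10.0 are placeholders):

import pandas as pd

# load your data and keep only the numeric columns, dropping rows with missing values
df = pd.read_csv("my_dataset.csv")
numeric_df = df.select_dtypes(include="number").dropna()

# reuse the prototype's function with the VIF threshold of your choice
kept_columns = drop_collinear_features(numeric_df, thresh=10.0)
print(f"non-collinear features: {kept_columns}")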

tgaddair commented 1 year ago

#3080 has landed and will be included in the v0.7 release.

w4nderlust commented 1 year ago

@jimthompson5802 this looks great! A few considerations for implementing it in Ludwig, maybe obvious but hey :) :

  • The process should check the type of column it's considering, as the concept of collinearity applies only to numerical features
  • I wonder if running the vif computation only once and using those scores is sufficient instead of recomputing it at every iteration
  • if the previous comment is true, from a code structure point of view, I would implement it as: a function that computes all the vifs for a dataset, a function that given a dataset returns a set of true and false for each of the columns of that dataset if they pass the threshold, and a function that given a dataset returns a dataset without the filtered columns. This way each of these functions could be generally reusable instead of monolithic. Wdyt?

jimthompson5802 commented 1 year ago

@w4nderlust thank you for the comments.

The process should check the type of column it's considering, as the concept of collinearity applies only to numerical features

This makes sense. I'll limit this to only numerical features.

I wonder if running the vif computation only once and using those scores is sufficient instead of recomputing it at every iteration

I was thinking of computing it only once. It is not clear to me why collinearity among the numeric only features would change as training progresses. Is there a situation I'm not considering?

if the previous comment is true, from a code structure point of view, I would implement it as: a function that computes all the vifs for a dataset, a function that given a dataset returns a set of true and false for each of the columns of that dataset if they pass the threshold, and a function that given a dataset returns a dataset without the filtered columns. This way each of these functions could be generally reusable instead of monolithic. Wdyt?

re: "returns a set of true and false for each column of that dataset" I was planning to implement a function that calls the disable() method for the numeric input features configuration instead of returning true/false for each numeric feature. Of course, this is my first time using this new capability so I'm not sure if there are limitations on when and where I can call disable() method for the input feature configuration. I'm reviewing the code base now to see if this will work.

If I can't use disable(), then returning true/false is the alternative. Then the question is which component will be responsible for leaving out the collinear numeric features during training?

w4nderlust commented 1 year ago

I wonder if running the vif computation only once and using those scores is sufficient instead of recomputing it at every iteration

I was thinking of computing it only once. It is not clear to me why collinearity among the numeric only features would change as training progresses. Is there a situation I'm not considering?

Sorry, I was not clear enough. When I said iteration I meant this loop in your code (not a training iteration of a model):

while not found_all_collinear:
        # compute VIF for each feature
        vif = [variance_inflation_factor(X_arr, ix) for ix in array_indices]

So what I meant is that we could do the vif computation outside the loop, only once, and use those scores to decide which features to remove instead of recomputing it every time we remove a feature from the lot. Does the set of features that get removed change? Does recomputing the vif scores after you remove a feature make a difference?

if the previous comment is true, from a code structure point of view, I would implement it as: a function that computes all the vifs for a dataset, a function that given a dataset returns a set of true and false for each of the columns of that dataset if they pass the threshold, and a function that given a dataset returns a dataset without the filtered columns. This way each of these functions could be generally reusable instead of monolithic. Wdyt?

re: "returns a set of true and false for each column of that dataset" I was planning to implement a function that calls the disable() method for the numeric input features configuration instead of returning true/false for each numeric feature. Of course, this is my first time using this new capability so I'm not sure if there are limitations on when and where I can call disable() method for the input feature configuration. I'm reviewing the code base now to see if this will work.

If I can't use disable(), then returning true/false is the alternative. Then the question is which component will be responsible for leaving out the collinear numeric features during training?

A disable(df) -> df function is totally fine, but internally I imagine it calls an above_vif_threshold(df, threshold) -> List[bool], which in turn calls a compute_vifs(df) -> List[float]. Does this make it more clear? Obviously, this assumes you can compute all the vifs at once like I was suggesting in the previous point; if the vifs need to be recomputed every time, we may need a different interface.
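
A sketch of that decomposition (function names follow the comment above; the bodies and the default threshold are assumptions, and note that the discussion below concludes the elimination should actually happen one feature at a time rather than in a single pass):

from typing import List

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def compute_vifs(df: pd.DataFrame) -> List[float]:
    # one VIF score per column of a numeric-only DataFrame
    X = df.to_numpy()
    return [variance_inflation_factor(X, i) for i in range(X.shape[1])]

def above_vif_threshold(df: pd.DataFrame, threshold: float = 10.0) -> List[bool]:
    # True for each column whose VIF exceeds the threshold
    return [vif > threshold for vif in compute_vifs(df)]

def disable(df: pd.DataFrame, threshold: float = 10.0) -> pd.DataFrame:
    # return a view of the DataFrame without the flagged columns
    flagged = above_vif_threshold(df, threshold)
    keep = [col for col, drop in zip(df.columns, flagged) if not drop]
    return df[keep]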

jimthompson5802 commented 1 year ago

@w4nderlust Got it...thank you for the clarification.

re:

So what I meant is that we could do that vif computation outside the loop, only once, and use the scores to assess which features to remove instead of recomputing it every time we remove a feature from the lot. Does the result of what features are removed change? does recomputing vif scores after you remove a feature make a difference?

the prototype is based on the solution @tgaddair outlined in this post.

I think the solution outlined above makes sense. Consider this example: if x_1 is a linear combination of two other features, e.g., x_1 = 2*x_2 - 4*x_3, then all three vif scores will be "large values" exceeding the "threshold". We can't eliminate all three, because then we would eliminate a useful predictor. In this example, we may need to eliminate one or two of the three: one if the other two are not collinear with each other, two if two of the three are also collinear.

You can see this in the prototype output. I've modified the prototype to show the evolution of each feature's VIF score as collinear features are eliminated. The equivalent of the above example involves col_2, col_3 and col_8. Initially all three VIF scores are well above the threshold. Once col_8 is eliminated, the VIF scores for col_2 and col_3 drop to about 1.00, under the threshold.

This is how the synthetic collinear features are created:

    # create collinear features using linear combinations of original columns
    df_X.loc[:, "col_6"] = -3.0 * df_X.loc[:, "col_1"] + np.random.normal(0, 1, 10000)
    df_X.loc[:, "col_7"] = -4.0 * df_X.loc[:, "col_5"] + 5 * df_X.loc[:, "col_6"] + np.random.normal(0, 1, 10000)
    df_X.loc[:, "col_8"] = 10.0 * df_X.loc[:, "col_2"] + 3 * df_X.loc[:, "col_3"] + np.random.normal(0, 1, 10000)
    df_X.loc[:, "col_9"] = 5.0 * df_X.loc[:, "col_4"] + np.random.normal(0, 1, 10000)

Trace of eliminating collinear features:

Dropping collinear features...

vif values: [('col_0', '1.00068'), ('col_1', '9.98991'), ('col_2', '100.85145'), ('col_3', '9.87088'), ('col_4', '25.83659'), ('col_5', '17.16428'), ('col_6', '259.64115'), ('col_7', '265.95794'), ('col_8', '110.10045'), ('col_9', '25.84066')]
dropping column col_7
remaining columns: ['col_0', 'col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6', 'col_8', 'col_9']

vif values: [('col_0', '1.00067'), ('col_1', '9.98950'), ('col_2', '100.84102'), ('col_3', '9.86889'), ('col_4', '25.83449'), ('col_5', '1.00048'), ('col_6', '9.98640'), ('col_8', '110.08152'), ('col_9', '25.83819')]
dropping column col_8
remaining columns: ['col_0', 'col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6', 'col_9']

vif values: [('col_0', '1.00067'), ('col_1', '9.98826'), ('col_2', '1.00056'), ('col_3', '1.00062'), ('col_4', '25.83094'), ('col_5', '1.00047'), ('col_6', '9.98588'), ('col_9', '25.83275')]
dropping column col_9
remaining columns: ['col_0', 'col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6']

vif values: [('col_0', '1.00058'), ('col_1', '9.98510'), ('col_2', '1.00055'), ('col_3', '1.00050'), ('col_4', '1.00012'), ('col_5', '1.00034'), ('col_6', '9.98409')]
dropping column col_1
remaining columns: ['col_0', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6']

vif values: [('col_0', '1.00058'), ('col_2', '1.00030'), ('col_3', '1.00047'), ('col_4', '1.00011'), ('col_5', '1.00034'), ('col_6', '1.00072')]
Remaining variables: ['col_0', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6']
completed dropping collinear features...
non-collinear features:
['col_0', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6']

jimthompson5802 commented 1 year ago

@tgaddair @w4nderlust Looking over some of the earlier discussions, I believe I may have gone down a different path on how and when to eliminate the collinear features. Initially, I was thinking of the following steps:

Re-reading this post,

... this could be done as part of the AutoML / type inference part of Ludwig. Is that what you were thinking as well?

One way we might try doing this would be something like:

  1. Calculate VIF score for every feature in the dataset.
  2. Remove feature with highest VIF > 10
  3. Repeat (1) with remaining features until there are no features with VIF > 10

We can also show the VIF computed as part of the result returned in the DatasetInfo.

It appears the idea is to have collinear elimination be part of AutoML processing, i.e., use the VIF scores to create the model configuration on the fly with the collinear features eliminated. If this is true, then I'm probably executing the VIF computation and feature elimination at the wrong point in the process.

Let me know how I should think about this.

w4nderlust commented 1 year ago

@jimthompson5802 thanks for the detailed post! I understand it and agree with the conclusion: we should do the elimination one feature at a time. My proposal was wrong. Also, after re-reading your code, it returns the columns and does not edit the actual original dataframe but recomputes the view at each iteration, which is perfect in my mind. Now we can think about the right place to put it. Curious about @tgaddair's opinion too, but I believe that AutoML could be a good place, as I can imagine someone figuring out which features are collinear and deciding to disable them before even obtaining a Ludwig config. Although, if we have a function that is generic enough, we could also have a "remove collinear" parameter in the preprocessing section that performs the same computation and both removes columns from the in-memory df and removes sections from the config as a consequence (after each column is processed and before the model is built). What do you think?

jimthompson5802 commented 1 year ago

@w4nderlust @tgaddair @Jeffwan I just submitted PR https://github.com/ludwig-ai/ludwig/pull/3121 for the collinear detection portion of this issue. Currently the PR is in DRAFT status.

We should move discussion of this topic to the PR.