Jeffwan opened this issue 1 year ago (status: Open)
Hey @Jeffwan, thanks for these suggestions! I agree with both of them.
For outlier removal, are you imagining that we would essentially treat them as a "missing value" that gets replaced with the mean of the dataset, etc.? Or would you want to drop the entire row from the training data if it has outliers?
I think this could be implemented as a preprocessing strategy, like we do for missing values, so both replacing with mean or dropping could be valid options for the user to choose.
An open question is whether one wants to do it at the individual feature level or not. For instance, if a dataset has 100 features and only one of them has values that are considered outliers (for instance, outside 99% of the probability mass), would we want to drop the datapoint? Or would at least a certain percentage of the features need to be outliers before dropping the datapoint?
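To make the feature-level question concrete, here is a minimal sketch of the "certain percentage of features" option. This is not Ludwig code; the names `drop_outlier_rows` and `min_outlier_fraction` are hypothetical, and a simple z-score test stands in for whatever outlier definition is ultimately chosen:

```python
import pandas as pd

def drop_outlier_rows(df, zscore_threshold=3.0, min_outlier_fraction=0.05):
    """Drop rows where at least `min_outlier_fraction` of the numeric
    features are outliers (|z-score| > zscore_threshold)."""
    num = df.select_dtypes(include="number")
    z = (num - num.mean()) / num.std()
    outlier_fraction = (z.abs() > zscore_threshold).mean(axis=1)
    return df[outlier_fraction < min_outlier_fraction]
```

With `min_outlier_fraction` near 0 this behaves like "drop the row if any feature is an outlier"; with larger values a single anomalous feature among 100 would no longer discard the datapoint.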
@jimthompson5802 another potential option as per our previous conversation about what to work on next
@Jeffwan for the collinearity part, I was imagining this could be done as part of the AutoML / type inference part of Ludwig. Is that what you were thinking as well?
One way we might try doing this would be something like:
- Calculate VIF score for every feature in the dataset.
- Remove the feature with the highest VIF > 10.
- Repeat (1) with the remaining features until there are no features with VIF > 10.

We can also show the computed VIF as part of the result returned in the `DatasetInfo`.
Does that sound reasonable to you?
For outliers, I'm imagining the config API could look something like:

```yaml
preprocessing:
  # defaults to null, which means "use missing value strategy"; can override with any missing value strategy
  outlier_strategy: null
  # defaults to 3 standard deviations from the mean; can be set to null, which means don't replace outliers
  outlier_threshold: 3.0
```
Open question whether we'd want to default to enabling outlier removal or put it on the user to enable it.
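For reference, the semantics sketched above (replace values beyond `outlier_threshold` standard deviations, using the existing missing-value machinery) could look roughly like this. The helper name `replace_outliers_with_mean` is hypothetical and shows only the "fill with mean" strategy, with the fill value taken from the non-outlier values:

```python
import pandas as pd

def replace_outliers_with_mean(series, threshold=3.0):
    """Replace values more than `threshold` standard deviations from the
    mean with the mean of the non-outlier values."""
    mean, std = series.mean(), series.std()
    is_outlier = (series - mean).abs() > threshold * std
    fill_value = series[~is_outlier].mean()
    return series.where(~is_outlier, fill_value)
```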
Put together #3080 for outlier replacement, will test it out in a bit.
Hey @Jeffwan, I spent some time testing #3080 and it seems to be working as expected. Please take a look and let me know if it addresses the outlier handling scenario as you expect.
@w4nderlust @tgaddair I started looking at the collinear aspect of this request.
One way we might try doing this would be something like:
1. Calculate the VIF score for every feature in the dataset.
2. Remove the feature with the highest VIF > 10.
3. Repeat (1) with the remaining features until there are no features with VIF > 10.

We can also show the computed VIF as part of the result returned in the `DatasetInfo`.
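For context, the VIF of feature j is 1 / (1 - R_j^2), where R_j^2 is the coefficient of determination from regressing feature j on all the other features. A minimal numpy sketch of that definition (the `compute_vif` name is just for illustration; statsmodels provides an equivalent via `variance_inflation_factor`):

```python
import numpy as np

def compute_vif(X, j):
    """VIF for column j of X: 1 / (1 - R_j^2), where R_j^2 comes from
    regressing column j on the remaining columns plus an intercept."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])  # design matrix with intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)
```

A VIF near 1 means the feature is essentially independent of the rest; a VIF above 10 (the threshold used in the steps above) is a common rule of thumb for problematic collinearity.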
Given what you did for outliers, I'm assuming collinearity should follow a similar approach. I'm thinking something like this:
```yaml
preprocessing:
  collinear_elimination_strategy: False | True | <float_value>
```

where

- `False`: no collinear analysis (default)
- `True`: perform collinear analysis and disable any numeric input features that exceed the default threshold, e.g., 10
- `<float_value>`: perform collinear analysis and disable any numeric input features that exceed the user-specified value

An alternative specification could be

```yaml
preprocessing:
  collinear_elimination_strategy: None | <float_value>
```

where

- `None`: no collinear analysis (default)
- `<float_value>`: perform collinear analysis and disable any numeric input features that exceed the user-specified value

Looks like we can use the `statsmodels.stats.outliers_influence.variance_inflation_factor()` function to do the VIF computation.
Did you want me to take a stab at this?
Thanks everyone for the quick reply!
@tgaddair
For outlier removal, are you imagining that we would essentially treat them as a "missing value" that gets replaced with the mean of the dataset, etc.? Or would you want to drop the entire row from the training data if it has outliers?
I think we can start with the easier actions and expand the strategies from there. For our cases, dropping the row and replacing with the mean both work.
for the collinearity part, I was imagining this could be done as part of the AutoML / type inference part of Ludwig. Is that what you were thinking as well?
Yes, it makes sense.
Hey @Jeffwan, I spent some time testing #3080 and it seems to be working as expected. Please take a look and let me know if it addresses the outlier handling scenario as you expect.
Great. I will check it out.
@jimthompson5802 I personally think this interface makes more sense. The first one mixes boolean and numeric thresholds together, which may bring complexity on the configuration side; for example, `True` implicitly applies the default threshold.
An alternative specification could be

```yaml
preprocessing:
  collinear_elimination_strategy: None | <float_value>
```

where

- `None`: no collinear analysis (default)
- `<float_value>`: perform collinear analysis and disable any numeric input features that exceed the user-specified value
@w4nderlust
An open question is whether one wants to do it at the individual feature level or not. For instance, if a dataset has 100 features and only one of them has values that are considered outliers (for instance, outside 99% of the probability mass), would we want to drop the datapoint? Or would at least a certain percentage of the features need to be outliers before dropping the datapoint?
Yeah, this does happen sometimes. I think that's also why lots of people don't like to take the drop action. At least multiple strategies could be offered, so users can make decisions based on the characteristics of their datasets.
One thing I'd also like to check: preprocessing is now configured at the feature level, and that granularity makes sense. But if a user wants to apply the same `missing_value_strategy` to all numeric features, is there a way to do that? If not, do you think this is a valid request? The current way works for us; I am just curious whether other community users have raised a similar question.
One thing I'd also like to check: preprocessing is now configured at the feature level, and that granularity makes sense. But if a user wants to apply the same `missing_value_strategy` to all numeric features, is there a way to do that? If not, do you think this is a valid request? The current way works for us; I am just curious whether other community users have raised a similar question.
That's a great question. You actually can work at the type level, specifying a behavior that is applied to all the features of a certain type.
So you can specify for instance:
```yaml
defaults:
  number:
    preprocessing:
      missing_value_strategy: drop
```
You also have the flexibility to override it. So if you have 10 number features and you want to drop missing values on 9 of them and fill with a constant for one of them, you can specify the fill-with-constant behavior for that one feature and the drop behavior in the `defaults` section; the feature-specific setting overrides the default one.
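For example, combining a type-level default with a per-feature override might look like the sketch below. The feature names are illustrative, and exact option names should be checked against the defaults docs linked below:

```yaml
defaults:
  number:
    preprocessing:
      missing_value_strategy: drop
input_features:
  - name: num_rooms
    type: number
  - name: price
    type: number
    preprocessing:
      # feature-specific setting overrides the type-level default
      missing_value_strategy: fill_with_const
      fill_value: 0
```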
Here are more detailed docs: https://ludwig.ai/latest/configuration/defaults/
@arnavgarg1 can give you more details.
@w4nderlust @tgaddair I have a working prototype based on a pandas dataframe for removing collinear numeric features. The steps in the prototype are: (1) create a synthetic regression dataset with `make_regression()`; (2) create collinear features using linear combinations of the original columns:

```python
# create collinear features using linear combinations of original columns
df_X.loc[:, "col_6"] = -3.0 * df_X.loc[:, "col_1"] + np.random.normal(0, 1, 10000)
df_X.loc[:, "col_7"] = -4.0 * df_X.loc[:, "col_5"] + 5 * df_X.loc[:, "col_6"] + np.random.normal(0, 1, 10000)
df_X.loc[:, "col_8"] = 10.0 * df_X.loc[:, "col_2"] + 3 * df_X.loc[:, "col_3"] + np.random.normal(0, 1, 10000)
df_X.loc[:, "col_9"] = 5.0 * df_X.loc[:, "col_4"] + np.random.normal(0, 1, 10000)
```

and (3) iteratively compute VIF scores and drop the feature with the highest VIF above the threshold until none remain.
Right now I'm thinking the best place to add this logic is after all preprocessing has occurred, i.e., missing values have been addressed.
If the following example seems reasonable, I'll start a PR.
Here is a sample run of the prototype. The added collinear features are `col_6`, `col_7`, `col_8`, and `col_9`. With the exception of `col_6`, the added columns were removed. In the case of `col_6`, its related column `col_1` was removed instead. I believe this is acceptable because the remaining columns are not collinear.
```
HEAD OF DATAFRAME ORIGINAL FEATURES:
      col_0     col_1     col_2     col_3     col_4     col_5
0  1.492266  0.380105 -1.485707 -0.381965  1.198048  0.431193
1 -0.360056 -0.236285  1.122365 -0.036220 -1.483654  0.532064
2  0.457844 -1.576987  1.782573 -0.058940 -1.065687  1.312847
3 -0.430079 -0.751190 -0.866446 -0.697851 -0.240321 -0.775289
4  0.407790  1.280922  1.485749 -1.194632  1.343890  0.264181

HEAD OF DATAFRAME WITH COLLINEAR FEATURES:
      col_0     col_1     col_2     col_3     col_4     col_5     col_6      col_7      col_8     col_9
0  1.492266  0.380105 -1.485707 -0.381965  1.198048  0.431193 -1.347910  -8.551905 -17.442840  6.367780
1 -0.360056 -0.236285  1.122365 -0.036220 -1.483654  0.532064  0.277738  -1.563001  10.580640 -7.541804
2  0.457844 -1.576987  1.782573 -0.058940 -1.065687  1.312847  5.972945  25.947229  16.510529 -5.329216
3 -0.430079 -0.751190 -0.866446 -0.697851 -0.240321 -0.775289  1.475014   9.736575 -10.712485 -2.418681
4  0.407790  1.280922  1.485749 -1.194632  1.343890  0.264181 -3.440817 -17.826868  11.057215  4.461193

Dropping collinear features...
vif: [1.0012512202968118, 10.198363498400512, 103.11155362341391, 10.093692217851554, 25.61925150270402, 16.933229184145596, 256.0398401734716, 261.930481103284, 112.56927143735385, 25.619454781343062]
dropping column col_7
remaining columns: ['col_0', 'col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6', 'col_8', 'col_9']
vif: [1.0009083395523444, 10.198130137591416, 103.0908484767733, 10.092532269457786, 25.616064153244235, 1.000366387550446, 10.195992927849973, 112.54859397937996, 25.616414667154395]
dropping column col_8
remaining columns: ['col_0', 'col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6', 'col_9']
vif: [1.0007214642714424, 10.197063171734436, 1.000445087205531, 1.0006454130211486, 25.61434967823435, 1.0003556305369556, 10.195872064183774, 25.61577622504558]
dropping column col_9
remaining columns: ['col_0', 'col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6']
vif: [1.0005634361551237, 10.196539888313412, 1.000408864114451, 1.000515191160002, 1.000115316297773, 1.0003200845971914, 10.195208707123264]
dropping column col_1
remaining columns: ['col_0', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6']
vif: [1.0005488081388667, 1.0003561581146232, 1.0004327414316316, 1.0001143090802804, 1.0003154489825534, 1.0006916945916395]
Remaining variables: ['col_0', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6']
completed dropping collinear features...
non-collinear features: ['col_0', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6']

HEAD OF DATAFRAME WITH COLLINEAR FEATURES REMOVED:
      col_0     col_2     col_3     col_4     col_5     col_6
0  1.492266 -1.485707 -0.381965  1.198048  0.431193 -1.347910
1 -0.360056  1.122365 -0.036220 -1.483654  0.532064  0.277738
2  0.457844  1.782573 -0.058940 -1.065687  1.312847  5.972945
3 -0.430079 -0.866446 -0.697851 -0.240321 -0.775289  1.475014
4  0.407790  1.485749 -1.194632  1.343890  0.264181 -3.440817
```
Here is the code for the prototype if interested.
```python
import logging

import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from statsmodels.stats.outliers_influence import variance_inflation_factor


# function to drop collinear features
# dataframe column names are proxy for input feature names
def drop_collinear_features(X, thresh=5.0):
    # setup array to perform VIF calculation
    columns = X.columns.to_list()
    X_arr = X.loc[:, columns].values
    array_indices = list(range(X_arr.shape[1]))

    # loop until all collinear features are removed
    found_all_collinear = False
    while not found_all_collinear:
        # compute VIF for each feature
        vif = [variance_inflation_factor(X_arr, ix) for ix in array_indices]
        print(f"\nvif: {vif}")

        # if the VIF score for a feature is above threshold, drop the feature
        if max(vif) > thresh:
            # get index of feature with highest VIF score to drop
            maxloc = vif.index(max(vif))
            print(f"dropping column {columns[maxloc]}")

            # drop feature from column list and rebuild the array view
            del columns[maxloc]
            print(f"remaining columns: {columns}")
            X_arr = X.loc[:, columns].values
            array_indices = list(range(X_arr.shape[1]))
        else:
            # no collinear features found, exit loop
            found_all_collinear = True

    # all done, return non-collinear features
    print(f"Remaining variables: {columns}")
    return columns


# set pandas display options
pd.set_option("display.width", 120)
pd.set_option("display.max_columns", 100)

if __name__ == "__main__":
    # define synthetic regression dataset with 6 input features and 1 output feature
    X, y = make_regression(
        n_samples=10000, n_features=6, n_informative=6, n_targets=1, random_state=1
    )
    df_X = pd.DataFrame(X)
    df_X.columns = [f"col_{i}" for i in range(6)]

    # show head of dataframe
    print(f"\nHEAD OF DATAFRAME ORIGINAL FEATURES:\n{df_X.head()}")

    # create collinear features using linear combinations of original columns
    df_X.loc[:, "col_6"] = -3.0 * df_X.loc[:, "col_1"] + np.random.normal(0, 1, 10000)
    df_X.loc[:, "col_7"] = -4.0 * df_X.loc[:, "col_5"] + 5 * df_X.loc[:, "col_6"] + np.random.normal(0, 1, 10000)
    df_X.loc[:, "col_8"] = 10.0 * df_X.loc[:, "col_2"] + 3 * df_X.loc[:, "col_3"] + np.random.normal(0, 1, 10000)
    df_X.loc[:, "col_9"] = 5.0 * df_X.loc[:, "col_4"] + np.random.normal(0, 1, 10000)

    # show head of dataframe with the added collinear features
    print(f"\nHEAD OF DATAFRAME WITH COLLINEAR FEATURES:\n{df_X.head()}")

    # drop collinear features
    print("\nDropping collinear features...")
    non_collinear_features = drop_collinear_features(df_X)
    print("completed dropping collinear features...")
    print(f"non-collinear features: {non_collinear_features}")

    # show head of dataframe with collinear features removed
    print(f"\nHEAD OF DATAFRAME WITH COLLINEAR FEATURES REMOVED:\n{df_X[non_collinear_features].head()}")
```
@Jeffwan Do you have a non-proprietary data set that you can share? If possible, I'd like to test out the prototype on some real data. Something with a few thousand rows.
Since I posted the prototype code and it just requires the pandas and statsmodels packages, feel free to try it on one of your data sets if you can't share data outside of your org.
@jimthompson5802 this looks great! A few considerations for implementing it in Ludwig, maybe obvious, but hey :):
@w4nderlust thank you for the comments.
The process should check the type of column it's considering, as the concept of collinearity applies only to numerical features
This makes sense. I'll limit this to only numerical features.
I wonder if running the vif computation only once and using those scores is sufficient instead of recomputing it at every iteration
I was thinking of computing it only once. It is not clear to me why collinearity among the numeric only features would change as training progresses. Is there a situation I'm not considering?
If the previous comment is true, from a code structure point of view, I would implement it as: a function that computes all the VIFs for a dataset; a function that, given a dataset, returns true/false for each column depending on whether it passes the threshold; and a function that, given a dataset, returns the dataset without the filtered columns. This way each of these functions could be generally reusable instead of monolithic. Wdyt?
Re: "returns a set of true and false for each column of that dataset": I was planning to implement a function that calls the `disable()` method on the numeric input feature's configuration instead of returning true/false for each numeric feature. Of course, this is my first time using this new capability, so I'm not sure if there are limitations on when and where I can call the `disable()` method on the input feature configuration. I'm reviewing the code base now to see if this will work.
If I can't use `disable()`, then returning true/false is the alternative. Then the question is what component will be responsible for leaving out the collinear numeric features during training?
I wonder if running the vif computation only once and using those scores is sufficient instead of recomputing it at every iteration
I was thinking of computing it only once. It is not clear to me why collinearity among the numeric only features would change as training progresses. Is there a situation I'm not considering?
Sorry, I was not clear enough. When I said iteration I meant this loop in your code (not a training iteration of a model):

```python
while not found_all_collinear:
    # compute VIF for each feature
    vif = [variance_inflation_factor(X_arr, ix) for ix in array_indices]
```
So what I meant is that we could do the VIF computation outside the loop, only once, and use those scores to decide which features to remove, instead of recomputing it every time we remove a feature from the lot. Does the set of removed features change? Does recomputing VIF scores after you remove a feature make a difference?
If the previous comment is true, from a code structure point of view, I would implement it as: a function that computes all the VIFs for a dataset; a function that, given a dataset, returns true/false for each column depending on whether it passes the threshold; and a function that, given a dataset, returns the dataset without the filtered columns. This way each of these functions could be generally reusable instead of monolithic. Wdyt?
Re: "returns a set of true and false for each column of that dataset": I was planning to implement a function that calls the `disable()` method on the numeric input feature's configuration instead of returning true/false for each numeric feature. Of course, this is my first time using this new capability, so I'm not sure if there are limitations on when and where I can call the `disable()` method on the input feature configuration. I'm reviewing the code base now to see if this will work. If I can't use `disable()`, then returning true/false is the alternative. Then the question is what component will be responsible for leaving out the collinear numeric features during training?
A `disable(df) -> df` function is totally fine, but internally I imagine it calls an `above_vif_threshold(df, threshold) -> List[bool]`, which in turn calls a `compute_vifs(df) -> List[float]`. Does this make it clearer? Obviously, this assumes you can compute all the VIFs at once like I was suggesting in the previous point; if VIFs need to be recomputed every time, we may need a different interface.
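A sketch of that decomposition, using the function names from the comment above. The numpy-based VIF here is a stand-in for statsmodels' `variance_inflation_factor`, and `drop_collinear` plays the role of the `disable(df) -> df` function (it returns a filtered view rather than mutating feature configs):

```python
from typing import List

import numpy as np
import pandas as pd

def compute_vifs(df: pd.DataFrame) -> List[float]:
    """One VIF score per column: 1 / (1 - R^2) from regressing the
    column on the remaining columns plus an intercept."""
    X = df.to_numpy(dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        y = X[:, j]
        A = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        r2 = 1.0 - (y - A @ coef).var() / y.var()
        vifs.append(1.0 / (1.0 - r2))
    return vifs

def above_vif_threshold(df: pd.DataFrame, threshold: float = 10.0) -> List[bool]:
    """True for each column whose VIF exceeds the threshold."""
    return [v > threshold for v in compute_vifs(df)]

def drop_collinear(df: pd.DataFrame, threshold: float = 10.0) -> pd.DataFrame:
    """Iteratively drop the column with the highest VIF until all pass."""
    keep = list(df.columns)
    while True:
        vifs = compute_vifs(df[keep])
        if max(vifs) <= threshold:
            return df[keep]
        # remove the worst offender and recompute on the remaining columns
        del keep[int(np.argmax(vifs))]
```

Note that `drop_collinear` recomputes the scores after each removal, per the discussion above, while `above_vif_threshold` is the one-shot variant.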
@w4nderlust Got it...thank you for the clarification.
re:
So what I meant is that we could do that vif computation outside the loop, only once, and use the scores to assess which features to remove instead of recomputing it every time we remove a feature from the lot. Does the result of what features are removed change? does recomputing vif scores after you remove a feature make a difference?
the prototype is based on the solution @tgaddair outlined in this post.
I think the solution outlined above makes sense. Consider this example: if `x_1` is a linear combination of two other features, e.g., `x_1 = 2*x_2 - 4*x_3`, then all three VIF scores will be "large values" exceeding the threshold. We can't eliminate all three, because we would be eliminating a useful predictor. In this example, we may need to eliminate only 1 or 2 of the three: eliminate 1 if the other two are not collinear themselves, or 2 if two of the three are collinear as well.
You can see this in the prototype output. I've modified the prototype to show the evolution of each feature's VIF score as collinear features are eliminated. The equivalent of the above example involves `col_2`, `col_3`, and `col_8`. Initially all three VIF scores are >> threshold. Once `col_8` is eliminated, the VIF scores for `col_2` and `col_3` drop to about 1.00, under the threshold.
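The `x_1 = 2*x_2 - 4*x_3` case can also be reproduced in isolation. The sketch below uses hypothetical synthetic data and a small numpy least-squares helper instead of statsmodels; it shows that all three VIFs blow past the threshold while removing only `x_1` already brings the rest under it:

```python
import numpy as np

def vif(X, j):
    """1 / (1 - R_j^2) from regressing column j on the others plus an intercept."""
    y = X[:, j]
    A = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r2 = 1.0 - (y - A @ coef).var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
x2 = rng.normal(size=1000)
x3 = rng.normal(size=1000)
x1 = 2 * x2 - 4 * x3 + 0.01 * rng.normal(size=1000)

X = np.column_stack([x1, x2, x3])
all_vifs = [vif(X, j) for j in range(3)]              # every score far above 10

X_reduced = np.column_stack([x2, x3])                 # drop only x_1
reduced_vifs = [vif(X_reduced, j) for j in range(2)]  # both near 1
```

This is why one-shot removal of everything above the threshold would be too aggressive here: it would discard all three columns even though two of them are useful predictors.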
This is how the synthetic collinear features are created:

```python
# create collinear features using linear combinations of original columns
df_X.loc[:, "col_6"] = -3.0 * df_X.loc[:, "col_1"] + np.random.normal(0, 1, 10000)
df_X.loc[:, "col_7"] = -4.0 * df_X.loc[:, "col_5"] + 5 * df_X.loc[:, "col_6"] + np.random.normal(0, 1, 10000)
df_X.loc[:, "col_8"] = 10.0 * df_X.loc[:, "col_2"] + 3 * df_X.loc[:, "col_3"] + np.random.normal(0, 1, 10000)
df_X.loc[:, "col_9"] = 5.0 * df_X.loc[:, "col_4"] + np.random.normal(0, 1, 10000)
```
Trace of eliminating collinear features:

```
Dropping collinear features...
vif values: [('col_0', '1.00068'), ('col_1', '9.98991'), ('col_2', '100.85145'), ('col_3', '9.87088'), ('col_4', '25.83659'), ('col_5', '17.16428'), ('col_6', '259.64115'), ('col_7', '265.95794'), ('col_8', '110.10045'), ('col_9', '25.84066')]
dropping column col_7
remaining columns: ['col_0', 'col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6', 'col_8', 'col_9']
vif values: [('col_0', '1.00067'), ('col_1', '9.98950'), ('col_2', '100.84102'), ('col_3', '9.86889'), ('col_4', '25.83449'), ('col_5', '1.00048'), ('col_6', '9.98640'), ('col_8', '110.08152'), ('col_9', '25.83819')]
dropping column col_8
remaining columns: ['col_0', 'col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6', 'col_9']
vif values: [('col_0', '1.00067'), ('col_1', '9.98826'), ('col_2', '1.00056'), ('col_3', '1.00062'), ('col_4', '25.83094'), ('col_5', '1.00047'), ('col_6', '9.98588'), ('col_9', '25.83275')]
dropping column col_9
remaining columns: ['col_0', 'col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6']
vif values: [('col_0', '1.00058'), ('col_1', '9.98510'), ('col_2', '1.00055'), ('col_3', '1.00050'), ('col_4', '1.00012'), ('col_5', '1.00034'), ('col_6', '9.98409')]
dropping column col_1
remaining columns: ['col_0', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6']
vif values: [('col_0', '1.00058'), ('col_2', '1.00030'), ('col_3', '1.00047'), ('col_4', '1.00011'), ('col_5', '1.00034'), ('col_6', '1.00072')]
Remaining variables: ['col_0', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6']
completed dropping collinear features...
non-collinear features: ['col_0', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6']
```
@tgaddair @w4nderlust Looking over some of the earlier discussions, I believe I may have gone down a different path on how and when to eliminate the collinear features. Initially, I was thinking of these steps: after preprocessing, compute the VIF scores and eliminate each collinear numeric input feature by calling its `disable()` method, which sets the feature's `active` flag to `False`; training would then use only the features whose `active` flag is `True`. Re-reading this post,
... this could be done as part of the AutoML / type inference part of Ludwig. Is that what you were thinking as well?
One way we might try doing this would be something like:
- Calculate VIF score for every feature in the dataset.
- Remove the feature with the highest VIF > 10.
- Repeat (1) with the remaining features until there are no features with VIF > 10.

We can also show the computed VIF as part of the result returned in the `DatasetInfo`.
it appears the idea is to have collinear elimination be part of AutoML processing, i.e., use the VIF scores to create the model configuration file on the fly with the collinear features eliminated. If this is true, then I'm probably executing the VIF computation and feature elimination at the wrong point in the process.
Let me know how I should think about this.
@jimthompson5802 thanks for the detailed post! I understand it and agree with the conclusion: we should do elimination one feature at a time; my proposal was wrong. Also, after re-reading your code, it returns the columns and does not edit the actual original dataframe, but recomputes the view at each iteration, which is perfect in my mind. Now we can think about the right place to put it. Curious about @tgaddair's opinion too, but I believe AutoML could be a good place, as I can imagine someone figuring out which features are collinear and deciding to disable them before even obtaining a Ludwig config. Although, if we have a function that is generic enough, we could also have a "remove collinear" parameter in the preprocessing section that performs the same computation and both removes columns from the in-memory dataframe and removes the corresponding sections from the config (after each column is processed and before the model is built). What do you think?
@w4nderlust @tgaddair @Jeffwan I just submitted PR https://github.com/ludwig-ai/ludwig/pull/3121 for the collinear detection portion of this issue. Currently the PR is in DRAFT status.
We should move discussion of this topic to the PR.
Is your feature request related to a problem? Please describe. I checked the number features preprocessing page https://ludwig.ai/latest/configuration/features/number_features/ and cannot find the following features.
Describe the use case Removing outliers increases the accuracy of the model. It is also a common topic in feature engineering and should be supported. Removing collinear features improves model explainability: we then know the correct feature contributions.
Describe the solution you'd like I expect more options could be exposed in the configuration so users could configure this behavior.
Describe alternatives you've considered N/A
Additional context N/A