How was feature importance generated?

The process of generating the feature importance is described in our paper as follows:

Given a large number of features, we conduct feature selection based on tree-based ensembles. We train fully-grown RF[38] and Extra-Trees[52] with 500 trees independently for SBP and DBP. The feature importance is the normalized mean decrease of the Gini impurity achieved across the ensembles. Thus, the features sorted by their importance can be selected by a hyperparameter of the percentage of desired features.

I'm afraid we did not keep the original code. However, the procedure can be reproduced from the description above. I have drafted some code:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.preprocessing import MinMaxScaler

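def importance_RF_ET(X, y, trees=500, random_state=123456):
    # Trees are fully grown by default (max_depth=None).
    ET = ExtraTreesRegressor(n_estimators=trees, n_jobs=-1, random_state=random_state)
    ET.fit(X, y)
    RF = RandomForestRegressor(n_estimators=trees, n_jobs=-1, random_state=random_state)
    RF.fit(X, y)
    # feature_importances_ is the impurity-based importance, normalized to sum to 1.
    # Average it across the two ensembles.
    imp = (RF.feature_importances_ + ET.feature_importances_) / 2
    return imp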

# Load the data
df = ...

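# Separate the targets (SP, DP) from the features
y_SP_f = df.SP
y_DP_f = df.DP
df_X = df.drop(columns=['patient','record','trial','SP','DP'])

# Replace NaN with 0 (and inf with large finite values), then scale each feature to [0, 1]
X = np.nan_to_num(df_X.values)
X = MinMaxScaler().fit_transform(X)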

# Compute feature importance for SP and DP
# Note: Model performance does not matter here. We deliberately let the models overfit to get as many splits as possible.
imp_SP = importance_RF_ET(X, y_SP_f)
imp_DP = importance_RF_ET(X, y_DP_f)

# Sort the feature importance
df_imp_SP = pd.DataFrame({'features':df_X.columns,'importance':imp_SP}).sort_values('importance',ascending=False).reset_index(drop=True)
df_imp_DP = pd.DataFrame({'features':df_X.columns,'importance':imp_DP}).sort_values('importance',ascending=False).reset_index(drop=True)
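
The last step in the description, selecting features by a percentage hyperparameter, could look roughly like the sketch below. This is not the original code: the helper name `select_top_features`, the toy importance table, and the choice to round the number of kept features up (keeping at least one) are all assumptions made for illustration.

```python
import math

import pandas as pd


def select_top_features(df_imp, percentage):
    """Return the names of the top `percentage` (in (0, 1]) of features.

    Hypothetical helper: expects a DataFrame with 'features' and
    'importance' columns, like df_imp_SP / df_imp_DP above.
    """
    if not 0 < percentage <= 1:
        raise ValueError("percentage must be in (0, 1]")
    # Sort again so the helper also works on an unsorted table.
    ranked = df_imp.sort_values('importance', ascending=False)
    # Assumption: round up and always keep at least one feature.
    n_keep = max(1, math.ceil(percentage * len(ranked)))
    return ranked['features'].head(n_keep).tolist()


# Toy importance table (made-up values, for illustration only).
toy_imp = pd.DataFrame({
    'features': ['f_a', 'f_b', 'f_c', 'f_d'],
    'importance': [0.1, 0.4, 0.3, 0.2],
})
selected = select_top_features(toy_imp, 0.5)
print(selected)
```

With the real tables this would be called as `select_top_features(df_imp_SP, p)` and `select_top_features(df_imp_DP, p)`, where `p` is the percentage hyperparameter to tune.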
