inventec-ai-center / bp-benchmark

A Benchmark for Machine-Learning based Non-Invasive Blood Pressure Estimation using Photoplethysmogram
https://doi.org/10.1038/s41597-023-02020-6
MIT License

How was feature importance generated? #8

Closed ms-keliu closed 5 months ago

ms-keliu commented 5 months ago

Could you please share the process used to generate the feature importance for the Feat2lab models, so that it can be reproduced? I can only find the feature importance results under the root folder. Thank you for your attention :)

sergiogvz commented 5 months ago

The process of generating the feature importance is described in our paper as follows:

Given a large number of features, we conduct feature selection based on tree-based ensembles. We train fully-grown RF[38] and Extra-Trees[52] with 500 trees independently for SBP and DBP. The feature importance is the normalized mean decrease of the Gini impurity achieved across the ensembles. Thus, the features sorted by their importance can be selected by a hyperparameter of the percentage of desired features.

I am afraid we did not keep the exact code. However, the procedure can be reproduced from the description above. I have drafted some code:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.preprocessing import MinMaxScaler

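def importance_RF_ET(X, y, trees=500, random_state=123456):
    # Fit fully grown Extra-Trees and Random Forest ensembles (max_depth defaults to None).
    ET = ExtraTreesRegressor(n_estimators=trees, n_jobs=-1, random_state=random_state)
    ET.fit(X, y)
    RF = RandomForestRegressor(n_estimators=trees, n_jobs=-1, random_state=random_state)
    RF.fit(X, y)
    # scikit-learn normalizes each feature_importances_ vector to sum to 1,
    # so the mean of the two vectors is normalized as well.
    imp = (RF.feature_importances_ + ET.feature_importances_) / 2
    return imp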

# Load the data
df = ...

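# Separate the targets (SP, DP) from the features and drop the identifier columns
y_SP_f = df.SP
y_DP_f = df.DP
df_X = df.drop(columns=['patient','record','trial','SP','DP'])
# Replace NaNs with 0 (and infinities with large finite values), then scale each feature to [0, 1]
X = np.nan_to_num(df_X.values)
X = MinMaxScaler().fit_transform(X)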

# Compute feature importance for SP and DP
# Note: Model performance does not matter here. We deliberately let the trees
# grow fully (and overfit) so that they make as many splits as possible.
imp_SP = importance_RF_ET(X, y_SP_f)
imp_DP = importance_RF_ET(X, y_DP_f)

# Sort the feature importance
df_imp_SP = (
    pd.DataFrame({'features': df_X.columns, 'importance': imp_SP})
    .sort_values('importance', ascending=False)
    .reset_index(drop=True)
)
df_imp_DP = (
    pd.DataFrame({'features': df_X.columns, 'importance': imp_DP})
    .sort_values('importance', ascending=False)
    .reset_index(drop=True)
)
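
The quoted description ends with selecting features by a percentage hyperparameter, which the draft above does not cover. A minimal sketch of that last step could look like the following. It assumes a ranking table with `features` and `importance` columns, like `df_imp_SP` and `df_imp_DP`; the function name `select_top_features`, the parameter `percentage`, the rounding rule, and the toy values are all hypothetical and not taken from the repository.

```python
import numpy as np
import pandas as pd


def select_top_features(df_imp, percentage):
    """Return the names of the top `percentage` (0, 100] of features.

    `df_imp` is assumed to have 'features' and 'importance' columns.
    The function name and the `percentage` parameter are hypothetical.
    """
    if not 0 < percentage <= 100:
        raise ValueError("percentage must be in (0, 100]")
    ranked = df_imp.sort_values("importance", ascending=False)
    # Round up and keep at least one feature (an assumed convention).
    n_keep = max(1, int(np.ceil(len(ranked) * percentage / 100)))
    return ranked["features"].head(n_keep).tolist()


# Toy ranking with made-up importance values, for illustration only.
toy_imp = pd.DataFrame({
    "features": ["f_a", "f_b", "f_c", "f_d"],
    "importance": [0.10, 0.40, 0.20, 0.30],
})
top_half = select_top_features(toy_imp, percentage=50)
print(top_half)  # ['f_b', 'f_d']
```

With the real data, the same call could be applied to `df_imp_SP` and `df_imp_DP` separately, since the importance is computed independently for SBP and DBP.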