This pull request adds a random forest algorithm utilizing features from the Sine Coulomb Matrix and MagPie featurization algorithms. Here are the key details of the algorithm:
Sine Coulomb Matrix: Creates structural features based on Coulombic interactions under periodic boundary conditions (suitable for crystalline materials with known structures).
MagPie Features: Composition-weighted features derived from elemental properties such as electronegativity, melting point, and electron affinity.
Both algorithms were executed within the Automatminer v1.0.3.20191111 framework for convenience, although no auto-featurization or AutoML processes were applied.
Data Processing
Data Cleaning: Features with NaN values in more than 1% of samples were dropped. Remaining missing values were imputed using the mean of the training data.
Featurization:
For structure problems: Both Sine Coulomb Matrix and MagPie features were applied.
For problems without structure: Only MagPie features were applied.
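The data cleaning step described above can be sketched roughly as follows. This is a minimal illustration using pandas, not the code Automatminer actually runs: the function name `clean_features`, the toy data, and the choice to measure the NaN fraction on the training split are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def clean_features(train, test, max_nan_fraction=0.01):
    """Drop sparse feature columns, then mean-impute with training statistics.

    Hypothetical helper for illustration; the name and signature are assumptions.
    """
    # Fraction of NaN entries per feature column (assumed: measured on training data).
    nan_fraction = train.isna().mean()
    # Keep only features with at most 1% NaN samples.
    keep = nan_fraction[nan_fraction <= max_nan_fraction].index
    # Means come from the training split only, so no test information leaks in.
    train_means = train[keep].mean()
    return train[keep].fillna(train_means), test[keep].fillna(train_means)


# Toy data: feature "a" has 1 NaN in 200 rows (0.5%), feature "b" has 10 (5%).
rng = np.random.default_rng(0)
train = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
train.loc[0, "a"] = np.nan
train.loc[:9, "b"] = np.nan
test = pd.DataFrame({"a": [np.nan, 1.0], "b": [0.0, 1.0]})

train_clean, test_clean = clean_features(train, test)
```

In this toy case, feature "b" exceeds the 1% threshold and is dropped, while the single gap in "a" is filled with the training mean in both splits.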
Model Details
Random Forest: Uses 500 estimators (trees).
Hyperparameter Tuning: None performed. A large, constant number of trees was used for each fold's model, and the entire training+validation set of that fold served as training data for the random forest.
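As a rough sketch of the model setup above, the per-fold procedure might look like the following with scikit-learn. Several things here are assumptions for illustration only: the synthetic data, the regression setting (`RandomForestRegressor`), the use of `KFold` in place of the benchmark's own fold definitions, and the mean absolute error metric. Only the fixed 500 trees and the absence of tuning come from the description.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

# Synthetic stand-in for a cleaned feature matrix and target (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=120)

fold_mae = []
# KFold is a placeholder for the benchmark's predefined folds (assumption).
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # No hyperparameter search: a fixed forest of 500 trees is fit on the
    # whole training+validation portion of the fold.
    model = RandomForestRegressor(n_estimators=500, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    fold_mae.append(float(np.mean(np.abs(pred - y[test_idx]))))
```

Because nothing is tuned, no separate validation split is needed inside each fold, which is why all non-test data can go into training.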
Additional Information
Raw Data and Example Notebook: Available on the matbench repository.
Included files