sparks-baird / matsci-opt-benchmarks

A collection of benchmarking problems and datasets for testing the performance of advanced optimization algorithms in the field of materials science and chemistry.
https://matsci-opt-benchmarks.readthedocs.io/
MIT License

completed 1.2 notebook, created initial surrogate models, updated graphs #21

Closed · jeet-parikh closed this pull request 1 year ago

coveralls commented 1 year ago

Pull Request Test Coverage Report for Build 4300208892


| Files with Coverage Reduction | New Missed Lines | % |
| --- | ---: | ---: |
| src/matsci_opt_benchmarks/particle_packing/utils/packing_generation.py | 7 | 67.32% |
| src/matsci_opt_benchmarks/particle_packing/core.py | 20 | 0% |
| src/matsci_opt_benchmarks/particle_packing/utils/ax.py | 86 | 0% |
| Total: | 113 | |

Totals Coverage Status

- Change from base Build 4214506172: 0.04%
- Covered Lines: 89
- Relevant Lines: 665

💛 - Coveralls
sgbaird commented 1 year ago

Hey @jeet-parikh, looks good!

A few things to touch up:

Use OneHotEncoder to create new columns when there are multiple options, rather than doing an ordinal encoding.

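For instance, a minimal sketch with scikit-learn's OneHotEncoder (the `method` column and its values are made up for illustration, not the notebook's actual names):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# hypothetical data: one categorical column with multiple options
df = pd.DataFrame({"method": ["fba", "ls", "lsgd"], "mu1": [0.8, 1.2, 1.0]})

enc = OneHotEncoder(handle_unknown="ignore")
onehot = enc.fit_transform(df[["method"]]).toarray()
onehot_df = pd.DataFrame(onehot, columns=enc.get_feature_names_out(["method"]))

# one 0/1 column per category, instead of a single ordinal column
df = pd.concat([df.drop(columns="method"), onehot_df], axis=1)
```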

The objectives shouldn't be mixed into the features; otherwise, the model is being given the answer as one of its input columns. The rank variable should be added for each of the regressors.
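A minimal sketch of that split, assuming a dataframe like the notebook's `sobol_reg_fba` and a hypothetical objective column name:

```python
# hypothetical objective column name; substitute the notebook's actual target(s)
objective_cols = ["packing_fraction"]

feature_cols = [c for c in sobol_reg_fba.columns if c not in objective_cols]
X = sobol_reg_fba[feature_cols]    # model inputs only
y = sobol_reg_fba[objective_cols]  # objectives stay on the target side
```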

See https://github.com/sparks-baird/matsci-opt-benchmarks/blob/main/notebooks/particle_packing/1.2-ri-surrogate.ipynb for an updated way of handling the cross validation (using GroupKFold).


Later, when making the group arrays, you can use something like the following:

```python
# one label per unique (rounded) parameter combination, so that repeat
# runs of the same parameters end up sharing a group
sobol_reg_fba_group = (
    sobol_reg_fba[fba_features]
    .round(6)
    .apply(lambda row: "_".join(row.values.astype(str)), axis=1)
)
```

instead of the longer version from the notebook, so that it's a bit less verbose.

The point of using GroupKFold is to prevent data leakage (in this case, where the repeat runs would get mixed between the training and test sets).
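For reference, a minimal sketch of plugging those group labels into the CV loop (assuming the `X`/`y` split above; the regressor is just a placeholder):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, cross_val_score

# Rows sharing a group label (i.e., repeat runs of the same parameter set)
# always land in the same fold, so repeats can't leak across train/test.
gkf = GroupKFold(n_splits=5)
scores = cross_val_score(
    RandomForestRegressor(random_state=0),
    X,
    y.values.ravel(),
    groups=sobol_reg_fba_group,
    cv=gkf,
)
```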