A collection of benchmarking problems and datasets for testing the performance of advanced optimization algorithms in materials science and chemistry, covering a variety of "hard" problems that involve one or more of: constraints, heteroskedasticity, multiple objectives, multiple fidelities, and high dimensionality.
There are already materials-science-specific resources related to datasets, surrogate models, and benchmarks out there:
In March 2021, pymatgen reorganized the code into namespace packages, which makes it easier to distribute a collection of related subpackages and modules under an umbrella project. Tangential to that, PyScaffold is a project generator for high-quality Python packages, ready to be shared on PyPI and installable via pip; coincidentally, it also supports namespace package configurations. Our plan for this repository is to host pip-installable packages that allow for loading datasets, surrogate models, and benchmarks for recent manuscripts from the Sparks group. We will look into hosting the datasets via Foundry and using the surrogate model API via Olympus. We will likely do logging to a MongoDB database via MongoDB Atlas and later take a snapshot of the dataset for Foundry. Initially, we plan to use a basic scikit-learn model, such as `RandomForestRegressor` or `GradientBoostingRegressor`, along with cross-validated hyperparameter optimization via `RandomizedSearchCV` or `HalvingRandomSearchCV`, for the surrogate model.
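As a rough sketch of that baseline (the toy data, hyperparameter ranges, and search budget below are illustrative assumptions, not the final implementation):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Toy stand-in for a materials dataset: X are input parameters, y is an objective.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 4))
y = X[:, 0] ** 2 + np.sin(3 * X[:, 1]) + rng.normal(scale=0.1, size=200)

# Cross-validated random search over a small, illustrative hyperparameter space.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10, 20],
    "min_samples_leaf": [1, 2, 4],
}
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,
    cv=5,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Swapping in `GradientBoostingRegressor` or `HalvingRandomSearchCV` would follow the same pattern.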
What will really differentiate the contribution of this repository is the modeling of non-Gaussian, heteroskedastic noise, where the noise can be a complex function of the input parameters. This contrasts with homoskedastic Gaussian noise, where the noise at any given parameterization is both Gaussian and of fixed scale [Wikipedia].
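As a toy illustration of the distinction (the objective and noise models below are made up for demonstration only):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 1, size=1000)
true_objective = np.sin(2 * np.pi * x)  # noiseless "ground truth"

# Homoskedastic Gaussian noise: constant scale everywhere.
y_homo = true_objective + rng.normal(scale=0.05, size=x.size)

# Heteroskedastic, non-Gaussian noise: a skewed (lognormal) disturbance whose
# spread grows with the input parameter x.
scale = 0.02 + 0.2 * x**2
y_hetero = true_objective + (rng.lognormal(mean=0.0, sigma=1.0, size=x.size) - np.exp(0.5)) * scale

# The noise spread is now a function of x rather than a constant.
print(np.std(y_homo - true_objective))               # roughly 0.05 everywhere
print(np.std((y_hetero - true_objective)[x < 0.2]))  # small near x = 0
print(np.std((y_hetero - true_objective)[x > 0.8]))  # much larger near x = 1
```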
The goal is to win a "Turing test" of sorts for the surrogate model, where the model is indistinguishable from the true, underlying objective function.
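One possible way to operationalize such a test (our illustrative choice here, not a method prescribed above) is a two-sample comparison between repeated real measurements at a parameter setting and repeated draws from the surrogate at the same setting; if a statistical test cannot tell the two samples apart, the surrogate "passes" at that point:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)

# Hypothetical repeated noisy measurements at one fixed parameter setting,
# and repeated samples from a stochastic surrogate at the same setting.
real_observations = rng.normal(loc=1.0, scale=0.30, size=50)
surrogate_samples = rng.normal(loc=1.02, scale=0.28, size=50)

# Kolmogorov-Smirnov two-sample test: a large p-value means the test cannot
# distinguish the surrogate's output distribution from the real one.
statistic, p_value = ks_2samp(real_observations, surrogate_samples)
print(f"KS statistic={statistic:.3f}, p-value={p_value:.3f}")
```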
To accomplish this, our plans for implementation include:
```bash
pip install matsci-opt-benchmarks
```

Not implemented yet:

```python
from matsci_opt_benchmarks.core import MatSciOpt

mso = MatSciOpt(dataset="crabnet_hyperparameter")
print(mso.features)  # e.g., the dataset's input parameter names
results = mso.predict(parameterization)  # `parameterization`: dict of parameter values (not shown)
```
## Generate benchmark from existing dataset
```python
import pandas as pd

from matsci_opt_benchmarks.core import Benchmark

# load dataset
dataset_name = "dummy"
dataset_path = f"data/external/{dataset_name}.csv"
dataset = pd.read_csv(dataset_path)

# define inputs/outputs (and parameter types? if so, then Ax-like dict)
parameter_names = [...]
output_names = [...]
X = dataset[parameter_names]
y = dataset[output_names]

# fit the surrogate benchmark to the dataset
bench = Benchmark()
bench.fit(X=X, Y=y)

# sanity check: predict on the first few rows
y_pred = bench.predict(X.head(5))
print(y_pred)
# [[...], [...], ...]

# serialize and share the fitted benchmark
bench.save(fpath=f"models/{dataset_name}")
bench.upload(zenodo_id=zenodo_id)  # `zenodo_id` of an existing Zenodo deposition

# upload to HuggingFace
...
```
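The HuggingFace step above is left as a placeholder. As a hedged sketch of one possible approach (not the repository's actual upload path), a saved model file could be pushed to the HuggingFace Hub with the `huggingface_hub` client; the repo id and file paths below are made-up placeholders:

```python
from huggingface_hub import HfApi

# Hypothetical repo id and file path; replace with real values and authenticate
# beforehand (e.g., via `huggingface-cli login`).
repo_id = "your-username/matsci-opt-benchmarks-dummy"
api = HfApi()
api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)
api.upload_file(
    path_or_fileobj="models/dummy/surrogate.pkl",  # assumed output of `bench.save`
    path_in_repo="surrogate.pkl",
    repo_id=repo_id,
    repo_type="model",
)
```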
In order to set up the necessary environment, create an environment `matsci-opt-benchmarks` from `environment.yml` with the help of conda and activate it:

```bash
conda env create -f environment.yml
conda activate matsci-opt-benchmarks
```

NOTE: The conda environment will have matsci-opt-benchmarks installed in editable mode. Some changes, e.g. in `setup.cfg`, might require you to run `pip install -e .` again.
Optional and needed only once after `git clone`:

1. Install several pre-commit git hooks with:

   ```bash
   pre-commit install
   # You might also want to run `pre-commit autoupdate`
   ```

   and check out the configuration under `.pre-commit-config.yaml`. The `-n, --no-verify` flag of `git commit` can be used to deactivate pre-commit hooks temporarily.

2. Install nbstripout git hooks to remove the output cells of committed notebooks with:

   ```bash
   nbstripout --install --attributes notebooks/.gitattributes
   ```

   This is useful to avoid large diffs due to plots in your notebooks. A simple `nbstripout --uninstall` will revert these changes.

Then take a look into the `scripts` and `notebooks` folders.
## Dependency Management & Reproducibility

1. Always keep your abstract (unpinned) dependencies updated in `environment.yml` and eventually in `setup.cfg` if you want to ship and install your package via `pip` later on.

2. Create concrete dependencies as `environment.lock.yml` for the exact reproduction of your environment with:

   ```bash
   conda env export -n matsci-opt-benchmarks -f environment.lock.yml
   ```

   For multi-OS development, consider using `--no-builds` during the export.

3. Update your current environment with respect to a new `environment.lock.yml` using:

   ```bash
   conda env update -f environment.lock.yml --prune
   ```
## Project Organization

```
├── AUTHORS.md              <- List of developers and maintainers.
├── CHANGELOG.md            <- Changelog to keep track of new features and fixes.
├── CONTRIBUTING.md         <- Guidelines for contributing to this project.
├── Dockerfile              <- Build a docker container with `docker build .`.
├── LICENSE.txt             <- License as chosen on the command-line.
├── README.md               <- The top-level README for developers.
├── configs                 <- Directory for configurations of model & application.
├── data
│   ├── external            <- Data from third party sources.
│   ├── interim             <- Intermediate data that has been transformed.
│   ├── processed           <- The final, canonical data sets for modeling.
│   └── raw                 <- The original, immutable data dump.
├── docs                    <- Directory for Sphinx documentation in rst or md.
├── environment.yml         <- The conda environment file for reproducibility.
├── models                  <- Trained and serialized models, model predictions,
│                              or model summaries.
├── notebooks               <- Jupyter notebooks. Naming convention is a number (for
│                              ordering), the creator's initials and a description,
│                              e.g. `1.0-fw-initial-data-exploration`.
├── pyproject.toml          <- Build configuration. Don't change! Use `pip install -e .`
│                              to install for development or `tox -e build` to build.
├── references              <- Data dictionaries, manuals, and all other materials.
├── reports                 <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures             <- Generated plots and figures for reports.
├── scripts                 <- Analysis and production scripts which import the
│                              actual Python packages, e.g. `train_model.py`.
├── setup.cfg               <- Declarative configuration of your project.
├── setup.py                <- [DEPRECATED] Use `python setup.py develop` to install for
│                              development or `python setup.py bdist_wheel` to build.
├── src
│   ├── particle_packing       <- Actual Python package where the main functionality goes.
│   └── crabnet_hyperparameter <- Actual Python package where the main functionality goes.
├── tests                   <- Unit tests which can be run with `pytest`.
├── .coveragerc             <- Configuration for coverage reports of unit tests.
├── .isort.cfg              <- Configuration for git hook that sorts imports.
└── .pre-commit-config.yaml <- Configuration of pre-commit git hooks.
```
This project has been set up using PyScaffold 4.3.1 and the dsproject extension 0.7.2.