Data-Centric What-If Analysis for Native Machine Learning Pipelines.
This project uses the mlinspect project as a foundation, mainly for its plan extraction from native ML pipelines.
Prerequisite: Python 3.9
Clone this repository (optionally, with Git LFS, to also download the datasets for the scalability experiment)
Set up the environment
cd mlwhatif
python -m venv venv
source venv/bin/activate
If you want to use the visualisation functions we provide, install graphviz, which cannot be installed via pip
Linux:
apt-get install graphviz
macOS:
brew install graphviz
Install pip dependencies
SETUPTOOLS_USE_DISTUTILS=stdlib pip install -e ."[dev]"
To ensure everything works, you can run the tests (without graphviz, the visualisation test will fail)
python setup.py test
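Before running the tests, it can help to confirm that the two prerequisites named above are in place. The following is a minimal sketch, not part of mlwhatif: the function name `check_environment` and the dictionary keys are made up for illustration, and it assumes that a working graphviz installation exposes the `dot` executable on the PATH.

```python
import shutil
import sys


def check_environment():
    """Report whether the prerequisites from the setup steps appear to be met.

    Hypothetical helper for illustration only; mlwhatif does not ship it.
    """
    return {
        # The setup steps list Python 3.9 as the prerequisite.
        "python_3_9": sys.version_info[:2] == (3, 9),
        # Assumption: graphviz installs the `dot` executable on the PATH.
        "graphviz_dot_on_path": shutil.which("dot") is not None,
    }


status = check_environment()
for name, ok in status.items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```

If `graphviz_dot_on_path` is reported as missing, the visualisation test is the one expected to fail.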
mlwhatif makes it easy to analyze your pipeline and automatically run what-if analyses.
from mlwhatif import PipelineAnalyzer
from mlwhatif.analysis import DataCleaning, ErrorType

IPYNB_PATH = ...

# Map each column to the type of data error whose cleaning should be analyzed
cleanlearn = DataCleaning({'category': ErrorType.CAT_MISSING_VALUES,
                           'vine': ErrorType.CAT_MISSING_VALUES,
                           'star_rating': ErrorType.NUM_MISSING_VALUES,
                           'total_votes': ErrorType.OUTLIERS,
                           'review_id': ErrorType.DUPLICATES,
                           None: ErrorType.MISLABEL
                           })

# Load the pipeline from the notebook, register the analysis, and run it
analysis_result = PipelineAnalyzer \
    .on_pipeline_from_ipynb_file(IPYNB_PATH) \
    .add_what_if_analysis(cleanlearn) \
    .execute()

# Look up the report that belongs to this analysis
cleanlearn_report = analysis_result.analysis_to_result_reports[cleanlearn]
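To make the idea of a data-cleaning what-if analysis concrete, here is a toy sketch in plain Python. It does not use mlwhatif and does not reflect how mlwhatif works internally; the data, the stand-in `pipeline` function, and the variant names are all invented for illustration. It only shows the kind of question being asked: how would the pipeline's output change if the data had been cleaned in a different way?

```python
from statistics import mean, median

# Invented toy data: star ratings with missing values (None).
star_ratings = [5, 4, None, 1, None, 5, 3]


def pipeline(ratings):
    """Stand-in for a pipeline: the share of ratings of at least 4 stars."""
    return sum(r >= 4 for r in ratings) / len(ratings)


observed = [r for r in star_ratings if r is not None]

# Each variant answers one "what if we had cleaned the data this way?" question.
variants = {
    "drop_missing": observed,
    "impute_mean": [r if r is not None else mean(observed) for r in star_ratings],
    "impute_median": [r if r is not None else median(observed) for r in star_ratings],
}

# Run the same stand-in pipeline on every variant and collect the results.
report = {name: round(pipeline(data), 3) for name, data in variants.items()}
print(report)
# {'drop_missing': 0.6, 'impute_mean': 0.429, 'impute_median': 0.714}
```

The three variants give noticeably different results for the same pipeline, which is the kind of sensitivity a what-if analysis is meant to surface.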
We prepared a demo notebook to showcase mlwhatif and its features.
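When running the tests with pytest, two flags are useful: --no-cov disables coverage reporting, and --log-cli-level=10 -s shows log output at DEBUG level. The -s is needed because otherwise pytest breaks the stdout capturing.
This library is licensed under the Apache 2.0 License.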