microsoft / responsible-ai-toolbox-mitigations

Python library for implementing Responsible AI mitigations.
https://responsible-ai-toolbox-mitigations.readthedocs.io/en/latest/
MIT License
57 stars 6 forks source link
data-analysis data-science machine-learning python responsible-ai responsible-ml

Responsible AI Mitigations

This Responsible-AI-Toolbox-Mitigations repo consists of a python library that aims to empower data scientists and ML developers to measure their dataset balance and representation of different dataset cohorts, while having access to mitigation techniques they could incorporate to mitigate errors and fairness issues in their datasets. Together with the measurement and mitigation steps, ML professionals are empowered to build more accurate and fairer models.

This repo is a part of the Responsible AI Toolbox, a suite of tools providing a collection of model and data exploration and assessment user interfaces and libraries that enable a better understanding of AI systems. These interfaces and libraries empower developers and stakeholders of AI systems to develop and monitor AI more responsibly, and take better data-driven actions.

ResponsibleAIToolboxMitigationsOverview The Responsible AI Mitigations Library helps AI practitioners explore different measurements and mitigation steps that may be most appropriate when the model underperforms for a given data cohort. The library currently has three modules: - **DataProcessing** offers mitigation techniques for improving model performance for specific cohorts. - **DataBalanceAnalysis** provides metrics for diagnosing errors that originate from data imbalance either on class labels or feature values. - **Cohort** provides classes for handling and managing cohorts, which allows the creation of custom pipelines for each cohort in an easy and intuitive interface. The module also provides techniques for learning different decoupled estimators (models) for different cohorts and combining them in a way that optimizes different definitions of group fairness. In this library, we take a **targeted approach to mitigating errors** in Machine Learning models. This is complementary and different from the traditional blanket approaches which aim at maximizing a single-score performance number, such as overall accuracy, by merely increasing the size of traning data or model architecture. Since blanket approaches are often costly but also ineffective for improving the model in areas of poorest performance, with targeted approaches to model improvement we focus the improvement efforts in areas previously identified to have more errors and their underlying diagnoses of error. For example, if a practitioner has identified that the model is underperforming for a cohort of interest by using Error Analysis in the Responsible AI Dashboard, they may also continue the debugging process by finding out through Data Balance Analysis and find out that there is class imbalance for this particular cohort. To mitigate the issue, they then focus on improving class imbalance for the cohort of interest by using the Responsible AI Mitigations library. This and several other examples in the documentation of each mitigation function illustrate how targeted approaches may help practitioner best at mitigation giving them more control in the model improvement process. ## Installation Use the following pip command to install the Responsible AI Toolbox. Make sure you are using Python 3.7, 3.8, 3.9 or 3.10. If running in jupyter, please make sure to restart the jupyter kernel after installing. There are three installation options for the ``raimitigations`` package: * To install the minimum dependencies, use: ``` pip install raimitigations ``` * To install the minimum dependencies + the packages required to run all of the notebooks in the ``notebooks/`` folder: ``` pip install raimitigations[all] ``` * To install all the dependencies used for development (such as ``pytest``, for example), use: ``` pip install raimitigations[dev] ``` ## Documentation To learn more about the supported dataset measurements and mitigation techniques covered in the **raimitigations** package, [please check out this documentation.](https://responsible-ai-toolbox-mitigations.readthedocs.io/en/latest/) ## Data Balance Analysis: Examples - [Data Balance Analysis Walk Through](notebooks/databalanceanalysis/data_balance_overall.ipynb) - [Data Balance Analysis Adult Census Example](notebooks/databalanceanalysis/data_balance_census.ipynb) - [End to End Notebook](notebooks/data_balance_e2e.ipynb) ## Data Processing/Mitigations: Examples Here is a set of tutorial notebooks that aim to explain how to use each one of the mitigation methods offered in the **dataprocessing** module. - [Encoders](notebooks/dataprocessing/module_tests/encoding.ipynb) - [Scalers](notebooks/dataprocessing/module_tests/scaler.ipynb) - [Basic Imputer](notebooks/dataprocessing/module_tests/basic_imputation.ipynb) - [Iterative Imputer](notebooks/dataprocessing/module_tests/iterative_imputation.ipynb) - [KNN Imputer](notebooks/dataprocessing/module_tests/knn_imputation.ipynb) - [Sequential Feature Selection](notebooks/dataprocessing/module_tests/feat_sel_sequential.ipynb) - [Feature Selection using Catboost](notebooks/dataprocessing/module_tests/feat_sel_catboost.ipynb) - [Identifying correlated features: tutorial](notebooks/dataprocessing/module_tests/feat_sel_corr_tutorial.ipynb) - [Data Rebalance using imblearn](notebooks/dataprocessing/module_tests/rebalance_imbl.ipynb) - [Data Rebalance using SDV](notebooks/dataprocessing/module_tests/rebalance_sdv.ipynb) - [Using scikit-learn's Pipeline](notebooks/dataprocessing/module_tests/pipeline_test.ipynb) Here is a set of case study scenarios where we use the transformations available in the **dataprocessing** module in order to train a model for a real-world dataset. - [Simple Example](notebooks/dataprocessing/module_tests/model_test.ipynb) - [Case Study 1](notebooks/dataprocessing/case_study/case1.ipynb) - [Case Study 2](notebooks/dataprocessing/case_study/case2.ipynb) - [Case Study 3](notebooks/dataprocessing/case_study/case3.ipynb) ## Handling Cohorts Here is a set of tutorial notebooks that aim to explain how to manage cohorts. - [Creating Single Cohorts](notebooks/cohort/cohort_definition.ipynb) - [Creating Different Pipelines for each Cohort](notebooks/cohort/cohort_manager.ipynb) - [Different Pre-processing Scenarios using cohorts](notebooks/cohort/cohort_manager_scenarios.ipynb) - [Using Decoupled Classifiers](notebooks/cohort/decoupled.ipynb) Here is a set of case study notebooks showing how creating customized dataprocessing pipelines for each cohort can help in some scenarios. - [Cohort Case Study 1](notebooks/cohort/case_study/case_1.ipynb) - [Cohort Case Study 1 - Rebalancing only specific cohorts](notebooks/cohort/case_study/case_1_rebalance.ipynb) - [Cohort Case Study 1 - Using RAI Toolbox](notebooks/cohort/case_study/case_1_dashboard.ipynb) - [Cohort Case Study 2](notebooks/cohort/case_study/case_2.ipynb) - [Cohort Case Study 3](notebooks/cohort/case_study/case_3.ipynb) - [Decoupled Classifier Case 1](notebooks/cohort/case_study/decoupled_class/case_1.ipynb) - [Decoupled Classifier Case 2](notebooks/cohort/case_study/decoupled_class/case_2.ipynb) - [Decoupled Classifier Case 3](notebooks/cohort/case_study/decoupled_class/case_3.ipynb) ## Dependencies **RAI Toolbox Mitigations** uses several libraries internally. The direct dependencies are the following: - [Numpy](https://numpy.org/) - [Pandas](https://pandas.pydata.org/) - [SciPy](https://scipy.org/) - [Scikit Learn](https://scikit-learn.org/stable/index.html) - [ResearchPY](https://pypi.org/project/researchpy/) - [Statsmodels](https://www.statsmodels.org/stable/index.html) - [Imbalanced Learn](https://imbalanced-learn.org/stable/) - [SDV](https://pypi.org/project/sdv/) - [CatBoost](https://catboost.ai/en/docs/) - [XGBoost](https://xgboost.readthedocs.io/en/stable/python/python_intro.html) - [MLxtend](https://pypi.org/project/mlxtend/) - [UCI Dataset](https://pypi.org/project/uci-dataset/) ## Contributing This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com. When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA. This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments. ### Installing Using ``dev`` Mode After cloning this repo and moving to its root folder, install the package in editable mode with the development dependencies using: ```console > pip install -e .[dev] ``` ### Pre-Commit This repository uses pre-commit hooks to guarantee that the code format is kept consistent. For development, make sure to activate pre-commit before creating a pull request. Any code pushed to this repository is checked for code consistency using Github Actions, so if pre-commit is not used when doing a commit, there is a chance that it fails in the format check workflow. Using pre-commit will avoid this. To use pre-commit with this repository, first install pre-commit (**NOTE:** when installing the package with the ``[dev]`` tag, the ``pre-commit`` package will already be installed): ```console > pip install pre-commit ``` After installed, navigate to the root directory of this repository and activate pre-commit through the following command: ```console > pre-commit install ``` With pre-commit installed and activated, whenever you do a new commit, pre-commit will check all new code using the pre-commit hooks configured in the *.pre-commit-config.yaml* file, located in the root of the repository. Some of the hooks might make formatting changes to some of the files commited. If any file is changed or if any other hook fails, the commit will fail. If that happens, make the necessary modifications, add the files to the commit and try commiting one more time. Do this until all hooks are successful. Note that these same checks will be done after pushing anything, so if your commit was successful while using pre-commit, it will pass in the format check workflow as well. ### Updating the Docs The documentation is built using [Sphinx](https://www.sphinx-doc.org/en/master/), [Pandoc](https://pandoc.org/installing.html), and [Graphviz](https://graphviz.org/) (to build the class diagrams). Graphviz and Pandoc must be installed separately ([detailed instructions here for Graphviz](https://graphviz.org/download/) and [here for Pandoc](https://pandoc.org/installing.html)). On Linux, this can be done with `apt` or `yum` (depending on your distribution): ```console > sudo apt install graphviz pandoc ``` ```console > sudo yum install graphviz pandoc ``` Make sure Graphviz and Pandoc are installed before recompiling the docs. After that, update the documentation files, which are all located inside the ```docs/``` folder. Finally, use: ```console > cd docs/ > make html ``` To view the documentation, open the file ```docs/_build/html/index.html``` in your browser. **Note for Windows users:** if you are trying to update the docs in a Windows environment, you might get an error regarding the *_sqlite3* module: ``` ImportError: DLL load failed while importing _sqlite3: The specified module could not be found. ``` To fix this, following the instructions found [in this link](https://www.dev2qa.com/how-to-fix-importerror-dll-load-failed-while-importing-_sqlite3-the-specified-module-could-not-be-found/). ## Support ### How to file issues and get help This project uses GitHub Issues to track bugs and feature requests. Please search the existing issues before filing new issues to avoid duplicates. For new issues, file your bug or feature request as a new Issue. For help and questions about using this project, please post your question in Stack Overflow using the ``raimitigations`` tag. ### Microsoft Support Policy Support for this package is limited to the resources listed above. ## Trademarks This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow [Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general). Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies. ## Research and Acknowledgements **Current Maintainers:** [Marah Abdin](https://github.com/marah-abdin), [Matheus Mendonça](https://github.com/mrfmendonca), [Dany Rouhana](https://github.com/danyrouh), [Mark Encarnación](https://github.com/markenc) **Past Maintainers:** [Akshara Ramakrishnan](https://github.com/akshara-msft), [Irina Spiridonova](https://github.com/irinasp) **Research Contributors:** [Besmira Nushi](https://github.com/nushib), [Rahee Ghosh Peshawaria](https://github.com/raghoshMSFT), [Ece Kamar](https://www.ecekamar.com/)