Spelling/grammar/formatting issues

morrissharp commented 2 years ago

I think this is missing a blank line before the bulleted list. The bulleted list is not displaying properly in the docs.

https://github.com/microsoft/responsible-ai-toolbox-mitigations/blob/83c26a3fc61f15c609734126bd4e76e0d922fdbc/raimitigations/databalanceanalysis/aggregate_measures.py#L36
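
For reference, reStructuredText only recognizes a bullet list when a blank line separates it from the preceding paragraph. Below is a minimal sketch of the difference, assuming the docs are generated from docstrings with Sphinx. The function names and docstring text are hypothetical and are not taken from `aggregate_measures.py`.

```python
import inspect


def broken_docstring():
    """Computes aggregate measures:
    - first measure
    - second measure
    """


def fixed_docstring():
    """Computes aggregate measures:

    - first measure
    - second measure
    """


def list_is_separated(func):
    """Return True if the first bullet in the docstring follows a blank line."""
    lines = inspect.cleandoc(func.__doc__).splitlines()
    first_bullet = next(
        i for i, line in enumerate(lines) if line.lstrip().startswith("- ")
    )
    return first_bullet > 0 and lines[first_bullet - 1].strip() == ""


print(list_is_separated(broken_docstring))  # False
print(list_is_separated(fixed_docstring))   # True
```

Without the blank line, the bullets are folded into the preceding paragraph instead of being rendered as a list.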

morrissharp commented 2 years ago

In file https://github.com/microsoft/responsible-ai-toolbox-mitigations/blob/83c26a3fc61f15c609734126bd4e76e0d922fdbc/notebooks/databalanceanalysis/data_balance_overall.ipynb?short_path=14098f6#L15-L19

Suggest change to: Data Balance Analysis is relevant for the overall understanding of datasets, but is essential to building Machine Learning models in a responsible way, especially in terms of fairness. It is all too easy to build an ML model that produces biased results for subsets of the population by training or testing the model on biased ground truth data. There are multiple case studies of biased models assisting in granting loans, healthcare, recruitment opportunities and many other decision-making tasks. In most of these examples, the data on which these models are trained was the common issue. These findings emphasize how important it is for model creators and auditors to analyze data balance:

- to measure training data across various sub-populations
- to ensure the data has good coverage, and a balanced representation of labels across sensitive categories and category combinations
- and to check that the test data is representative of the target population

In summary, Data Balance Analysis has the following benefits when used for building ML models.

Reduces the risk of unbalanced models by:
- ensuring service fairness and reducing the costs of ML building by identifying data representation gaps early on
- prompting data scientists to seek mitigation steps before proceeding to the training portion of Machine Learning model development
Enables easy end-to-end debugging of ML systems in combination with Fairlearn by providing a clear view of whether an issue in a model is tied to the data or to the model itself.

morrissharp commented 2 years ago

The docstring for FeatureBalanceMeasure.measures is not showing up: https://sturdy-barnacle-3b9f911d.pages.github.io/databalanceanalysis/databalanceanalysis.html#databalanceanalysis.feature_measures.FeatureBalanceMeasure.measures

Probably because it is not the first line in the function: https://github.com/microsoft/responsible-ai-toolbox-mitigations/blob/83c26a3fc61f15c609734126bd4e76e0d922fdbc/raimitigations/databalanceanalysis/feature_measures.py#L91-L94
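
To illustrate the suspected cause: Python only treats a string literal as a docstring when it is the first statement in the function body. Below is a minimal sketch with hypothetical function names, not the actual code in `feature_measures.py`.

```python
def measures_without_doc():
    result = {}  # any statement placed before the string literal...
    """Returns the measures."""  # ...turns it into a plain expression
    return result


def measures_with_doc():
    """Returns the measures."""
    result = {}
    return result


print(measures_without_doc.__doc__)  # None
print(measures_with_doc.__doc__)     # Returns the measures.
```

Documentation generators that read `__doc__` have nothing to display in the first case, so moving the string to the top of the function should be enough.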

morrissharp commented 2 years ago

The encoding example could use a few more comments:

Under Encoding Examples: A description of what is in the notebook.
Under One Hot Encoding / dataset without headers: quick description of what is going on in these cells

morrissharp commented 2 years ago

A number of single quotes in this docstring should instead be backticks, so that the text is formatted as code, e.g. https://github.com/microsoft/responsible-ai-toolbox-mitigations/blob/0d69bb6db4ddf92db1870a265147e7458be0cf5f/raimitigations/dataprocessing/feat_selection/sequential_select.py#L30-L31

morrissharp commented 2 years ago

From: https://github.com/microsoft/responsible-ai-toolbox-mitigations/blob/0d69bb6db4ddf92db1870a265147e7458be0cf5f/raimitigations/dataprocessing/feat_selection/sequential_select.py#L15-L16

The specific module and library should be highlighted as code:

Implements the sequential feature selection method using the mlxtend library -> Implements the `SequentialFeatureSelector` method using the `mlxtend` library

morrissharp commented 2 years ago

A number of single quotes in this docstring should instead be backticks, so that the text is formatted as code, e.g.

https://github.com/microsoft/responsible-ai-toolbox-mitigations/blob/0d69bb6db4ddf92db1870a265147e7458be0cf5f/raimitigations/dataprocessing/feat_selection/sequential_select.py#L30-L31

Noting that this issue appears in much (if not all) of the documentation.

morrissharp commented 2 years ago

The math equation here is not being formatted correctly. Maybe there is something missing in the conf file for equations? https://github.com/microsoft/responsible-ai-toolbox-mitigations/blob/0d69bb6db4ddf92db1870a265147e7458be0cf5f/raimitigations/dataprocessing/sampler/rebalance.py#L76

https://sturdy-barnacle-3b9f911d.pages.github.io/dataprocessing/sampler/rebalance.html
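
One possible cause worth ruling out, though not confirmed for this file: if the equation uses LaTeX commands inside a regular (non-raw) docstring, Python interprets sequences such as `\f` as escape characters before Sphinx ever sees them. Below is a minimal sketch, assuming the docs are built with Sphinx and the equation uses the `:math:` role. The functions and the formula are hypothetical.

```python
def plain_docstring():
    """Ratio: :math:`\frac{a}{b}`"""


def raw_docstring():
    r"""Ratio: :math:`\frac{a}{b}`"""


# In the plain docstring, "\f" has become a form-feed character (0x0c),
# so the LaTeX command is corrupted before the docs are built.
print("\x0c" in plain_docstring.__doc__)   # True
print("\\frac" in plain_docstring.__doc__) # False

# The raw docstring keeps the backslash, so the command survives intact.
print("\\frac" in raw_docstring.__doc__)   # True
```

If this turns out to be the issue, prefixing the docstring with `r` would fix it without any change to the Sphinx configuration.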

morrissharp commented 2 years ago

Suggest adding backticks around `cat_col`, `over_sampler`, and `transform_pipe`, plus a few grammatical changes:

https://github.com/microsoft/responsible-ai-toolbox-mitigations/blob/0d69bb6db4ddf92db1870a265147e7458be0cf5f/raimitigations/dataprocessing/sampler/rebalance.py#L60-L68

a list of names or indexes of categorical columns. If None, this parameter will be set automatically as a list of all the categorical variables in the dataset. These columns are used to determine the default SMOTE type that should be used: if `cat_col` is None, then use SMOTE; if `cat_col` represents all columns of the dataset, then use SMOTEN; if `cat_col` is a subset of columns of the dataset, then use SMOTENC. If a specific SMOTE object is provided in the constructor (using the `over_sampler` parameter), then the columns in `cat_col` will be automatically encoded using One-Hot encoding (`EncoderOHE`), unless another encoding transformer is provided in the `transform_pipe` parameter;
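
The selection rule described in this docstring can be sketched as follows. This is only an illustration of the rule as worded above, not the library's implementation: `default_smote_type` is a hypothetical helper, columns are identified by name only, and treating an empty list like `None` is an assumption.

```python
from typing import Optional, Sequence


def default_smote_type(
    cat_col: Optional[Sequence[str]], all_columns: Sequence[str]
) -> str:
    """Return the name of the SMOTE variant implied by ``cat_col``.

    Hypothetical helper for illustration; not part of raimitigations.
    """
    # Assumption: an empty list is handled like None (no categorical columns).
    if not cat_col:
        return "SMOTE"
    # Every column is categorical.
    if set(cat_col) == set(all_columns):
        return "SMOTEN"
    # Only a subset of the columns is categorical.
    return "SMOTENC"


columns = ["age", "sex", "city"]
print(default_smote_type(None, columns))     # SMOTE
print(default_smote_type(columns, columns))  # SMOTEN
print(default_smote_type(["sex"], columns))  # SMOTENC
```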

morrissharp commented 2 years ago

https://github.com/microsoft/responsible-ai-toolbox-mitigations/blob/main/notebooks/dataprocessing/module_tests/feat_sel_catboost.ipynb

"Notice that several logging information was generated by CatBoost. We can avoid this by setting the catboost_log parameter to False" -> "Notice that CatBoost logs information to the console during the run. We can suppress this output by setting the catboost_log parameter to False"

There's only one example with Regression: "First of all, let's create a dummy regression dataset so we can build a few examples" -> "First of all, let's create a dummy regression dataset for the next example."

morrissharp commented 2 years ago

It is somewhat confusing that the "new dataset" referred to here is not the same as `new_df` in the above cell. Maybe it could be referred to as "updated dataset"? https://github.com/microsoft/responsible-ai-toolbox-mitigations/blob/0d69bb6db4ddf92db1870a265147e7458be0cf5f/notebooks/dataprocessing/module_tests/feat_sel_corr_tutorial.ipynb?short_path=38cf628#L550

morrissharp commented 2 years ago

https://github.com/microsoft/responsible-ai-toolbox-mitigations/blob/0d69bb6db4ddf92db1870a265147e7458be0cf5f/notebooks/dataprocessing/module_tests/feat_sel_corr_tutorial.ipynb?short_path=38cf628#L913

Section 4: "Differently to the previous scenarios" -> "In contrast to the previous scenarios"

morrissharp commented 2 years ago

https://github.com/microsoft/responsible-ai-toolbox-mitigations/blob/0d69bb6db4ddf92db1870a265147e7458be0cf5f/notebooks/dataprocessing/module_tests/feat_sel_sequential.ipynb?short_path=6e5cbcf#L637

"5 features" -> "6 features"

morrissharp commented 2 years ago

There should be a link to the docs for scikit-learn's `SimpleImputer` here, since the parameter details are described in the scikit-learn docs and not here.

https://github.com/microsoft/responsible-ai-toolbox-mitigations/blob/0d69bb6db4ddf92db1870a265147e7458be0cf5f/raimitigations/dataprocessing/imputer/basic_imputer.py#L36-L37

morrissharp commented 2 years ago