Add bias detection to preprocessing

Lilly-May commented 7 months ago

PR Checklist

[X] This comment contains a description of changes (with reason)
[X] Referenced issue is linked (#647)
[x] If you've fixed a bug or added code that should be tested, add tests!
[x] Documentation in docs is updated

Description of changes

This is a very drafty PR to get the discussion on the bias module started. The general idea is to have one method that calls several submethods:

Pairwise correlations between features
Standardized mean differences for all features between groups in sensitive features
Feature importances (Is a sensitive feature an important predictor of another feature?)

I thought the usage to be something along the lines of:

# adata is expected to be encoded
ep.pp.bias_detection(adata, sensitive_features=["ethnicity", "sex"])

I also made some necessary adjustments on the feature importances method. Firstly, I changed the default model from regression to rf (Random Forest) as I found this to be a lot more reliable when testing it for the bias detection, so I believe this should be the default. I also added two parameters to include the options to not log during the calculation of feature importances and to return the prediction score (R2 or accuracy).

Discussion points

Do we make the submethods (_feature_correlations and _standardized_mean_differences) public as well? If not, should I move everything into one big method since otherwise the submethods (e.g., the correlations one) are quite short, and also we would regenerate a pandas DataFrame from the AnnData several times.
It would be great to offer a more explorative way to detect potentially unknown sensitive features and biases in the data. Basically, we would have the option sensitive_features="all". However, I think we can't run the feature importances then because it gets too computationally expensive. Maybe we should add a parameter run_feature_importances that is set to False by default when sensitive_features="all" and set to True by default when sensitive_features is a list of features?
How de we call the big method computing all different measurements? generate_bias_report? bias_detection?
How do we save the results in the AnnData? We will likely need them for downstream plot generation. But for example, for the standardized mean difference, we have one n_groups_in_feature x var_n matrix per investigated sensitive feature, so we would have one entry in varm per sensitive feature. Do we want that? Or do we only save those for which we find potential biases?
How do we show the output? Should we print out a table (similar to scCODA in pertpy)? We could also (but this might be a weird idea) think if it would be possible to generate an (additional) HTML report which already includes plots. But that would be quite different from the rest of the API, so I don't know if we want to go for that.

ToDos

[x] Resolve ToDos in the code
[x] Look into and adjust the code with respect to differences for continuous and categorical features (waiting on #637 for that)
[x] Add tests
[x] Add examples
[x] Save standardized mean differences in uns subdict
[x] Raise error if data are not already encoded
[x] Introduce copy parameter
[x] Fix test warnings

Zethson commented 7 months ago

Thank you so much!

Do we make the submethods (_feature_correlations and _standardized_mean_differences) public as well? If not, should I move everything into one big method since otherwise the submethods (e.g., the correlations one) are quite short, and also we would regenerate a pandas DataFrame from the AnnData several times.

If all of these are fast and don't have too many configurable parameters, it's easier to have a one-stop function. There are ways to circumvent the need to regenerate the Pandas DataFrame such as class variables, cached properties, and other ideas but let's assume that we don't need that for now.

It would be great to offer a more explorative way to detect potentially unknown sensitive features and biases in the data. Basically, we would have the option sensitive_features="all". However, I think we can't run the feature importances then because it gets too computationally expensive. Maybe we should add a parameter run_feature_importances that is set to False by default when sensitive_features="all" and set to True by default when sensitive_features is a list of features?

Yeah, that sounds fair to me. Ideally, the fewer parameters the better. That's a rule for every function that we implement. Plotting functions can be exceptions.

How de we call the big method computing all different measurements? generate_bias_report? bias_detection?

detect_bias? When I hear report I always think of an actual HTML/PDF report...

How do we save the results in the AnnData? We will likely need them for downstream plot generation. But for example, for the standardized mean difference, we have one n_groups_in_feature x var_n matrix per investigated sensitive feature, so we would have one entry in varm per sensitive feature. Do we want that? Or do we only save those for which we find potential biases?

I'd save all of them and not just the ones that we find potential biases for. We just calculate and the interpretation/usage is up to the user (we provide the tools to do so). So yes varm.

How do we show the output? Should we print out a table (similar to scCODA in pertpy)? We could also (but this might be a weird idea) think if it would be possible to generate an (additional) HTML report which already includes plots. But that would be quite different from the rest of the API, so I don't know if we want to go for that.

I veto HTML reports because they're annoying to interpret and it'd be much cooler to generate whole reports for this including the cohort tracking etc in the future. This would be an entirely different subproject. One learning of mine is that Rich tables are nice, but sometimes troublesome and cannot be resized. Therefore, I always prefer DataFrames over Rich tables and plots over no plots. Ideally, this function could return a summary table and store extensive results in the varm slots?

Lilly-May commented 7 months ago

Thanks for the review @Zethson!

I'll summarize the main things I'm going to change here:

[x] Put everything into one method
[x] Introduce sensitive_features="all" option
[x] Rename method to detect_bias
[x] Save all calculations in the adata
[x] Return a summary table with detected biases exceeding the respective threshold specified as parameter (I have to think a bit more about how I could summarize everything in ideally one pandas dataframe)

One more thing to think about: Do we also want to offer plots in some way? Would the summary dataframe be the input, or should it work with the values stored in the adata without additional user input?

Zethson commented 7 months ago

One more thing to think about: Do we also want to offer plots in some way? Would the summary dataframe be the input, or should it work with the values stored in the adata without additional user input?

I don't think that we need to offer more plots. I'd much rather consider integrating this with @eroell cohort tracking or TableOne down the road.

eroell commented 7 months ago

Cool!

Thoughts on the implementing side of things:

Maybe we should add a parameter run_feature_importances that is set to False by default when sensitive_features="all" and set to True by default when sensitive_features is a list of features?

This function will likely be very verbose with many arguments. It is probably easier to have argument defaults consistent, and not interacting with each other? :)

How do we save the results in the AnnData? We will likely need them for downstream plot generation. But for example, for the standardized mean difference, we have one n_groups_in_feature x var_n matrix per investigated sensitive feature, so we would have one entry in varm per sensitive feature. Do we want that? Or do we only save those for which we find potential biases?

I'd save all of them and not just the ones that we find potential biases for. We just calculate and the interpretation/usage is up to the user (we provide the tools to do so). So yes varm.

Hm. A bit more involved, no? The correlation is a n_var x n_var matrix; but only if we compute a correlation also between (categorical, categorical) and (categorical, numerical), for which we have not paved the way yet right

The feature importance is part of .var if I got this right, since calling a "native" ehrapy function this would be naturally accounted for indeed :)

The standardized mean difference: this is quite a lot going into the direction of rank_features_groups - there, we'd even have things for categoricals/continuous stuff already, and cool test statistics, and so on. Do we want yet another one here? If smd is important, it maybe even should be in rank_features_groups, and from there be called here I think But to not slow things here down too much, we can also go ahead keeping it here as you suggested - and take care of fancier things later :)

Zethson commented 7 months ago

@eroell you are correct :) Thanks!

theislab / ehrapy

Add bias detection to preprocessing #690