scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

Doc/Discussion: Discrimination in examples and user guide #16715

Open lorentzenchr opened 4 years ago

lorentzenchr commented 4 years ago

Goal

A space to discuss thoughtfully and amicably whether the topic of discrimination and bias should be addressed in the documentation (examples and user guide).

Non Goal

Endless and pointless discussions. So please, use every word wisely and sparingly.

Possible Solutions

Mention in example X that the data may contain bias and link to a good source for further insights. Discover model bias in example Y for a certain feature subspace and investigate it a little (a rough sketch of what such a check could look like is given below).
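For concreteness, here is a minimal, purely illustrative sketch of such a per-subspace check, assuming the freMTPL2freq dataset (OpenML id 41214) and its DrivAge, ClaimNb and Exposure columns as used in the existing Poisson regression example. It only compares exposure-weighted observed and predicted claim frequencies per driver-age bucket and is not a full fairness analysis:

```python
# Illustrative only: compare observed vs. predicted claim frequency per
# driver-age bucket to surface systematic over-/under-prediction in a
# feature subspace. Dataset and column names follow the existing
# "Poisson regression and non-normal loss" example.
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.linear_model import PoissonRegressor
from sklearn.model_selection import train_test_split

df = fetch_openml(data_id=41214, as_frame=True).frame  # freMTPL2freq
df["Frequency"] = df["ClaimNb"] / df["Exposure"]

# Deliberately minimal model: a single numeric feature, weighted by exposure.
X = df[["DrivAge"]]
X_train, X_test, df_train, df_test = train_test_split(X, df, random_state=0)

model = PoissonRegressor(alpha=1e-4)
model.fit(X_train, df_train["Frequency"], sample_weight=df_train["Exposure"])

df_test = df_test.copy()
df_test["Predicted"] = model.predict(X_test)
age_bucket = pd.cut(df_test["DrivAge"], bins=[17, 25, 35, 50, 65, 120])

# Exposure-weighted observed vs. predicted frequency per age bucket.
summary = df_test.groupby(age_bucket, observed=True).apply(
    lambda g: pd.Series({
        "observed": np.average(g["Frequency"], weights=g["Exposure"]),
        "predicted": np.average(g["Predicted"], weights=g["Exposure"]),
    })
)
print(summary)
```

Large, systematic gaps between the two columns for a specific bucket would be the kind of finding worth discussing in the example text.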

References

The examples Poisson regression and non-normal loss and Tweedie regression on insurance claims use an insurance dataset; the example Common pitfalls in interpretation of coefficients of linear models uses a wages dataset. Some concerns were pointed out in https://github.com/scikit-learn/scikit-learn/pull/16648#discussion_r388974502.

rth commented 4 years ago

Some concerns were pointed out in #16648 (comment).

To copy this discussion here for future reference:

@adrinjalali :

The issue is that the data is biased and we can't even measure that bias because we don't even have the features we need (like race and gender). Not using those features doesn't mean we're not going to have a biased model discriminating against a certain group, and I'm very very worried about putting an example out there which people would then use as a reference to [unintentionally] discriminate against people. @romanlutz, do you happen to have a good example for this one?

@rth :

I'm all for better examples for controlling for bias, but I also don't think this PR is the right place for this discussion. It merely refactors an existing example; we should have this discussion in a separate issue.

As a side note, I imagine there could indeed be some sample selection bias in the data (i.e. the company chooses its customers), however the target variable (frequency or cost of accidents) shouldn't be too biased, I think? At least significantly less biased than in other examples such as scikit-learn.org/dev/auto_examples/inspection/plot_linear_model_coefficient_interpretation.html for predicting wage. Also, the pricing policy of an insurance company doesn't directly impact how often one has accidents, so at least there is no immediate feedback loop with the training data. I'm not an expert on this, and there are likely other effects, but I'm just saying that we should discuss it in a separate issue.

@romanlutz:

I don't disagree with the suggestions to discuss this elsewhere. For more information on what @adrinjalali is referring to, I recommend "Big Data's Disparate Impact" by Barocas and Selbst (https://ssrn.com/abstract=2477899), section I.D. The entire piece is actually relevant for such a scenario, but that's the section that discusses excluding sensitive features such as race & gender. Other than that, there's still potential bias in how the data got collected (I.A, I.B, I.C). I think there's value in acknowledging such potential shortcomings so that people don't assume that it's the best (or only) way to approach the task. We wouldn't want users to end up on my list of questionable or unethical use cases. With the acknowledgment it should be clear that it's just a demonstration of scikit-learn. At least that's my point of view :-) Thanks for asking!
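A small synthetic sketch of why dropping sensitive features is not enough on its own: if the remaining features act as proxies, the dropped attribute can be reconstructed from them. All column names below are made up purely for illustration and are not part of any scikit-learn example:

```python
# Illustrative proxy check on synthetic data: after dropping a protected
# attribute, see how well it can still be predicted from the features that
# are kept. A high score means the dropped column's information is still
# present through correlated proxies.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
n = 5_000
protected = rng.binomial(1, 0.5, size=n)            # e.g., a group label
zip_code = protected * 2.0 + rng.normal(size=n)     # correlated proxy
income = 1.5 * zip_code + rng.normal(size=n)        # downstream feature

X_kept = np.column_stack([zip_code, income])        # protected column dropped

proxy_score = cross_val_score(
    LogisticRegression(), X_kept, protected, cv=5, scoring="roc_auc"
).mean()
print(f"protected attribute recoverable from kept features, ROC AUC = {proxy_score:.2f}")
```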

rth commented 4 years ago

Thanks @lorentzenchr, we don't actually have that much discussion on this topic in scikit-learn so far.

I would propose to add:

  • (possibly) a small section to the user guide under "3. Model selection and evaluation" regarding bias in data and in models, with a few classical examples of applications and references to further reading. Don't we have an issue about it already somewhere @adrinjalali?
  • an example that controls for bias with a few protected variables (for instance on the wage dataset). If possible, correct the model for it without using external dependencies, or otherwise point to relevant external resources.

I'm not sure how useful it is to say, in examples using real-world datasets, that the data is biased, since most data, particularly data related to human interaction, is biased in some way.

rth commented 4 years ago

To go back to the car accidents dataset,

The issue is that the data is biased and we can't even measure that bias because we don't even have the features we need (like race and gender)

Not using those features doesn't mean we're not going to have a biased model discriminating against a certain group, and I'm very very worried about putting an example out there which people would then use as a reference to [unintentionally] discriminate against people.

I absolutely agree that not using sensitive features is not a solution in general. I'm just not fully sure how this applies to the car accidents dataset.

Sure, there could be some data collection bias. Still, putting aside regulatory constraints and practices in the insurance industry (of which I don't know much), say we are trying to model car accident frequency. Generally there are variations across variables that could be considered protected (e.g., age or gender), and the model reflects that. Would you then say that the data is biased (with an age/gender discrimination) and that the model needs to be corrected for it? I can understand doing that from a social fairness point of view, but I still wouldn't call it problematic bias in the training data in this case.

lorentzenchr commented 4 years ago

From my point of view, one has to make a distinction between:

I'd also like to mention Discrimination-Free Insurance Pricing, which shows that one might need all possibly discriminatory features in order to compute discrimination-free prices.
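A rough sketch of the averaging idea in that line of work, on synthetic data with made-up column names (not a scikit-learn recipe): fit the best estimate with all features, including the protected one, then price by averaging its predictions over the marginal distribution of the protected attribute instead of plugging in the individual's own value:

```python
# Hypothetical sketch of discrimination-free pricing by averaging.
# Synthetic data; "age" and "gender" are illustrative column names only.
import numpy as np
import pandas as pd
from sklearn.linear_model import PoissonRegressor

rng = np.random.RandomState(0)
n = 10_000
df = pd.DataFrame({
    "age": rng.uniform(18, 80, size=n),
    "gender": rng.binomial(1, 0.5, size=n),   # protected attribute D
})
lam = np.exp(-2.0 + 0.01 * df["age"] + 0.3 * df["gender"])
df["claims"] = rng.poisson(lam)

# Best-estimate model uses all features, including the protected one.
model = PoissonRegressor().fit(df[["age", "gender"]], df["claims"])

# Discrimination-free price: average the best estimate over the marginal
# distribution of the protected attribute, h(x) = sum_d mu(x, d) * P(D = d).
p_d = df["gender"].value_counts(normalize=True)
price = sum(
    p * model.predict(df[["age"]].assign(gender=d))
    for d, p in p_d.items()
)
df["discrimination_free_price"] = price
print(df.head())
```

The key point of the reference carries over: computing the averaged price requires a model that was fitted with the protected attribute, so the "discriminatory" features are needed to remove their direct effect.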

romanlutz commented 4 years ago

I'm not an anthropologist or domain expert, so I certainly don't want to make the claim that I can tell you whether this is fine or not. I have come across many applications that seemed benign at first and turned into something much less benign when deployed in practice. For that purpose I've put together a list of questionable or outright unethical use cases: https://github.com/romanlutz/ResponsibleAI.

I do understand that the purpose here is simply to show how scikit-learn is used for a certain task. The point of the notebook is not necessarily to say "this is how you approach this problem and this is the one and only solution". That can also be acknowledged to be 100% transparent about this and to make sure people don't get the wrong idea.

About the actual criticism of the use case:

I'm sure actual domain experts would be able to point out dozens more questions one could ask. I barely scratched the surface here, without even looking at any data. The point was merely to show that a supposedly simple scenario isn't always as simple, since this data represents the real world, which is much more complicated than just features and labels/scores.

@lorentzenchr captures this: "This raises ethical questions about fairness and solidarity that a society has to answer, e.g. via laws and institutions like regulators. I would not expect scikit-learn to comment on that." The point is that scikit-learn is so popular that perhaps you need to make this absolutely clear. I'll reiterate my sentence from the start: That can also be acknowledged to be 100% transparent about this and to make sure people don't get the wrong idea.

I'm not a contributor to this project and just shared my 2cts because @adrinjalali asked. I certainly appreciate the situation you find yourselves in, but I think a little disclaimer/acknowledgment of the problem would go a long way (and without you having to replace the entire scenario). It's your call, of course.

lorentzenchr commented 4 years ago

The current state seems to be that contributions in the form of user guide sections and examples are welcome, as @rth pointed out:

I would propose to add:

  • (possibly) a small section to the user guide under "3. Model selection and evaluation" regarding bias in data and in models, with a few classical examples of applications and references to further reading. Don't we have an issue about it already somewhere @adrinjalali?
  • an example that controls for bias with a few protected variables (for instance on the wage dataset). If possible, correct the model for it without using external dependencies, or otherwise point to relevant external resources. (A rough sketch of a first, group-wise check is given after this list.)
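As a starting point for the second item, a rough, hypothetical sketch of a group-wise error check, assuming the wages dataset (OpenML id 534) and its SEX column as used in the "Common pitfalls in interpretation of coefficients of linear models" example. This only audits per-group error; it does not by itself correct the model:

```python
# Hypothetical sketch: audit per-group error on the wages dataset used in
# the "Common pitfalls" example. The SEX column name is taken from that
# dataset; this is an audit step, not a bias correction.
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.datasets import fetch_openml
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

survey = fetch_openml(data_id=534, as_frame=True)  # CPS wages data
X, y = survey.data, survey.target                  # target is WAGE
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(
    make_column_transformer(
        (OneHotEncoder(handle_unknown="ignore"),
         make_column_selector(dtype_include=["category", "object"])),
        (StandardScaler(),
         make_column_selector(dtype_exclude=["category", "object"])),
    ),
    Ridge(alpha=1.0),
)
model.fit(X_train, y_train)

# Compare the error of the same model across the protected groups.
pred = pd.Series(model.predict(X_test), index=X_test.index)
for group in X_test["SEX"].unique():
    mask = X_test["SEX"] == group
    print(f"SEX={group}: MAE={mean_absolute_error(y_test[mask], pred[mask]):.2f}")
```

Actually correcting a model would likely require careful modelling choices or dedicated external tooling such as fairlearn, which is the kind of external resource the proposal suggests pointing to.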