pymetrics / audit-ai

detect demographic differences in the output of machine learning models or other assessments
MIT License

Implementation suggestions: How to use the "debias set" for different roles? #30

Open Ā· kevinrobinson opened this issue 3 years ago

kevinrobinson commented 3 years ago

hello! šŸ‘‹ In the implementation suggestions, I'm curious about this bit:

> Pymetrics models are built for specific roles within specific companies. To achieve this customization, we collect data from top-performing incumbents in the target role. We then compare incumbents to a baseline sample of the over 1 million candidates who have applied to jobs through pymetrics. We also establish a special data set, which we call the debias set, which is sampled from a pool of 150,000 individuals who have voluntarily provided basic demographic information such as sex, ethnicity or age. From there, a wide variety of algorithms might be tested to create an initial machine learning model from the training data. The process itself is model agnostic. Multiple algorithms are fit in this process, and we are continuously testing new methods that might improve performance. The goal of the algorithm is to find the features that will most accurately and reliably separate the incumbent set from the baseline set.

In particular, the selection method for the baseline sample and the "debias set" seems really important! Is this secret or can you share the methodology you're using? I'm trying to understand how it fits in with the dataset that I would provide.

Thanks for sharing this work in the open and the explanations in the examples too, it's super interesting! šŸ‘

ljbaker commented 3 years ago

Hey! No problem, there's no particular secret sauce here.

There are three groups of data at play here.

  1. The incumbent set is just your target set. For us, it's people who have demonstrated success in a role (e.g., good management ratings, high sales, high customer satisfaction, etc.). For other projects it could be people who needed a medical procedure or who paid off their mortgages.
  2. The baseline is your reference group. For us, it's the applicant pool for a job, but for other projects it could be people in good health or who defaulted on their home loans. For new models that don't have historic applicant data, we can build baseline sets from the anticipated applicant pool based on applicants to similar roles. For the purposes of this tool, though, the baseline could be any comparison group.
  3. The debias set is a held-out group with representative demographics for the applied setting of the model, and you are correct, it is both important and tricky. On the one hand, you could use a large, general dataset for debiasing (e.g., worldwide applicants to all jobs) or a dataset tailored to a very specific application (e.g., car mechanics in Honduras). It's a classic overfitting trade-off: if you use a general dataset, you might miss nuances in the specific application of the model, but if you use a tailored dataset, you might pick up on existing patterns of discrimination in the industry (e.g., car mechanics may have a higher incidence of being male). I can't really guide you one way or another, but we prefer to go more general in our selection of a bias set so that we don't over-fit to one group. (I'll sketch all three sets below.)
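
Here's that sketch, just to make the three groups concrete. The column names are placeholders I'm inventing for illustration, not anything this package requires:

```python
import pandas as pd

# 1. Incumbent (target) set: successful people, with the model's input features.
incumbents = pd.DataFrame({"feature_1": [0.9, 0.8], "feature_2": [1.2, 1.1]})

# 2. Baseline set: the comparison pool (e.g., applicants to the role), same feature columns.
baseline = pd.DataFrame({"feature_1": [0.4, 0.5], "feature_2": [0.9, 1.0]})

# 3. Debias set: held out, demographically labeled, with no outcome label at all.
#    It only gets scored by the trained model so group pass rates can be compared.
debias = pd.DataFrame({
    "feature_1": [0.6, 0.5, 0.7],
    "feature_2": [1.0, 1.1, 0.9],
    "gender":    ["F", "M", "F"],
    "ethnicity": ["A", "B", "A"],
})
```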

Hope that helps!

kevinrobinson commented 3 years ago

@ljbaker This is awesome, thank you for the thoughts! šŸ˜„

I'll put together a more concrete example and ping back in the next few weeks - I think it'll help me clarify what the library here provides, and what parts of the implementation suggestions are specific to the data that pymetrics has previously collected (and not part of this open source package, so would need to be figured out separately).

I very much hear the context-specific nature of what you're describing, so perhaps a concrete example could help with that. Am I reading the example notebooks correctly that they all use a single dataset artifact, and thus sidestep the difficult choices around selecting the baseline and debias sets that you describe here?

Thanks again for sharing your work in the open, and for your time! šŸ‘

ljbaker commented 3 years ago

Hi @kevinrobinson -- sorry for the delay, lost this in my inbox.

In the templates we share here, you are correct, we use a single baseline and bias set that has been established for the role. In practice, this would be the applicant pool for a job and demographically labeled applicants to this and other roles.

In a practical example, we can use medical outcomes (I'm borrowing a bit from an example I once heard from Ziad Obermeyer). Say that you want to identify individuals who would benefit from a preventative medicine intervention. You take people who have gone through the treatment successfully (target or incumbent set). That's the only easy part.

For the baseline, there are a couple of options. First, you could compare your incumbents to people who have declined, dropped out, or showed negligible impact from the intervention. Second, you could compare your incumbents to the baseline health outcomes of (a) the hospitals used in the sample, (b) the general US population, or (c) patients who did not take the intervention and are roughly matched by age, sex, ethnicity, socioeconomic status, and preexisting health conditions.

Ideally you'd want your bias set to be drawn from the same population as your baseline, but kept as a special held-out set that has demographic information.

Selecting your baseline and bias sets can make a pretty big difference in your estimated and real outcomes. Obermeyer discussed how a hospital might be located in an area with large racial inequality (e.g., Los Angeles). A more general baseline might create a fair, accurate model on average, but exacerbate existing inequalities in the region.

I think I'm just rambling at this point. Good luck with whatever you're working on, happy to give you feedback if you want it!

kevinrobinson commented 3 years ago

@ljbaker No worries, this is super helpful! I think the crux of what I'm trying to think through is this:

> Selecting your baseline and bias sets can make a pretty big difference in your estimated and real outcomes.

So to follow along with your context of predictive healthcare, I'll try to make it a bit more concrete and see if that helps us work forward :) Let's say we're looking at predicting how well an asthmatic patient will respond to a new form of preventative inhaler treatment at Los Angeles Community Hospital.

It sounds quite challenging to get started with constructing the baseline and bias sets. Is the first step, before we can begin, to collect data over a period of time for all asthmatic patients (demographic data and health outcomes)? That seems like it would cost a lot in terms of time and effort, and that I'd be on my own without being able to collaborate with other hospitals. But, at the same time, there's the risk that if I collaborate with other hospitals on collecting that data about asthmatic patients, I've tangled in other confounds related to important differences between the hospitals, the quality of care, and the patients they serve. So I'm trying to figure out how I might get started with the approach discussed in this library.

If it's simpler, I've actually been thinking about this in the context of application screening in the hiring process for a technical role, which is why I thought this would be a good place to ask :) We can switch analogies to that if that would simplify things - the social and legal context is kind of different in healthcare, as are the costs of false positives and false negatives! If it would help, I could also put together a notebook trying to show what my current understanding is, and maybe that would be a better way for us to work through more concretely where audit-ai fits in and how I'd use it with the debias set in particular.

Thanks again for your help in thinking through this šŸ‘

ljbaker commented 3 years ago

Hey @kevinrobinson

No worries. What you're describing is the cold-start problem of collecting enough data to get a good model into production, and it will always be an issue. In either the medical or hiring scenario (we can switch to the hiring one so this stays targeted to what you're doing), you'll have to treat the collection of data that includes fairness information as a necessary requirement for launch.

Looking at your asthma case, it's probably easier than your hiring case, depending on the kinds of data you're collecting. Ideally you'd want to collect enough data from LACH to train the model (distinguishing at-risk asthma patients who respond well to treatment from the rest of the hospital's respiratory patients), and then a dataset from other hospitals as a hold-out set. The benefit of the package as used here is that you don't have to know the true labels of the bias set; you're merely using it to assess the fairness of the model. The same thing applies to the hiring example. You build a model on a target population with the intent to maximize predictiveness, and then test on a held-out set for model fairness.
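
To make that concrete in the hiring framing, here's a rough sklearn/pandas sketch with synthetic data. The features, the 0.5 cutoff, and the group labels are all made up for illustration; this isn't a prescribed pipeline, just the shape of the workflow:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
features = ["feature_1", "feature_2"]

# Training data: incumbents (is_incumbent=1) stacked on the baseline pool (0).
# No demographics are needed at this stage; the label is the outcome of interest.
train_df = pd.DataFrame(rng.normal(size=(200, 2)), columns=features)
train_df["is_incumbent"] = (train_df["feature_1"] + rng.normal(size=200) > 0).astype(int)

clf = LogisticRegression().fit(train_df[features], train_df["is_incumbent"])

# Debias set: held out, demographically labeled, with *no* outcome labels.
# It only gets scored, so we can see who would pass a (hypothetical) cutoff.
debias_df = pd.DataFrame(rng.normal(size=(100, 2)), columns=features)
debias_df["gender"] = rng.choice(["F", "M"], size=100)

debias_df["score"] = clf.predict_proba(debias_df[features])[:, 1]
debias_df["would_pass"] = debias_df["score"] >= 0.5  # arbitrary threshold

# Fairness check: compare pass rates across the demographic groups.
print(debias_df.groupby("gender")["would_pass"].mean())
```

From there you'd feed those group labels and model scores into this package's bias tests; the pandas check above is just to show where the debias set enters the picture.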

There is a lot of debate in the fairness community on ways to account for fairness. This package is an example of how to test for statistical parity, known in the EEOC language as "avoidance of adverse impact". I'll plug here that statistical parity is the law of the land in the USA for hiring (i.e., your algorithm should approach a 1:1 selection ratio across groups).
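
For concreteness, the usual back-of-the-envelope check for that is the selection-rate ratio (the EEOC 4/5ths rule). With made-up numbers, the arithmetic looks like this (plain Python, not a call into this package):

```python
# Hypothetical selection outcomes for two groups (numbers invented for illustration).
applied  = {"group_a": 100, "group_b": 100}
selected = {"group_a": 40,  "group_b": 28}

# Selection rates and the adverse impact ratio (lowest rate / highest rate).
rates = {g: selected[g] / applied[g] for g in applied}    # 0.40 vs 0.28
impact_ratio = min(rates.values()) / max(rates.values())  # 0.28 / 0.40 = 0.70

# The 4/5ths (80%) rule of thumb flags ratios below 0.8; a 1:1 ratio is the ideal.
print(rates, round(impact_ratio, 2), impact_ratio >= 0.8)
```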

Now, there are other people in the machine learning community who propose an equalized odds framework. The basic assumption there is that fairness should be judged on correctness: the model's error rates should look the same across groups. So if your model is 85% accurate overall, it should be 85% accurate for White people and 85% accurate for Black people. Totally makes sense when you consider unbalanced training data. Say that n_white = 100 and n_black = 10. The goal of an equalized odds framework is to prevent a 91% accurate model that selects all White people and misclassifies all Black people. Conversely, though, you could have a model that is 100% accurate on the training data that still selects White applicants at a higher rate than Black applicants when in production, which violates the statistical parity approach. There's been a ton of debate on which approach matters more in which context (Moritz Hardt and Solon Barocas have written several great articles on this topic).
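
Putting numbers on that 100-vs-10 scenario (all invented, of course), the contrast between the two framings fits in a few lines:

```python
# Made-up numbers from the example: 100 White and 10 Black candidates,
# all of whom are truly qualified (true label = 1 for everyone).
n_white, n_black = 100, 10

# A model that selects every White candidate and no Black candidates:
correct_white, correct_black = n_white, 0

overall_accuracy = (correct_white + correct_black) / (n_white + n_black)  # ~0.91
accuracy_white = correct_white / n_white                                  # 1.00
accuracy_black = correct_black / n_black                                  # 0.00

# Equalized odds flags the 1.00-vs-0.00 gap even though overall accuracy looks fine.
# Conversely, a model that is 100% accurate per group can still select White
# candidates at a higher *rate*, which only the statistical-parity check would catch.
```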

So that's a long way of saying that you'd need to conduct data collection on your target population that has target labels (e.g., asthma/no asthma; hire/no hire) and then data collection on a broader demographic population that is applicable to your use-case (other hospitals, other tech jobs). The broader demographic population does not require labels, although if you were lucky enough to get labels you could greatly expand the possible algorithms you could try to reduce bias (which are not included in this package...yet).