ihmeuw / pseudopeople

pseudopeople is a Python package that generates realistic simulated data about a fictional United States population, designed for use in testing entity resolution (record linkage) methods or other data science algorithms at scale.
https://pseudopeople.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Reducing Linkage Bias by Race #434

Closed: Jo-Lam closed this issue 1 month ago

Jo-Lam commented 1 month ago

What is the name of your project?

Reducing Linkage Bias by Race

What is the purpose of your project?

The purpose of our project is to compare probabilistic record linkage strategies to identify which settings are most sensitive to the differential distribution of errors by race and ethnicity. Linkage errors, typically described as missed matches and false matches, occur disproportionately more often in some population groups than in others. If the compromised representativeness of these large-scale linked administrative data goes unrecognised, population policy derived from these data may not serve all populations equally.

In my previous work, I developed methods to corrupt identifiers (such as names) in a way that depends on attribute variables (such as race and ethnicity), using a birth cohort in the United Kingdom. That work demonstrated the utility of the data generation, corruption and linkage framework, but it was limited by the small sample size (10,000) and by the lack of racial diversity in the dataset, which prevented a proper assessment of my proposed linkage methods.

On top of the corruption configurations available through pseudopeople, I intend to further corrupt the forenames and surnames in the simulated data in a way that depends on each individual's race and ethnicity. I intend to create multiple copies of the corrupted dataset, with varying levels of corruption, to compare whether my proposed linkage method outperforms existing approaches in reducing linkage bias by race.
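A minimal sketch of the attribute-dependent corruption described above, written in plain Python so it runs without pseudopeople installed. Everything in it is an assumption made for illustration: the field names `first_name`, `last_name` and `race_ethnicity`, the placeholder group labels, the per-group rates in `BASE_RATES`, and the use of a simple character transposition as the corruption. It is not the method used in the project, and it does not call the pseudopeople API.

```python
import random

# Hypothetical per-group probabilities that a name field gets corrupted.
# Group labels and values are placeholders, not estimates from any data.
BASE_RATES = {"Group A": 0.02, "Group B": 0.06}


def swap_adjacent(name, rng):
    """Transpose two adjacent characters, a simple stand-in for a typo."""
    if len(name) < 2:
        return name
    i = rng.randrange(len(name) - 1)
    return name[:i] + name[i + 1] + name[i] + name[i + 2:]


def corrupt_records(records, scale, seed=0):
    """Return a corrupted copy of records.

    Each name field is corrupted with probability
    scale * BASE_RATES[group], capped at 1. The input is not modified.
    """
    rng = random.Random(seed)
    out = []
    for rec in records:
        new = dict(rec)
        p = min(1.0, scale * BASE_RATES.get(rec["race_ethnicity"], 0.0))
        for field in ("first_name", "last_name"):
            if rng.random() < p:
                new[field] = swap_adjacent(new[field], rng)
        out.append(new)
    return out


# Toy records with assumed field names.
records = [
    {"first_name": "Maria", "last_name": "Lopez", "race_ethnicity": "Group A"},
    {"first_name": "James", "last_name": "Smith", "race_ethnicity": "Group B"},
]

# One copy per corruption level; scale 0 leaves the data untouched.
copies = {scale: corrupt_records(records, scale, seed=42) for scale in (0, 1, 5, 20)}
```

Scaling a single set of per-group base rates keeps the ratio of error rates between groups fixed across copies until the cap at 1 is reached, so the copies differ only in overall corruption level. Varying the ratio itself would need a separate parameter.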

Who is involved in the project? Which of these people will have direct access to the pseudopeople input data?

This project is part of my PhD at University College London, United Kingdom. I am supervised by Prof. Katie Harron (UCL), Dr Ruth Blackburn (UCL), Prof. Mario Cortina Borja (UCL) and Prof. Rob Aldridge (IHME, UW).

Prof. Katie Harron is a leading expert in data linkage methods and evaluation. Prof. Mario Cortina Borja is a very experienced statistician in modelling, simulation and data perturbation. Dr Ruth Blackburn is a public health expert with rich experience working with UK public health records and administrative data systems. Prof. Rob Aldridge oversees my work on linkage evaluation and has contextual knowledge of US-based health administrative data.

As my supervisors, all of them will have an advisory role in my project, but they will not have direct access to the pseudopeople input data.

What funding is the project under? What expectations with respect to open access and access to data come with that funding?

Our project is funded by the Wellcome Trust. Essentially, the funding terms mean that we have an obligation to share the corrupted, simulated dataset used for linkage and linkage evaluation. This is not the same as sharing the pseudopeople data; it is simply those variables and rows from the merged dataset that are used in the final analysis.

We commit to:

What data would you like to request?

Other data - more explanation

Ideally, I would like to link the population data to a health administrative dataset, such as the National Vital Statistics System.

aflaxman commented 1 month ago

This sounds great to me. Thanks for providing such clear details. @Ironholds : do you want any additional information?

aflaxman commented 1 month ago

I think @Ironholds is going to approve of this, so to keep this process moving, @Jo-Lam , can you email me at abie@uw.edu?

Ironholds commented 1 month ago

@aflaxman I am!

aflaxman commented 1 month ago

Access links sent by email. :)