ihmeuw / pseudopeople

pseudopeople is a Python package that generates realistic simulated data about a fictional United States population, designed for use in testing entity resolution (record linkage) methods or other data science algorithms at scale.
https://pseudopeople.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Data access request #205

Closed aalexandersson closed 11 months ago

aalexandersson commented 1 year ago

What is the name of your project?

Comparing the accuracy and speed of Match*Pro, fastLink, and splink at the Florida Cancer Data System

What is the purpose of your project?

Use the pseudopeople Python package to compare the accuracy and speed of Match*Pro, fastLink, and splink for linkage data requests at the Florida Cancer Data System (FCDS). Medium-to-large linkages at the FCDS mean using input datasets with approximately 250,000 × 4,000,000 records, which in terms of size corresponds to the simulated "Rhode Island" pseudopeople population of 1,000,000 people. The FCDS reports to the Florida Department of Health (FDOH).

The FCDS uses Match*Pro for the NAACCR Virtual Pool Registry Cancer Linkage System (VPR-CLS) "Phase 1" linkages, which do not use a clerical review. In contrast, "VPR-CLS Phase 2" linkages use a clerical review for more accuracy. The FCDS uses fastLink for regular linkage data requests, including but not limited to "VPR-CLS Phase 2" linkages, because fastLink was more accurate in prior FCDS testing. The FCDS recently used preliminary "Rhode Island"-sized data from Abraham Flaxman, which was very helpful in determining that splink is a feasible alternative to fastLink.

Who is involved in the project? Which of these people will have direct access to the pseudopeople input data?

- Anders Alexandersson - Senior Research Associate (FCDS) - Direct access (main person to work with the data)
- Brad Wohler - Manager of Statistics (FCDS) - Direct access
- David Lee - Project Director and Principal Investigator (FCDS) - Direct access
- Gary Levin - Deputy Project Director (FCDS) - Direct access
- Mark Rudolph - Manager of Computers/Systems Programmer (FCDS) - Direct access (for security reasons)
- Heather Lake-Burger - Registries and Surveillance Administrator (FDOH) - NO direct access (will receive the report with findings)

Contact info at https://fcds.med.miami.edu/inc/staff.shtml.

What funding is the project under? What expectations with respect to open access and access to data come with that funding?

The FCDS is funded by FDOH and the Centers for Disease Control and Prevention’s National Program of Cancer Registries (CDC-NPCR). The project is funded by FDOH (Contract CODJU) and CDC through the NPCR (DP003872-04).

The funding does not come with stated expectations with respect to open access and access to data. However, the end goal of the FCDS is to have open access (public) data for fully transparent comparisons of the accuracy and speed of probabilistic record linkage software using simulated (artificial) data. Therefore, if the project is successful, it is realistic to expect that the FCDS and FDOH would like to 1) share the findings in a report with the Match*Pro, fastLink, and splink developers, and 2) include some individual-level data in the report, for example a listing of the first or last 5 records.

We commit to:

What data would you like to request?

Other data - more explanation

The FCDS needs more noise added to the already provided "noisy" pre-pseudopeople "Rhode Island" level data from Abraham Flaxman in at least two ways:

  1. The major limitation of the provided data for the FCDS is that the simulated Social Security Number (SSN) has no errors, only 15% missingness. The FCDS needs to compare partial matches in SSN using the Damerau-Levenshtein string distance in the three software packages because the FCDS often has incomplete access to SSN data, and the SSNs it does have contain, say, 3-5% noise errors such as typos (including transpositions) and fake/wrong use such as the SSN of a family member.

  2. The FCDS also needs some noise in date of birth, which currently has no noise in the provided "noisy" dataset. That is not realistic.
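
To illustrate why item 1 asks for transposition-aware comparison, here is a minimal sketch of the optimal string alignment distance, a common restricted variant of Damerau-Levenshtein. It is not taken from Match*Pro, fastLink, or splink, whose implementations may differ; the function name `osa_distance` and the SSN-like digit strings are made up for illustration.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein).

    Counts insertions, deletions, substitutions, and transpositions of two
    adjacent characters, each as one edit.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] holds the distance between a[:i] and b[:j].
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Adjacent transposition, e.g. "56" typed as "65".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


if __name__ == "__main__":
    true_ssn = "123456789"    # made-up value, not a real SSN
    transposed = "123465789"  # digits 5 and 6 swapped
    wrong_digit = "123456780" # last digit mistyped
    print(osa_distance(true_ssn, transposed))
    print(osa_distance(true_ssn, wrong_digit))
```

Under this measure a swap of two adjacent digits counts as a single edit, whereas plain Levenshtein distance would count it as two substitutions. That difference is why transposition typos in SSN matter when comparing partial matches.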

Update: We are using Python 3.11. Currently, pseudopeople is not compatible with Python 3.11.

Ironholds commented 1 year ago

This looks good to me - @aflaxman ?

aflaxman commented 1 year ago

Agree, thanks for this request. I'll follow up via email with details on how to access the Rhode Island data as well as how to configure noise in the SSN and DOB as you have described. :)

aflaxman commented 11 months ago

I forgot to close this issue when I sent @aalexandersson his data access link... closing now.

aalexandersson commented 2 months ago

The report has been completed and is now under review by FDOH. I will present the findings at the 2024 IPDLN conference in September. I expect the report to be public by then. I will attend your @aflaxman workshop at IPDLN and will be happy to share an unofficial copy of the report with you before then, if you would like.

I have an ask: May I share the derived dataframes df1 and df2 privately with Robin Linacre, the lead developer of Splink? I upgraded Splink from version 3.9.15 to 4.0.0 but ran into an issue with Splink that is difficult to reproduce without the dataframes df1 and df2. See Splink discussion issue 2316. I am open to other suggestions, for example, to share the data with you and see if you can reproduce the Splink issue. Please let me know what you prefer.

aflaxman commented 2 months ago

@Ironholds : this is a pretty interesting edge case that we have not run into yet --- it is exactly the sort of use of pseudopeople that I was hoping this simulated data would enable. Maybe a way to think about it is a modification to @aalexandersson 's original proposal, to add Robin Linacre to the list of people with access to pseudopeople data for this project. Do you have any concerns or clarifying questions about sharing df1 and df2?

aalexandersson commented 2 months ago

Could the Splink discussion issue 2316 be related to the pseudopeople requirement of numpy < 2.0? I will try to test this. Are you able to use pseudopeople==1.1.1 and splink==4.0.0 in requirements.txt?
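
When checking whether a version combination like this is actually what ended up installed, one option is to print the versions the environment resolved. The following is a minimal sketch using only the Python standard library; the helper name `installed_versions` is hypothetical, and the package list simply mirrors the packages named in this thread.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: installed version string, or None if missing}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Packages discussed in this thread; adjust the list as needed.
    for name, ver in installed_versions(["pseudopeople", "splink", "numpy"]).items():
        print(f"{name}: {ver or 'not installed'}")
```

Including this output in a bug report makes it easier for others to tell whether a problem depends on the numpy version that was installed alongside pseudopeople and splink.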

aflaxman commented 2 months ago

I did a quick test of pseudopeople / numpy versions on Google Colab, and it seems quite fussy about versions right now! Since @Ironholds has not objected to your request to share the df1 and df2 data, you may proceed with sharing this pseudopeople output with Robin Linacre.

Ironholds commented 1 month ago

I have not because I don't!