ihmeuw / pseudopeople

pseudopeople is a Python package that generates realistic simulated data about a fictional United States population, designed for use in testing entity resolution (record linkage) methods or other data science algorithms at scale.
https://pseudopeople.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Requesting Large Scale Simulated Data #377

Closed by joyantabasak13 9 months ago

joyantabasak13 commented 9 months ago

What is the name of your project?

Improved Use of Clustering Methods for Record Linkage

What is the purpose of your project?

The goal of this project is to develop efficient, scalable, and robust record linkage algorithms. Given multiple data sets, the record linkage problem is to cluster their records such that each cluster contains all the information about a single entity and no information about any other. This problem has numerous applications in domains such as healthcare, law enforcement, medicine, and census data analysis. The performance of record linkage algorithms is measured with two metrics: run time and accuracy. Record linkage has been studied extensively and numerous algorithms have been proposed, but these algorithms take a very long time, especially when the input data sets are large. Many applications of interest call for real-time or near-real-time performance, so there is a crucial need for novel record linkage algorithms that are very fast while maintaining very good accuracy. This project aims to develop such algorithms.

Who is involved in the project? Which of these people will have direct access to the pseudopeople input data?

Sanguthevar Rajasekaran, a professor and head of the Computer Science and Engineering Department at the University of Connecticut, is leading the team. Rajasekaran is a pioneer in randomized parallel algorithms and big data. Co-principal investigator Ofer Harel is a professor and associate dean of Research and Graduate Affairs at the University of Connecticut, with specific expertise in incomplete-data techniques. Sartaj Sahni, also a co-principal investigator, is a distinguished professor in the University of Florida's Department of Computer and Information Science and Engineering; his research publications and patents cover the design and analysis of efficient algorithms, parallel computing, interconnection networks, design automation, and medical algorithms. In addition, the following researchers in Dr. Rajasekaran's lab at the University of Connecticut are working on this project in various capacities: Joyanta Basak, Graduate Research Assistant, University of Connecticut; Ahmed Soliman, Research Assistant, University of Connecticut; and Nachket Deo, Research Assistant, University of Connecticut.

What funding is the project under? What expectations with respect to open access and access to data come with that funding?

The project is funded by the US Census Bureau. We have written a cooperative plan and a data-sharing and management plan to disseminate our research findings to the Census Bureau as well as to the wider research community. We will provide open access to the algorithms and record linkage tools we are developing in this project.

We commit to:

What data would you like to request?

Other data - more explanation

Our existing sequential algorithms can efficiently link several million records within a reasonable amount of time. We recently developed a few parallel record linkage algorithms and need larger datasets to assess their performance. Ideally, we would like datasets of 25 million, 50 million, 75 million, and 100 million records before testing on the full US dataset (about 330 million records). We are requesting access to datasets at these scales, or as close to them as possible, or guidelines for generating datasets of these scales from the full US data.

Ironholds commented 9 months ago

So the goal is to use the data as a tool for testing linkage, but not dig into the data itself? (that is: you do not have an interest in the specific individual "people" in pseudopeople?)

joyantabasak13 commented 9 months ago

> So the goal is to use the data as a tool for testing linkage, but not dig into the data itself? (that is: you do not have an interest in the specific individual "people" in pseudopeople?)

@Ironholds Right. We are interested in linking records and evaluating our algorithms. The linking process requires comparing similar attributes (e.g., first name, age) across different records. The ideal output is a set of groups of records, where each group contains all the records belonging to a single person and no others. Our scope ends there; we do not intend to go further and analyze the grouped records of any individual person.
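
To make the intended output concrete, here is a minimal, purely illustrative sketch of grouping records by attribute agreement. This is not the project's actual algorithm; the toy records, field names, and match rule are assumptions made only for the example.

```python
from itertools import combinations

# Toy records with hypothetical attributes; real datasets have many more fields.
records = [
    {"id": 1, "first_name": "John", "last_name": "Smith", "age": 42},
    {"id": 2, "first_name": "Jon", "last_name": "Smith", "age": 42},
    {"id": 3, "first_name": "Mary", "last_name": "Jones", "age": 35},
]

def similar(a, b):
    """Crude illustrative match rule: same last name, same age, same first initial."""
    return (
        a["last_name"] == b["last_name"]
        and a["age"] == b["age"]
        and a["first_name"][0] == b["first_name"][0]
    )

# Union-find so that transitively matching records end up in the same cluster.
parent = {r["id"]: r["id"] for r in records}

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

for a, b in combinations(records, 2):
    if similar(a, b):
        parent[find(a["id"])] = find(b["id"])

clusters = {}
for r in records:
    clusters.setdefault(find(r["id"]), []).append(r["id"])

print(list(clusters.values()))  # e.g. [[1, 2], [3]]: each cluster represents one person
```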

Ironholds commented 9 months ago

Makes sense; good for me @aflaxman

aflaxman commented 9 months ago

Agree! I'll send access details to @joyantabasak13 at the UConn email I have for you, if that still works.

joyantabasak13 commented 9 months ago

Just some feedback.

  1. The full US data file is quite large; the version that was shared with me is 811 GB. It might be useful to include some kind of checksum file (e.g., MD5) so that one can verify that the download completed without errors (a minimal verification sketch follows this list).

  2. It might be useful to mention the space requirements (zipped and unzipped) somewhere in the documentation. Because of the sheer size of the data, it is not workable on machines with a typical amount of disk space, so providing this information would help prospective users plan.
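
As a rough illustration of the verification suggested in point 1, the sketch below streams a large file through a hash and compares it to a published digest. The file name and expected digest are placeholders, and MD5 is just one possible algorithm; the checksums pseudopeople actually publishes may use a different one.

```python
import hashlib

def file_checksum(path, algorithm="md5", chunk_size=1024 * 1024):
    """Compute a checksum of a (possibly very large) file by streaming it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical file name and published checksum, for illustration only.
expected = "0123456789abcdef0123456789abcdef"
actual = file_checksum("pseudopeople_simulated_population_usa.zip")
print("download OK" if actual == expected else "checksum mismatch; re-download the file")
```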

zmbc commented 9 months ago

Hi @joyantabasak13!

You can find checksums here: https://pseudopeople.readthedocs.io/en/latest/simulated_populations/index.html#validating-the-simulated-population-data

Point 2 is a great suggestion; we should probably add that information to the page I linked. Thanks for your feedback!