ihmeuw / pseudopeople

pseudopeople is a Python package that generates realistic simulated data about a fictional United States population, designed for use in testing entity resolution (record linkage) methods or other data science algorithms at scale.
https://pseudopeople.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Data access request from DART project #472

Closed XingqiaoWang closed 1 month ago

XingqiaoWang commented 1 month ago

What is the name of your project?

From Smart Curation to Socially Aware Decision Making

What is the purpose of your project?

This project aims to establish a consortium of Arkansas researchers focused on advancing excellence in data analytics. It seeks to develop a statewide Data Science and Analytics educational ecosystem by creating consistent, modular education in data science, collaborating with industry, and enhancing the research competitiveness of Arkansas. Key research goals include increasing the speed of data curation, enhancing privacy techniques, improving the interpretability of machine learning processes, and developing inclusive data science curricula.

Who is involved in the project? Which of these people will have direct access to the pseudopeople input data?

Dr. John Talburt, Dr. Xiaowei Xu, and Dr. Mariofanna Milanova are professors at UALR and serve as PIs. Dr. Xingqiao Wang, a postdoctoral fellow, will have direct access to the pseudopeople input data.

What funding is the project under? What expectations with respect to open access and access to data come with that funding?

The project is funded by the National Science Foundation (NSF) under the 'AEDC (NSF EPSCOR): Robust and Trusted Data Analytics' grant. This grant requires that all data generated be made publicly accessible within six months after the project’s completion. We are committed to complying with these guidelines and ensuring data is available to other researchers following FAIR principles.

We commit to:

What data would you like to request?

Other data - more explanation

No response

aflaxman commented 1 month ago

Thanks for your interest in using pseudopeople! Can you elaborate on what you will be using the Full US pseudopeople data for in this project, and what parts of it will be made publicly accessible?

XingqiaoWang commented 1 month ago

Thank you for the opportunity to elaborate. In this project, we are utilizing an embedding-based approach to conduct entity resolution, aiming to improve accuracy and scalability by leveraging synthetic data. Currently, our evaluations are based on the 1 million synthetic records provided by the US Census, which have been instrumental in benchmarking our methods. However, to further refine and validate our approach, we need a larger dataset that more closely approximates real-world conditions.

The Full US pseudopeople dataset would allow us to scale up our testing and explore how our entity resolution method performs in terms of both recall and precision on a broader range of data variability. Additionally, this would enable us to better assess our model's adaptability across diverse entity types, households, and other synthetic entity configurations.
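As an illustration of the precision and recall evaluation mentioned above, here is a minimal sketch of pairwise scoring for an entity resolution result. It is not the project's actual evaluation code: the function names, the record IDs, and the toy clusters are all hypothetical, and it assumes that both the predicted and the ground-truth results are expressed as clusters of record IDs.

```python
from itertools import combinations


def linked_pairs(clusters):
    """Return the set of unordered record-ID pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        # Sort so each pair has one canonical orientation, e.g. ("r1", "r2").
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_precision_recall(predicted_clusters, true_clusters):
    """Score predicted clusters against ground truth at the record-pair level."""
    predicted = linked_pairs(predicted_clusters)
    truth = linked_pairs(true_clusters)
    true_positives = len(predicted & truth)
    # Convention assumed here: an empty pair set scores 1.0 rather than dividing by zero.
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    return precision, recall


# Hypothetical toy data: six records, three true entities.
true_clusters = [{"r1", "r2", "r3"}, {"r4"}, {"r5", "r6"}]
predicted_clusters = [{"r1", "r2"}, {"r3", "r4"}, {"r5", "r6"}]

precision, recall = pairwise_precision_recall(predicted_clusters, true_clusters)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Enumerating every pair is quadratic in cluster size, so at the scale of the full US dataset a real evaluation would likely need blocking or sampling. This sketch only shows what the two metrics measure.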

For public access, we plan to share only aggregated performance metrics and anonymized results from our experiments, ensuring that no individual data entries from the pseudopeople dataset are directly accessible. This will enable the wider research community to benefit from our findings without compromising the integrity or proprietary nature of the dataset itself.

aflaxman commented 1 month ago

Super, you have our approval. Please email me at abie@uw.edu to proceed. :)