ihmeuw / pseudopeople

pseudopeople is a Python package that generates realistic simulated data about a fictional United States population, designed for use in testing entity resolution (record linkage) methods or other data science algorithms at scale.
https://pseudopeople.readthedocs.io
BSD 3-Clause "New" or "Revised" License

[Data access request]: Workshop demonstrating linkage methods with large-ish data #394

Closed (zmbc closed 4 months ago)

zmbc commented 8 months ago

What is the name of your project?

Workshop demonstrating linkage methods with large-ish data

What is the purpose of your project?

We are considering hosting a workshop in which we demonstrate linkage with medium-size (~1 million row) datasets using different software packages. The aim is to show participants (who will be record linkage practitioners, such as social science researchers) how to use software they may not have used before, and compare the features of different tools. In order to do this in a workshop setting, we need some data that is big enough and isn't actually PII, but realistic. We think the RI data could be a great fit for this.

The linkages we do in the workshop with the pseudopeople-simulated data won't be the focus -- the real goal is for practitioners to apply the lessons they learn messing around with this data to their actual research questions.

Who is involved in the project? Which of these people will have direct access to the pseudopeople input data?

I already have access to the pseudopeople input data, as a member of the pseudopeople team 😃

The major data access request here would be to give workshop participants at a conference (temporary) access to the RI data for use during the workshop. I'm thinking we would frame it like so:

We're giving you temporary access to the RI data for use during this workshop. While these data are simulated, they look realistic, and we don't share these data on the open internet. Please delete these data after the workshop ends. If you'd like to access these data, or the larger USA-scale data, after the conference, you can fill out a data access request on our GitHub repository -- it only takes a few minutes!

What funding is the project under? What expectations with respect to open access and access to data come with that funding?

Cooperative Agreement with the US Census Bureau. I don't believe any open data access requirements come with the funding.

We commit to:

What data would you like to request?

Other data - more explanation

No response

Ironholds commented 7 months ago

@aflaxman is this an internal or 'real' request..? (or just a public documentation of a real, internal request! Which is awesome!)

aflaxman commented 7 months ago

It comes from inside the project, but it is a real request. Zeb is very motivated to figure out an appropriate way to use more than just the publicly available data in this workshop. I think this could be a good way to do it, but if this approach doesn't sound right to you, let's try to refine it.

zmbc commented 7 months ago

Yes, this is real.

I have had some more ideas since I initially wrote this--I think what I would like to do is share the data via Google Drive, and most workshop participants would use this shared version directly via Google Colab. We would un-share it after the workshop.

We would still allow participants to download the data locally if they needed to use software besides Python, with instructions as above to delete afterwards.

Also, we could easily limit the data we share to be only the years and datasets that are necessary for the workshop, though I don't know how much this matters.

Ironholds commented 6 months ago

That makes sense to me; I think for transparency reasons we'd want public documentation (ideally in this thread) of who the workshop participants, and so the potential accessors, are.

zmbc commented 6 months ago

@Ironholds I think one way to do it could be to put our "class roster" here on the day of the workshop. I'm not sure we'll know beforehand who will attend.

Ironholds commented 6 months ago

wfm!

aflaxman commented 4 months ago

Proposal approved, I'm closing this issue. :)

zmbc commented 3 weeks ago

The workshop was a success! Here was our class roster (these people gave consent for it to be shared publicly):

Angeliki Evripidou, Youth Futures Foundation, Senior Analysis Officer
Carl Frederick, Institute for Research on Poverty, University of Wisconsin-Madison
David Grenier, Dir. Data Engineering, Rhode Island Longitudinal Data System (RILDS)
Xindi Hu, Principal Data Scientist, Mathematica
Amy Krefman, Northwestern University
Anders Alexandersson, Florida Cancer Registry
Charlotte Ma, ICES
Tara Whitten, Senior Analyst, Provincial Research Data Services, Alberta SPOR Support Unit
Fei Jiang, The Ohio State University
Nan Wang, ICES
Jeremy Foxcroft, PhD Candidate, University of Guelph
Todd Abraham, Asst. Director Data & Analytics at I2D2, Iowa State University
Jan Savinc, Research Fellow, Edinburgh Napier University & Scottish Centre for Administrative Data Research
Rod Middleton, Associate Professor Disease Registers, Swansea University
Yinshan Zhao, Sr Data Scientist, Popdata BC
Claire Tochel, Research Fellow, University of Edinburgh
Timothy Nielsen, Postdoc Researcher, University of Sydney
Rui Wang, Senior Data Scientist, Mathematica
Tom Prendergast, The Health Foundation
Tetyana Perchyk, Research Fellow, University of Surrey
Winnie Shen, ICES
Joseph Lam, PhD Student/Research Assistant, University College London, UK
Jose Nova, Assoc. Director, Data & Analytics Rutgers University - IPHD
Evelyn Lauren, PhD candidate, Boston University
Shih Hao Lee, Staff Data Scientist, Intuitive Surgical
Lili Wei, Researcher, University of Glasgow
Ayaz Hyder, Data and Integration Lead, Smart Columbus/Community Information Exchange; Associate Professor, College of Public Health, Ohio State University
Susan Burtner, Research Associate, Northwestern University

aflaxman commented 3 weeks ago

[celebrate]