ihmeuw / pseudopeople

pseudopeople is a Python package that generates realistic simulated data about a fictional United States population, designed for use in testing entity resolution (record linkage) methods or other data science algorithms at scale.
https://pseudopeople.readthedocs.io
BSD 3-Clause "New" or "Revised" License

[Data access request]: large simulated datasets (1m and 300m) #413

Closed LacNguyen-Vidoori closed 2 months ago

LacNguyen-Vidoori commented 5 months ago

What is the name of your project?

Testing algorithms for Decennial Census record linking

What is the purpose of your project?

To assist the Census Bureau with the preparation/planning for System of Systems (SoS) Integration for the 2030 Decennial Census, our company, Vidoori Inc., is testing new algorithms that would help with linking Census records from multiple databases, in order to validate Household and Person records during Decennial operations.

Who is involved in the project? Which of these people will have direct access to the pseudopeople input data?

Myself, Lac Nguyen, and my direct supervisor, Mr. Thomas George, head of Vidoori's Data Management department.

What funding is the project under? What expectations with respect to open access and access to data come with that funding?

This is an internal Vidoori project; Mr. George and I will be the only two people who access this dataset and run tests with it on a regular basis. Although the project is funded by Vidoori, no other Vidoori personnel will access or use this dataset in any way.

We commit to:

What data would you like to request?

Other data - more explanation

No response

Ironholds commented 4 months ago

Thanks so much! To answer this request, could you provide (either directly or by pointing to other documents) more information about the System of Systems Integration project?

LacNguyen-Vidoori commented 4 months ago

Good afternoon from Suitland, MD! Apologies for the incompleteness of the questionnaire answers; below is the supplemental info:

The Census Person record-linking algorithms that we are planning to test and integrate using the datasets from pseudopeople are actually just one part of a much bigger Census initiative for 2030: the Decennial Transformation & App Modernization (DTAM) program.

High-level objectives of this program are described here:

https://www.census.gov/programs-surveys/decennial-census/decade/2030/planning-management/plan/2030-census-contracts.html

Please let us know if this would suffice; we'd be more than happy to provide more details on the work otherwise.

Thank you,

Lac Nguyen 301-461-8914


Ironholds commented 4 months ago

Works for me @aflaxman

aflaxman commented 4 months ago

Super, I'll send out a download link. Please email me at abie@uw.edu to let me know where to send it.

aflaxman commented 4 months ago

Link sent and data successfully transferred!

LacNguyen-Vidoori commented 3 months ago

Good morning Abie - got this error upon attempting generate_decennial_census:

(we did specify the source parquet file's dir path according to the downloaded pseudopeople_simulated_population_usa_2_0_0 zip dir structure)

[Screenshot of the error message; image not preserved]

Appreciate your guidance and support!

Regards,

Lac


LacNguyen-Vidoori commented 3 months ago

Sorry - forgot to add that this DataSourceError still persisted when we downgraded to older versions of pseudopeople and utilized the generate function (every single one of them, all the way back to 0.1.0)



aflaxman commented 3 months ago

I'm sorry to hear this is not working for you! Let's see if we can get it sorted out. Can you share the exact code you used that generated this error?

LacNguyen-Vidoori commented 3 months ago

Yessir - here goes:

import pseudopeople as psp

source_directory = "C:\\...\\pseudopeople\\pseudopeople_simulated_population_usa_2_0_0\\pseudopeople_simulated_population_usa_2_0_0"

# source file is decennial_census_99.parquet

df = psp.generate_decennial_census(source=source_directory, config=psp.NO_NOISE, year=2020, engine='pandas')

Thank you,

Lac



aflaxman commented 3 months ago

Maybe there is an issue with your source_directory; the three dots in C:\\...\\pseudopeople look suspicious to me. I recommend changing directory with a pure-Python function first, and then calling the psp.generate function once you have confirmed that the directory change succeeded:

import os

import pseudopeople as psp

source_directory = "C:\\...\\pseudopeople\\pseudopeople_simulated_population_usa_2_0_0\\pseudopeople_simulated_population_usa_2_0_0"
# Raises FileNotFoundError if the path does not exist
os.chdir(source_directory)

df = psp.generate_decennial_census(source='.', config=psp.NO_NOISE, year=2020, engine='pandas')

If there is something wrong with the source_directory string, this should raise FileNotFoundError when you try to os.chdir.
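The same check can be done without changing the working directory. Below is a minimal sketch using only the standard library; the helper name check_source_directory is hypothetical and not part of pseudopeople, and it searches subdirectories because the exact layout of the downloaded data is not assumed here.

```python
from pathlib import Path


def check_source_directory(source_directory: str) -> list[str]:
    """Return the sorted names of .parquet files found under source_directory.

    Raises FileNotFoundError if the path is not an existing directory.
    (Hypothetical helper for illustration; not part of pseudopeople.)
    """
    path = Path(source_directory)
    if not path.is_dir():
        raise FileNotFoundError(f"Not a directory: {source_directory}")
    # rglob also searches subdirectories, since the exact layout is not assumed.
    return sorted(p.name for p in path.rglob("*.parquet"))
```

If this returns an empty list, or raises, the problem is with the path rather than with pseudopeople itself. The returned names can also be used to confirm that the expected parquet file is actually inside the directory being passed as source.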

LacNguyen-Vidoori commented 3 months ago

Good morning Abie - again apologies for the tardiness of this response. Meant to let you know that we've figured out what the issue was: the CHANGELOG.rst file was not included in the source file directory.

(since we were using just one source parquet file out of the 334 as a test, we neglected to copy out the CHANGELOG.rst file to go with it)

Thank you again for your guidance and support!

Lac N.
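For anyone else testing with a single parquet file, the fix described in this comment could be sketched roughly as follows. This is an illustration only: the helper name make_single_file_sample is hypothetical, and it assumes that CHANGELOG.rst sits at the top level of the downloaded source directory, which is what the comment above suggests. The relative path of the parquet file inside the download is left as a parameter because it is not stated in this thread.

```python
import shutil
from pathlib import Path


def make_single_file_sample(full_dir: str, sample_dir: str, parquet_relpath: str) -> Path:
    """Copy one parquet file plus CHANGELOG.rst into a small test directory.

    Hypothetical helper, not part of pseudopeople. Assumes CHANGELOG.rst is
    located at the top level of full_dir.
    """
    full = Path(full_dir)
    sample = Path(sample_dir)
    changelog = full / "CHANGELOG.rst"
    shard = full / parquet_relpath
    # Fail early if either file is missing from the full download.
    for required in (changelog, shard):
        if not required.is_file():
            raise FileNotFoundError(f"Missing required file: {required}")
    # Preserve the file's relative location inside the sample directory.
    target = sample / parquet_relpath
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(shard, target)
    shutil.copy2(changelog, sample / "CHANGELOG.rst")
    return sample
```

The returned directory can then be passed as the source argument in place of the full download.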



aflaxman commented 3 months ago

[like] Abraham D Flaxman reacted to the previous message.

Ironholds commented 2 months ago

@aflaxman is this done/should I close it?

aflaxman commented 2 months ago

I believe so! @LacNguyen-Vidoori: please don't hesitate to re-open if you have additional points to discuss. :)