The code in this replication package reconstructs microdata from a subset of 2010 Census Summary File 1 tables, links those reconstructed data to commercial data and internal 2010 Census data containing personally identifiable information, determines if such links constitute reidentification, and computes statistics related to the reconstruction and reidentification.
The code uses a combination of Python, Gurobi™, SQL, Stata, SAS, and bash scripts. Production runs of this software were performed on Amazon Web Services (AWS) Elastic MapReduce (EMR) clusters and AWS Elastic Compute Cloud (EC2) instances. Using a cluster of 30 `r5.24xlarge` nodes, the reconstruction step takes approximately 3 full days per run. Using a cluster of 25 `r5.24xlarge` nodes, the solution variability analysis takes approximately 14 days. Using a single `r5.24xlarge` node, the reidentification step takes approximately 14 days.
In the instructions for running the software, terms contained within angle brackets (e.g., `<term>`) are to be substituted by the user. Terms beginning with a dollar sign (e.g., `${term}`) are environment variables and may be copied as-is to execute code.
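For example, an instruction written as the first command below would be typed as something like the second; the user name and host shown are hypothetical, not values from this project.
# As written in the instructions (angle-bracket terms are placeholders):
ssh -A <aws_user>@<cluster master address>
# As typed by the user (hypothetical values):
ssh -A jdoe@ec2-198-51-100-1.compute.amazonaws.com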
This project uses both publicly available and confidential data as inputs. The publicly available data consist of the 2010 Census Summary File 1 (2010 SF1) tabulations, which are available at:
https://www2.census.gov/census_2010/04-Summary_File_1/
The necessary files are the zipped 2010 SF1 tables, with filenames `<state>/<st>2010.sf1.zip`. This project used data for all 50 states and the District of Columbia.
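For example, the zipped tables for a state can be downloaded directly from that location. The sketch below is not part of the replication code; it assumes the server's directory names spell out the state name, following the `<state>/<st>2010.sf1.zip` pattern above.
# Hedged sketch: download two states' SF1 zip files; extend the list to all 50 states plus DC.
for st in Alabama/al Wyoming/wy; do
    wget "https://www2.census.gov/census_2010/04-Summary_File_1/${st}2010.sf1.zip"
done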
The confidential data consist of the Census Bureau and commercial data extracts listed in the dataset table below.
The confidential data extracts used by the reidentification code are stored in an AWS S3 bucket at:
${DAS_S3ROOT}/recon_replication/CUI__SP_CENS_T13_recon_replication_data_20231215.zip
where `$DAS_S3ROOT` is an environment variable giving the location of the relevant bucket[^1].
The underlying confidential data that serve as the source of the extracts are available inside the Census Enterprise Data Lake. These confidential data have been available at the Census Bureau for the past 12 years, and are expected to be available for at least another 10 years. The original locations outside of AWS are documented in DMS Project P-7502798.
[^1]: The `DAS_S3ROOT` environment variable is correctly set in properly configured DAS EC2 instances.
Data contained within the replication package are covered under project #P-7502798 in the Census Bureau Data Management System (DMS). Publicly released outputs made from this project were approved for release by the Census Bureau Disclosure Review Board (DRB) under the following DRB approval numbers:
The results of this research rely on both publicly available 2010 Census tabulations and confidential microdata from the 2010 Census and commercial databases. Access to the confidential data is limited to Census Bureau employees and those others with Special Sworn Status who have a work-related need to access the data and are a listed contributor for project P-7502798 in the Census Bureau's Data Management System.
Data Source | Access |
---|---|
2010 Summary File 1 | Publicly available |
2010 Census Edited File (CEF) Extract | Confidential |
2010 Hundred Percent Detail File (HDF) Extract | Confidential |
2010 DAS Experiment 23.1 Microdata Detail File (MDF)[^2] | Confidential |
2010 DAS Experiment 23.1 Reconstructed Microdata Detail File (rMDF)[^2] | Confidential |
Merged Commercial Data[^3] | Confidential |
[^2]: The original MDF for DAS experiment 23.1 was unintentionally deleted. This prevents replication of the MDF-based results from the source file; however, the reformatted files needed for reidentification are available and provided with the other protected data files. The authors will update code and results to use the publicly released April 3, 2023 privacy protected microdata file (PPMF), which used the same DAS settings as experiment 23.1.
[^3]: Although the commercial data come from multiple vendors, those data were harmonized and merged into a single file for use in reidentification.
Data Source | File | Storage Format | Data Format | Data Dictionary |
---|---|---|---|---|
2010 Summary File 1 | `<st>2010.sf1.zip` | zip | fixed-width | 2010 SF1 Documentation |
2010 Census Edited File (CEF) State Extracts | `cef<st><cty>.csv` | csv | csv | recon_replication/doc/cef_dict.md |
2010 Census Edited File (CEF) Persons Extract for Swapping | `swap_pcef.csv` | csv | csv | recon_replication/doc/swap_pcef_dict.md |
2010 Census Edited File (CEF) Housing Extract for Swapping, CSV | `swap_hcef.csv` | csv | csv | recon_replication/doc/swap_hcef_dict.md |
2010 Census Edited File (CEF) Housing Extract for Swapping, SAS | `swap_hcef.sas7bdat` | sas7bdat | sas7bdat | recon_replication/doc/swap_hcef_dict.md |
2010 Hundred Percent Detail File (HDF) Extract | `hdf<st><cty>.csv` | csv | csv | recon_replication/doc/hdf_dict.md |
2010 DAS Experiment 23.1 Reconstructed Microdata Detail File (rMDF) | `r02<st><cty>.csv` | csv | csv | recon_replication/doc/mdf_dict.md |
2010 DAS Experiment 23.1 Microdata Detail File (MDF) | `r03<st><cty>.csv` | csv | csv | recon_replication/doc/mdf_dict.md |
2010 Swap Experiment HI Reconstructed Microdata Detail File (rMDF) | `r04<st><cty>.csv` | csv | csv | recon_replication/doc/mdf_dict.md |
2010 Swap Experiment LO Reconstructed Microdata Detail File (rMDF) | `r05<st><cty>.csv` | csv | csv | recon_replication/doc/mdf_dict.md |
Merged Commercial Data | `cmrcl<st><cty>.csv` | csv | csv | recon_replication/doc/cmrcl_dict.md |
The server that initially housed both the data and the code for the reconstruction and reidentification experiments no longer exists. In transitioning to a new computational environment, the individual commercial data assets used to generate the merged commercial data in the dataset list were not maintained in a way that guarantees versioning. As such, the merged commercial data file that was maintained is considered the original input file for the purposes of this replication archive. The following table gives information on the original commercial assets:
Data Source | File | Storage Format | Data Format | Data Dictionary |
---|---|---|---|---|
2010 Experian Research File | `exp_edr2010.sas7bdat` | sas7bdat | SAS V8+ | recon_replication/doc/cmrcl_exp.md |
2010 InfoUSA Research File | `infousa_jun2010.sas7bdat` | sas7bdat | SAS V8+ | recon_replication/doc/cmrcl_infousa.md |
2010 Targus Fed Consumer Research File | `targus_fedconsumer2010.sas7bdat` | sas7bdat | SAS V8+ | recon_replication/doc/cmrcl_targus.md |
2010 VSGI Research File | `vsgi_nar2010.sas7bdat` | sas7bdat | SAS V8+ | recon_replication/doc/cmrcl_vsgi.md |
Instructions for re-executing the reconstruction and solution variability code in this replication package assume access to an AWS cluster, AWS S3 storage, and a MySQL server for job scheduling.
Instructions for re-executing the reidentification code, which uses data protected under Title 13, U.S.C., assume access to the U.S. Census Bureau's Enterprise Environment. Documenting the computer setup for the Census Bureau's Enterprise environment is beyond the scope of this document, for security reasons.
The documentation above is accurate as of December 15, 2023.
The required Python packages are listed in `recon_replication/requirements.txt`.
Randomness for the various matching experiments in reidentification is controlled by columns of stored uniform draws in the CEF and commercial datasets.
At default settings, and at times due to unexpected bugs in its closed-source code, the Gurobi™ solver used for reconstruction can exhibit mild non-determinism, resulting in small differences between the published results and the results from replication.
Approximate time needed to reproduce the analyses (reproduction is not feasible on a standard desktop machine):
The reconstruction code was last run on a 30-node AWS `r5.24xlarge` cluster. Computation took approximately 3 days for each set of input tables. The solution variability analysis was last run on a 25-node AWS `r5.24xlarge` cluster. Computation took approximately 2 weeks. The reidentification code was last run on a single AWS `r5.24xlarge` node. Computation took approximately 2 weeks. Each `r5.24xlarge` node has 96 vCPUs and 768 GiB of memory.
Reconstruction of the 2010 HDF via the publicly available 2010 SF1 table files and the computation of subsequent solution variability measures do not require access to the Census Bureau's Enterprise Environment. The instructions below assume access to Amazon Web Services (AWS), a cluster similar in size to the environment described above, an S3 bucket to hold the necessary SF1 input tabulations, and the necessary Python packages. Additionally, these steps require a license to use the Gurobi™ optimization software; a free academic license is available.
Reidentification of a reconstructed 2010 HDF file (rHDF) requires access to the sensitive data assets given in the dataset list. The instructions below assume access to those data, a server within the Census Enterprise environment with resources on par with a single AWS EC2 `r5.24xlarge` node, and that the necessary Python packages have been installed. If the rHDF and solution variability results were created outside the Census Enterprise environment, then the replicator will need to work with Census staff to have their data files ingested.
Access to AWS requires creation of an account. Once the account is created, replicators should follow instructions for creating an AWS EMR cluster.
Reconstruction via an AWS cluster requires that the necessary SF1 input files exist within an AWS Simple Storage Service (S3) bucket. Replicators should follow instructions for creating an S3 bucket.
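For instance, the bucket can be created and the zipped SF1 tables staged with the AWS CLI. In the hedged sketch below, the bucket name and key prefix are placeholders; the exact prefix expected by the reconstruction code is not specified here.
# Hedged sketch: create a bucket and stage one state's SF1 input (names are placeholders).
aws s3 mb s3://<bucket-name>
aws s3 cp al2010.sf1.zip s3://<bucket-name>/<sf1-prefix>/al2010.sf1.zip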
The reconstruction software uses SQL, via MySQL, to manage the workload across the AWS cluster.
Replicators should follow instructions for creating a MySQL server.
The instructions below assume that replicators are installing MySQL on the master node of the AWS cluster, but replicators may choose to have a dedicated AWS EMR or EC2 instance for the MySQL server if they prefer. Then set up the desired database using the provided schema, `recon_replication/recon/schema_common.sql`.
This can be done with the following command:
mysql -u <ROOT_USERNAME> -p <DB_NAME> < recon_replication/recon/schema_common.sql
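To confirm that the schema loaded, the newly created tables can be listed; this check is not part of the replication scripts.
# Hedged check: list the tables created by schema_common.sql.
mysql -u <ROOT_USERNAME> -p <DB_NAME> -e "SHOW TABLES;"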
The reconstruction code is located in `recon_replication/recon`. The instructions assume that the user will store reconstruction results in an AWS S3 bucket `<S3ROOT>`.
ssh -A <aws_user>@<cluster master address>
git clone git@github.com:uscensusbureau/recon_replication.git
cd recon_replication
git pull
git submodule update --init --recursive
cd ~
ln -s recon_replication/recon
Check out the `main` branch:
cd recon
git checkout main
MYSQL_HOST: <MYSQL Hostname>
MYSQL_DATABASE: <MYSQL Database Name>
MYSQL_USER: <MYSQL Username>
MYSQL_PASSWORD: <MYSQL Password>
DAS_S3ROOT: <aws location to load/read files>
GUROBI_HOME: <Gurobi™ home>
GRB_APP_NAME: <Gurobi™ App Name>
GRB_LICENSE_FILE: <Gurobi™ license file location>
GRB_ISV_NAME: <Gurobi™ ISV name>
BCC_HTTPS_PROXY: <BCC HTTPS proxy (may not be needed for release)>
BCC_HTTP_PROXY: <BCC HTTP proxy (may not be needed for release)>
AWS_DEFAULT_REGION: <default AWS region, e.g., us-gov-west-1>
DAS_ENVIROMENT: <DAS environment, e.g., ITECB>
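How these settings are supplied may vary by environment; the bash sketch below assumes they are provided as plain shell environment variables before `dbrtool.py` is invoked, with the values still to be substituted by the user (only a subset is shown).
# Hedged sketch: export the settings listed above; values are placeholders.
export MYSQL_HOST=<MYSQL Hostname>
export MYSQL_DATABASE=<MYSQL Database Name>
export MYSQL_USER=<MYSQL Username>
export MYSQL_PASSWORD=<MYSQL Password>
export DAS_S3ROOT=<aws location to load/read files>
export GRB_LICENSE_FILE=<Gurobi license file location>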
$(./dbrtool.py --env)
./dbrtool.py --reident hdf_bt --register
python s0_download_data.py --reident hdf_bt --all
aws s3 cp 2010-re/hdf_bt/dist/ <S3ROOT>/2010-re/hdf_bt/dist/ --recursive
./dbrtool.py --reident hdf_bt --step1 --latin1
./dbrtool.py --reident hdf_bt --step2
./dbrtool.py --reident hdf_bt --launch_all
./dbrtool.py --reident hdf_bt --status
./dbrtool.py --reident hdf_bt --launch_all
./dbrtool.py --reident hdf_bt --runbg --step5 --step6
aws s3 ls <S3ROOT>/2010-re/hdf_bt/rhdf_bt.zip
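The `aws s3 ls` check above reports whether the reconstructed output exists at the moment it is run; a hedged convenience for blocking until the file appears (the same pattern applies to the `hdf_b` output below):
# Hedged convenience: poll every 10 minutes until the reconstructed zip is present in S3.
until aws s3 ls <S3ROOT>/2010-re/hdf_bt/rhdf_bt.zip; do sleep 600; done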
./dbrtool.py --reident hdf_b --register
aws s3 cp 2010-re/hdf_bt/dist/ <S3ROOT>/2010-re/hdf_b/dist/ --recursive
./dbrtool.py --reident hdf_b --step1 --latin1
./dbrtool.py --reident hdf_b --step2
The block-only run (`hdf_b`) uses the `blockonly` branch of the `recon_replication` repository:
./dbrtool.py --reident hdf_b --launch_all --branch blockonly
./dbrtool.py --reident hdf_b --status
./dbrtool.py --reident hdf_b --launch_all --branch blockonly
./dbrtool.py --reident hdf_b --runbg --step5 --step6
aws s3 ls <S3ROOT>/2010-re/hdf_b/rhdf_b.zip
cd ~/recon/solution_variability
In the `config.ini` file, add the AWS S3 bucket name to the end of this line: `s3Bucket =`
export SPARK_HOME=/usr/lib/spark && export PYTHONPATH=$PYTHONPATH:$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-src.zip
setsid python block_level_rewriter.py -t text -i <S3ROOT>/2010-re/hdf_bt/work -o solvar/hdf_bt/2010-block-results &> rewriter_out.txt
export SPARK_HOME=/usr/lib/spark && export PYTHONPATH=$PYTHONPATH:$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-src.zip
python -m solvar -d -i solvar/hdf_bt/2010-block-results -o solvar/hdf_b/solvar-out-block --age --demo &> solvar_out_$(date +"%FT%H%M").txt
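Both commands above run detached and write to log files; a hedged sketch for checking on them:
# Hedged sketch: confirm the detached jobs are running and inspect their logs.
pgrep -af "block_level_rewriter|solvar"
tail -n 50 rewriter_out.txt solvar_out_*.txt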
Next, create the tract extract from the reconstructed block-and-tract file:
cd ~/recon_replication
aws s3 cp <S3ROOT>/2010-re/hdf_bt/rhdf_bt.zip .
unzip -j rhdf_bt.zip
python extract_tracts.py rhdf_bt.csv
aws s3 cp rhdf_bt_0solvar_extract.csv <S3ROOT>/2010-re/hdf_bt/
Once the reconstructed HDF files for both experiments and the solution variability results have been created and copied into S3, the cluster may be shut down.
The user must work with Census Bureau staff to ingest any publicly created files into the Census Enterprise Environment. These instructions assume that the files are in an AWS S3 bucket `<CROOT>` accessible from that environment.
ssh -A <server address>
export workdir=<workdir>
export CROOT=<CROOT>
mkdir -p ${workdir}
cd ${workdir}
git clone git@github.com:uscensusbureau/recon_replication.git
Create the `${workdir}/data/reid_module` directory on the EC2 instance:
mkdir -p ${workdir}/data/reid_module/
aws s3 cp ${DAS_S3ROOT}/recon_replication/CUI__SP_CENS_T13_recon_replication_data_20231215.zip ${workdir}
unzip -d ${workdir} ${workdir}/CUI__SP_CENS_T13_recon_replication_data_20231215.zip
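A quick hedged check that the extract unpacked where the later steps expect it (the subdirectory names are inferred from those steps):
# Hedged check: the steps below read from data/reid_module subdirectories under ${workdir}.
ls ${workdir}/data/reid_module
ls ${workdir}/data/reid_module/rhdf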
aws s3 cp ${CROOT}/solvar/scaled_ivs.csv ${workdir}/data/reid_module/solvar/
cd ${workdir}/data/reid_module/rhdf/r00/
aws s3 cp ${CROOT}/2010-re/hdf_bt/rhdf_bt.csv.zip .
unzip -j rhdf_bt.csv.zip
ln -s rhdf_bt.csv r00.csv
cd ${workdir}/data/reid_module/rhdf/r01/
aws s3 cp ${CROOT}/2010-re/hdf_b/rhdf_b.csv.zip .
unzip -j rhdf_b.csv.zip
ln -s rhdf_b.csv r01.csv
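A hedged check that the version symlinks resolve to the unzipped reconstructed files:
# Hedged check: -L dereferences the symlinks created above.
ls -lL ${workdir}/data/reid_module/rhdf/r00/r00.csv ${workdir}/data/reid_module/rhdf/r01/r01.csv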
The remaining steps are run from within `<workdir>`. First, run the reidentification module:
cd ${workdir}/recon_replication/reidmodule/
setsid /usr/bin/python3 runreid.py 40 r00
cd ${workdir}/recon_replication/reidpaper/programs/
setsid stata-se -b runall.do
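Batch-mode Stata typically writes its log next to the do-file (here `runall.log`); a hedged sketch for scanning that log for error codes once the run finishes:
# Hedged check: Stata errors appear in the batch log as r(###); codes.
grep -nE "r\([0-9]+\);" runall.log || echo "no Stata error codes found in runall.log"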
cd ${workdir}/recon_replication/reidpaper/results/
${workdir}/recon_replication/reidpaper/results/CBDRB-FY22-DSEP-004/CBDRB-FY22-DSEP-004.xlsx
cd ${workdir}/recon_replication/results/
Link the approved results spreadsheet into the `in` folder:
ln -s ${workdir}/recon_replication/reidpaper/results/CBDRB-FY22-DSEP-004/CBDRB-FY22-DSEP-004.xlsx in/CBDRB-FY22-DSEP-004.xlsx
stata-se -b make_tables.do
cd ${workdir}/recon_replication/results/out/
cd ${workdir}/recon_replication/suppression/
python recode.py
python suppression.py > suppression_results.txt
${workdir}/recon_replication/suppression/suppression_results.txt
Note: Replication results for swapping are obtained by using the reconstructed swap files found in the dataset list.
cd ${workdir}/recon_replication/reid_swap/
python pairs_driver.py
python swap.py
${workdir}/recon_replication/reid_swap/LO/swapped_us.csv
${workdir}/recon_replication/reid_swap/HI/swapped_us.csv
cd ${workdir}/recon_replication/metrics/
python recode.py --infile ${workdir}/data/reid_module/cef/cef.csv --outfile cef.csv --cef
python metrics.py -c HIconfig.yml
python metrics.py -c LOconfig.yml
python tables.py -v r04 -r False
python tables.py -v r05 -r False
${workdir}/recon_replication/metrics/output/r04/metrics_r04.xlsx
${workdir}/recon_replication/metrics/output/r05/metrics_r05.xlsx
This replication archive reproduces the tabular results listed in the [accompanying spreadsheet](<manuscript/hdsr/20231214-HDSR submission tables and figures.xlsx>).