uscensusbureau / recon_replication


Overview

The code in this replication package reconstructs microdata from a subset of 2010 Census Summary File 1 tables, links those reconstructed data to commercial data and internal 2010 Census data containing personally identifiable information, determines if such links constitute reidentification, and computes statistics related to the reconstruction and reidentification.

The code uses a combination of Python, Gurobi™, SQL, Stata, SAS, and bash scripts. Production runs of this software were performed on Amazon Web Services (AWS) Elastic Map Reduce (EMR) clusters and AWS Elastic Compute Cloud (EC2) instances. Using a cluster of 30 r5.24xlarge nodes, the reconstruction step takes approximately 3 full days per run. Using a cluster of 25 r5.24xlarge nodes, the solution variability analysis takes approximately 14 days. Using a single r5.24xlarge node, the reidentification step takes approximately 14 days.

In the instructions for running the software, terms contained within angle brackets (e.g. <term>) are to be substituted by the user. Terms beginning with a dollar sign (e.g. ${term}) are environment variables and may be copied as-is to execute code.

Data availability and provenance statements

This project uses both publicly available and confidential data as inputs. The publicly available data consist of the 2010 Census Summary File 1 (2010 SF1) tabulations, which are available at:

https://www2.census.gov/census_2010/04-Summary_File_1/

The necessary files are the zipped 2010 SF1 tables, with filenames <state>/<st>2010.sf1.zip. This project used data for all 50 states and the District of Columbia.
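For orientation, the sketch below retrieves a single state's file following this naming convention; the state directory name and two-letter abbreviation shown are example substitutions, and the replication code automates the full download in the reconstruction instructions (s0_download_data.py).

```python
# Illustrative sketch only: fetch one state's zipped SF1 tables following the
# <state>/<st>2010.sf1.zip naming convention described above. The replication
# code automates these downloads (see s0_download_data.py in the instructions).
from urllib.request import urlretrieve

BASE = "https://www2.census.gov/census_2010/04-Summary_File_1"
state, st = "Alabama", "al"   # example substitutions for <state> and <st>
url = f"{BASE}/{state}/{st}2010.sf1.zip"
urlretrieve(url, f"{st}2010.sf1.zip")
print(f"downloaded {url}")
```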

The confidential data consist of the extracts described in the dataset list below.

The confidential data extracts used by the reidentification code are stored in an AWS S3 bucket at: ${DAS_S3ROOT}/recon_replication/CUI__SP_CENS_T13_recon_replication_data_20231215.zip, where $DAS_S3ROOT is an environment variable giving the location of the relevant bucket[^1].

The underlying confidential data that serve as the source of the extracts are available inside the Census Enterprise Data Lake. These confidential data have been available at the Census Bureau for the past 12 years, and are expected to be available for at least another 10 years. The original locations outside of AWS are documented in DMS Project P-7502798.

[^1]: The DAS_S3ROOT environment variable is set correctly on properly configured DAS EC2 instances.

Statement about rights

Data contained within the replication package are covered under project #P-7502798 in the Census Bureau Data Management System (DMS). Publicly released outputs made from this project were approved for release by the Census Bureau Disclosure Review Board (DRB) under the following DRB approval numbers:

Summary of availability

Details on each data source

The results of this research rely on both publicly available 2010 Census tabulations and confidential microdata from the 2010 Census and commercial databases. Access to the confidential data is limited to Census Bureau employees and others with Special Sworn Status who have a work-related need to access the data and are listed contributors for project P-7502798 in the Census Bureau's Data Management System.

| Data Source | Access |
| --- | --- |
| 2010 Summary File 1 | Publicly available |
| 2010 Census Edited File (CEF) Extract | Confidential |
| 2010 Hundred Percent Detail File (HDF) Extract | Confidential |
| 2010 DAS Experiment 23.1 Microdata Detail File (MDF)[^2] | Confidential |
| 2010 DAS Experiment 23.1 Reconstructed Microdata Detail File (rMDF)[^2] | Confidential |
| Merged Commercial Data[^3] | Confidential |

[^2]: The original MDF for DAS experiment 23.1 was unintentionally deleted. This prevents replication of the MDF-based results from the source file; however, the reformatted files needed for reidentification are available and provided with the other protected data files. The authors will update code and results to use the publicly released April 3, 2023 privacy protected microdata file (PPMF), which used the same DAS settings as experiment 23.1.

[^3]: Although the commercial data come from multiple vendors, those data were harmonized and merged into a single file for use in reidentification.

Dataset list

| Data Source | File | Storage Format | Data Format | Data Dictionary |
| --- | --- | --- | --- | --- |
| 2010 Summary File 1 | `<st>2010.sf1.zip` | zip | fixed-width | 2010 SF1 Documentation |
| 2010 Census Edited File (CEF) State Extracts | `cef<st><cty>.csv` | csv | csv | recon_replication/doc/cef_dict.md |
| 2010 Census Edited File (CEF) Persons Extract for Swapping | `swap_pcef.csv` | csv | csv | recon_replication/doc/swap_pcef_dict.md |
| 2010 Census Edited File (CEF) Housing Extract for Swapping, CSV | `swap_hcef.csv` | csv | csv | recon_replication/doc/swap_hcef_dict.md |
| 2010 Census Edited File (CEF) Housing Extract for Swapping, SAS | `swap_hcef.sas7bdat` | sas7bdat | sas7bdat | recon_replication/doc/swap_hcef_dict.md |
| 2010 Hundred Percent Detail File (HDF) Extract | `hdf<st><cty>.csv` | csv | csv | recon_replication/doc/hdf_dict.md |
| 2010 DAS Experiment 23.1 Reconstructed Microdata Detail File (rMDF) | `r02<st><cty>.csv` | csv | csv | recon_replication/doc/mdf_dict.md |
| 2010 DAS Experiment 23.1 Microdata Detail File (MDF) | `r03<st><cty>.csv` | csv | csv | recon_replication/doc/mdf_dict.md |
| 2010 Swap Experiment HI Reconstructed Microdata Detail File (rMDF) | `r04<st><cty>.csv` | csv | csv | recon_replication/doc/mdf_dict.md |
| 2010 Swap Experiment LO Reconstructed Microdata Detail File (rMDF) | `r05<st><cty>.csv` | csv | csv | recon_replication/doc/mdf_dict.md |
| Merged Commercial Data | `cmrcl<st><cty>.csv` | csv | csv | recon_replication/doc/cmrcl_dict.md |

Commercial data provenance

The server that initially housed both the data and code for the reconstruction and reidentification experiments no longer exists. In transitioning to a new computational environment, the individual commercial data assets used to generate the merged commercial data in the dataset list were not maintained in a way that guarantees versioning. As such, the merged commercial data file that was retained is treated as the original input file for the purposes of this replication archive. The following list gives information on the original commercial assets:

| Data Source | File | Storage Format | Data Format | Data Dictionary |
| --- | --- | --- | --- | --- |
| 2010 Experian Research File | `exp_edr2010.sas7bdat` | sas7bdat | SAS V8+ | recon_replication/doc/cmrcl_exp.md |
| 2010 InfoUSA Research File | `infousa_jun2010.sas7bdat` | sas7bdat | SAS V8+ | recon_replication/doc/cmrcl_infousa.md |
| 2010 Targus Fed Consumer Research File | `targus_fedconsumer2010.sas7bdat` | sas7bdat | SAS V8+ | recon_replication/doc/cmrcl_targus.md |
| 2010 VSGI Research File | `vsgi_nar2010.sas7bdat` | sas7bdat | SAS V8+ | recon_replication/doc/cmrcl_vsgi.md |

Computational requirements

Instructions for re-executing the reconstruction and solution variability code in this replication package assume access to an AWS cluster, AWS S3 storage, and a MySQL server for job scheduling.

Instructions for re-executing the reidentification code, which uses data protected under Title 13, U.S.C., assume access to the U.S. Census Bureau's Enterprise Environment. For security reasons, documenting the computer setup for the Census Bureau's Enterprise Environment is beyond the scope of this document.

The documentation above is accurate as of December 15, 2023.

Software requirements

Controlled randomness

Randomness for the various matching experiments in reidentification is controlled by columns of stored uniform draws in the CEF and commercial datasets.
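The sketch below illustrates the idea (it is not the project's code): each record-level random choice is driven by a uniform draw stored alongside the record rather than a fresh random number generator call, so repeated runs select identical records. The column name udraw is hypothetical.

```python
# Sketch of reproducible sampling via a stored uniform draw (column name
# "udraw" is hypothetical, not taken from the CEF or commercial extracts).
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4], "udraw": [0.93, 0.12, 0.48, 0.77]})

# A reproducible 50% subsample for a matching experiment: rerunning this code
# always keeps the same records because the draws are stored, not regenerated.
subsample = df[df["udraw"] < 0.5]
print(subsample)
```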

At default settings, and at times due to unexpected bugs in its closed-source code, the Gurobi™ solver used for reconstruction can exhibit mild non-determinism, resulting in small differences between the published results and the results from replication.
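Replicators who wish to reduce this variation may find it helpful to pin the solver parameters that commonly affect reproducibility. The sketch below is illustrative only and is not the configuration used by the reconstruction code; it requires a Gurobi™ license and the gurobipy package.

```python
# Illustrative only: Gurobi parameters that influence run-to-run
# reproducibility. This is not the reconstruction step's configuration.
import gurobipy as gp

m = gp.Model("toy")
m.Params.Seed = 42      # fix the seed used by internal heuristics
m.Params.Threads = 1    # multi-threading is a common source of variation
m.Params.Method = 1     # pin the LP algorithm (1 = dual simplex)

x = m.addVar(name="x")
m.addConstr(x >= 1)
m.setObjective(x, gp.GRB.MINIMIZE)
m.optimize()
print(x.X)              # optimal value of x
```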

Summary

Approximate time needed to reproduce the analyses on a standard (2022) desktop machine:

Details

The reconstruction code was last run on a 30-node AWS r5.24xlarge cluster. Computation took approximately 3 days for each set of input tables. The solution variability analysis was last run on a 25-node AWS r5.24xlarge cluster. Computation took approximately 2 weeks. The reidentification code was last run on a single AWS r5.24xlarge node. Computation took approximately 2 weeks. Each r5.24xlarge node has 96 vCPUs and 768 GiB of memory.

Description of programs/code

Reconstruction of the 2010 HDF via the publicly available 2010 SF1 table files and the computation of subsequent solution variability measures do not require access to the Census Bureau's Enterprise Environment. The instructions below assume access to Amazon Web Services (AWS), a cluster similar in size to the environment detailed above, an S3 bucket to hold the necessary SF1 input tabulations, and the necessary Python packages. Additionally, these steps require a license to use the Gurobi™ optimization software; a free academic license is available.

Reidentification of a reconstructed 2010 HDF file (rHDF) requires access to sensitive data assets given in the dataset list. The instructions below assume access to those data, a server within the Census Enterprise environment with resources on par with a single AWS EC2 r5.24xlarge node, and that the necessary Python packages have been installed. If the rHDF and solution variability results were created outside the Census Enterprise environment, then the replicator will need to work with Census staff to have their data files ingested.

System setup

AWS EMR cluster creation

Access to AWS requires creation of an account. Once the account is created, replicators should follow instructions for creating an AWS EMR cluster.

S3 bucket creation

Reconstruction via an AWS cluster requires that the necessary SF1 input files exist within an AWS Simple Storage Service (S3) bucket. Replicators should follow instructions for creating an S3 bucket.
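The bucket can also be created programmatically. The sketch below is one way to do so with boto3; it assumes AWS credentials are already configured, and the bucket name and region are placeholders.

```python
# Optional sketch: create the S3 bucket with boto3 instead of the AWS console.
# Assumes AWS credentials are configured; bucket name and region are examples.
import boto3

region = "us-gov-west-1"
s3 = boto3.client("s3", region_name=region)
s3.create_bucket(
    Bucket="<your-recon-bucket>",
    CreateBucketConfiguration={"LocationConstraint": region},
)
```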

MySQL server setup

The reconstruction software uses SQL, via MySQL, to manage the workload across the AWS cluster. Replicators should follow instructions for creating a MySQL server. The instructions below assume that replicators are installing MySQL on the master node of the AWS cluster, but replicators may choose to use a dedicated AWS EMR or EC2 instance for the MySQL server if they prefer. Then set up the desired database using the provided schema, recon_replication/recon/schema_common.sql.
This can be done with the following command: mysql -u <ROOT_USERNAME> -p <DB_NAME> < recon_replication/recon/schema_common.sql
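To confirm that the schema loaded and that the credentials intended for dbrecon_config.json (filled out in the reconstruction instructions below) work, a quick check along the following lines can be run. This is a sketch that assumes the mysql-connector-python package; replace the placeholder values before running.

```python
# Sketch: verify that the database created from schema_common.sql is reachable
# with the credentials that will be entered in dbrecon_config.json. Assumes the
# mysql-connector-python package; replace the placeholders before running.
import mysql.connector

conn = mysql.connector.connect(
    host="<MYSQL Hostname>",
    user="<MYSQL Username>",
    password="<MYSQL Password>",
    database="<MYSQL Database Name>",
)
cur = conn.cursor()
cur.execute("SHOW TABLES")     # lists the tables defined by schema_common.sql
for (table_name,) in cur:
    print(table_name)
conn.close()
```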

List of provided metric results

List of software files

Reconstruction recon_replication/recon

Solution variability

Zero solution variability census tract extraction

MDF and PPMF conversion

Reidentification

Table creation

Suppression

Swapping

Metrics

Instructions

Reconstruction using census block- and census tract-level SF1 tables

The instructions assume that the user will store reconstruction results in an AWS S3 bucket, referred to below as <S3ROOT>.

  1. Log into the AWS cluster
    • ssh -A <aws_user>@<cluster master address>
  2. Clone reconstruction repository into user home directory
    • git clone git@github.com:uscensusbureau/recon_replication.git
  3. Pull and update submodules
    • cd recon_replication
    • git pull
    • git submodule update --init --recursive
  4. Link the reconstruction directory for convenience
    • cd ~
    • ln -s recon_replication/recon
  5. Change to recon directory and ensure that you are on the main branch
    • cd recon
    • git checkout main
  6. Fill out recon_replication/recon/dbrecon_config.json (an illustrative sketch of this file appears after this list)
    • MYSQL_HOST: <MYSQL Hostname>
    • MYSQL_DATABASE: <MYSQL Database Name>
    • MYSQL_USER: <MYSQL Username>
    • MYSQL_PASSWORD: <MYSQL Password>
    • DAS_S3ROOT: <aws location to load/read files>
    • GUROBI_HOME: <Gurobi™ home>
    • GRB_APP_NAME: <Gurobi™ App Name>
    • GRB_LICENSE_FILE: <Gurobi™ license file location>
    • GRB_ISV_NAME: <Gurobi™ ISV name>
    • BCC_HTTPS_PROXY: <BCC HTTPS proxy (may not be needed for release)>
    • BCC_HTTP_PROXY: <BCC HTTP proxy (may not be needed for release)>
    • AWS_DEFAULT_REGION: <default AWS region, e.g., us-gov-west-1>
    • DAS_ENVIROMENT: <DAS environment, e.g., ITECB>
  7. Setup environment variables
    • $(./dbrtool.py --env)
  8. Create new reconstruction experiment and create database tables
    • ./dbrtool.py --reident hdf_bt --register
  9. Download SF1 tables and copy to S3
    • python s0_download_data.py --reident hdf_bt --all
    • aws s3 cp 2010-re/hdf_bt/dist/ <S3ROOT>/2010-re/hdf_bt/dist/ --recursive
  10. Run step1 to create geography files
    • ./dbrtool.py --reident hdf_bt --step1 --latin1
  11. Run step2 to ingest SF1 tables
    • ./dbrtool.py --reident hdf_bt --step2
  12. Resize cluster to 30 core nodes using the directions above
  13. Run steps 3 & 4 to create LP and SOL files for reconstruction
    • ./dbrtool.py --reident hdf_bt --launch_all
  14. Check on status of completed census tracts
    • ./dbrtool.py --reident hdf_bt --status
  15. Relaunch idle clusters after 1 day
    • ./dbrtool.py --reident hdf_bt --launch_all
  16. Run steps 5 & 6 to produce microdata (rHDF)
    • ./dbrtool.py --reident hdf_bt --runbg --step5 --step6
  17. Verify that microdata was copied to S3 bucket
    • aws s3 ls <S3ROOT>/2010-re/hdf_bt/rhdf_bt.zip
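As a convenience, the sketch below writes a dbrecon_config.json containing the fields listed in step 6. It assumes the file is a flat JSON object of string values, which may not match the template shipped in recon_replication/recon, so compare against that template before use; all values are placeholders.

```python
# Sketch only: write a dbrecon_config.json with the fields from step 6.
# Assumes a flat JSON object of string values; check the template in
# recon_replication/recon for the authoritative layout and replace the
# placeholder values before running any reconstruction steps.
import json

config = {
    "MYSQL_HOST": "<MYSQL Hostname>",
    "MYSQL_DATABASE": "<MYSQL Database Name>",
    "MYSQL_USER": "<MYSQL Username>",
    "MYSQL_PASSWORD": "<MYSQL Password>",
    "DAS_S3ROOT": "<aws location to load/read files>",
    "GUROBI_HOME": "<Gurobi home>",
    "GRB_APP_NAME": "<Gurobi App Name>",
    "GRB_LICENSE_FILE": "<Gurobi license file location>",
    "GRB_ISV_NAME": "<Gurobi ISV name>",
    "BCC_HTTPS_PROXY": "<BCC HTTPS proxy>",
    "BCC_HTTP_PROXY": "<BCC HTTP proxy>",
    "AWS_DEFAULT_REGION": "us-gov-west-1",
    "DAS_ENVIROMENT": "ITECB",
}

with open("recon_replication/recon/dbrecon_config.json", "w") as f:
    json.dump(config, f, indent=2)
```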

Reconstruction using census block-level SF1 tables only

  1. Register new reconstruction experiment and create database tables
    • ./dbrtool.py --reident hdf_b --register
  2. Copy SF1 tables to S3
    • aws s3 cp 2010-re/hdf_bt/dist/ <S3ROOT>/2010-re/hdf_b/dist/ --recursive
  3. Run step1 to create geography files
    • ./dbrtool.py --reident hdf_b --step1 --latin1
  4. Run step2 to ingest SF1 tables
    • ./dbrtool.py --reident hdf_b --step2
  5. Resize cluster to 30 core nodes
  6. Run steps 3 & 4 to create LP and SOL files for reconstruction, using blockonly branch of the recon_replication repository
    • ./dbrtool.py --reident hdf_b --launch_all --branch blockonly
  7. Check on status of completed census tracts
    • ./dbrtool.py --reident hdf_b --status
  8. Relaunch idle clusters after 1 day
    • ./dbrtool.py --reident hdf_b --launch_all --branch blockonly
  9. Run steps 5 & 6 to produce microdata (rHDF)
    • ./dbrtool.py --reident hdf_b --runbg --step5 --step6
  10. Verify that microdata was copied to S3 bucket
    • aws s3 ls <S3ROOT>/2010-re/hdf_b/rhdf_b.zip

Solution variability

  1. Change to the solution variability folder
    • cd ~/recon/solution_variability
  2. In the config.ini file, add the AWS S3 bucket name to the end of this line: s3Bucket =
  3. Run the splitter
    • export SPARK_HOME=/usr/lib/spark && export PYTHONPATH=$PYTHONPATH:$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-src.zip
    • setsid python block_level_rewriter.py -t text -i <S3ROOT>/2010-re/hdf_bt/work -o solvar/hdf_bt/2010-block-results &> rewriter_out.txt
  4. Run solution variability module
    • export SPARK_HOME=/usr/lib/spark && export PYTHONPATH=$PYTHONPATH:$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-src.zip
    • python -m solvar -d -i solvar/hdf_bt/2010-block-results -o solvar/hdf_b/solvar-out-block --age --demo &> solvar_out_$(date +"%FT%H%M").txt

Run extract of zero-solution-variability census tracts

  1. Change to the extract folder
    • cd ~/recon_replication
  2. Copy the reconstructed HDF file for the census block-tract experiment to the directory and unzip
    • aws s3 cp <S3ROOT>/2010-re/hdf_bt/rhdf_bt.zip .
    • unzip -j rhdf_bt.zip
  3. Run the extraction
    • python extract_tracts.py rhdf_bt.csv
  4. Copy the census tract extract to S3 if desired
    • aws s3 cp rhdf_bt_0solvar_extract.csv <S3ROOT>/2010-re/hdf_bt/

Shutdown AWS cluster

Once the reconstructed HDF files for both experiments and the solution variability results have been created and copied into S3, the cluster may be shut down.

Reidentification

The user must work with Census Bureau staff to ingest any publicly created files into the Census Enterprise Environment. These instructions will assume that the files are in an AWS S3 bucket <CROOT> accessible from that environment.

  1. Log into an appropriate server in the Census Enterprise environment
    • ssh -A <server address>
  2. Create environment variables for convenience
    • export workdir=<workdir>
    • export CROOT=<CROOT>
  3. Clone reconstruction repository into user work directory
    • mkdir -p ${workdir}
    • cd ${workdir}
    • git clone git@github.com:uscensusbureau/recon_replication.git
  4. Copy and extract confidential data to ${workdir}/data/reid_module on EC2 instance:
    • mkdir ${workdir}/data/reid_module/
    • aws s3 cp ${DAS_S3ROOT}/recon_replication/CUI__SP_CENS_T13_recon_replication_data_20230426.zip ${workdir}
    • unzip -d ${workdir} ${workdir}/CUI__SP_CENS_T13_recon_replication_data_20230426.zip
  5. Copy solution variability results to required location
    • aws s3 cp ${CROOT}/solvar/scaled_ivs.csv ${workdir}/data/reid_module/solvar/
  6. Copy rHDFs from S3 bucket, extract, and create necessary links
    • cd ${workdir}/data/reid_module/rhdf/r00/
    • aws s3 cp ${CROOT}/2010-re/hdf_bt/rhdf_bt.csv.zip .
    • unzip -j rhdf_bt.csv.zip
    • ln -s rhdf_bt.csv r00.csv
    • cd ${workdir}/data/reid_module/rhdf/r01/
    • aws s3 cp ${CROOT}/2010-re/hdf_b/rhdf_b.csv.zip .
    • unzip -j rhdf_b.csv.zip
    • ln -s rhdf_b.csv r01.csv
  7. Edit the configuration file to point to the working directory by modifying occurrences of <workdir>
  8. Run first stage of reidentification
    • cd ${workdir}/recon_replication/reidmodule/
    • setsid /usr/bin/python3 runreid.py 40 r00
  9. Change to directory for second stage of reidentification
    • cd ${workdir}/recon_replication/reidpaper/programs/
  10. Run second stage of reidentification
    • setsid stata-se -b runall.do
  11. Change to the results directory
    • cd ${workdir}/recon_replication/reidpaper/results/
  12. Numerical results from this module are not publicly shareable and will be located in:
    • ${workdir}/recon_replication/reidpaper/results/CBDRB-FY22-DSEP-004/CBDRB-FY22-DSEP-004.xlsx

Tabular output

  1. Change to the directory containing Stata code for tabular output
    • cd ${workdir}/recon_replication/results/
  2. Link the outputs from reidpaper_python into the in folder:
    • ln -s ${workdir}/recon_replication/reidpaper/results/CBDRB-FY22-DSEP-004/CBDRB-FY22-DSEP-004.xlsx in/CBDRB-FY22-DSEP-004.xlsx
  3. Run the table generation code
    • stata-se -b make_tables.do
  4. Change to output directory to view tabular results:
    • cd ${workdir}/recon_replication/results/out/

Suppression

  1. Change to the directory containing Python code for suppression results
    • cd ${workdir}/recon_replication/suppression/
  2. Process CEF data into the required format
    • python recode.py
  3. Create suppression output
    • python suppression.py > suppression_results.txt
  4. Suppression results can be found in:
    • ${workdir}/recon_replication/suppression/suppression_results.txt

Swapping [^4]

[^4]: Replication results for swapping are obtained by using the reconstructed swap files found in the dataset list.

  1. Change to the directory containing the swapping code
    • cd ${workdir}/recon_replication/reid_swap/
  2. Create swap pair lists
    • python pairs_driver.py
  3. Create swapped person file
    • python swap.py
  4. Swapped CEF person files for input into reconstruction can be found in:
    • ${workdir}/recon_replication/reid_swap/LO/swapped_us.csv
    • ${workdir}/recon_replication/reid_swap/HI/swapped_us.csv

Metrics

  1. Change to the directory containing the metrics code
    • cd ${workdir}/recon_replication/metrics/
  2. Create recoded CEF file needed for metrics
    • python recode.py --infile ${workdir}/data/reid_module/cef/cef.csv --outfile cef.csv --cef
  3. Run metrics for HI swap experiment
    • python metrics.py -c HIconfig.yml
  4. Run metrics for LO swap experiment
    • python metrics.py -c LOconfig.yml
  5. Create spreadsheet output for HI swap results
    • python tables.py -v r04 -r False
  6. Create spreadsheet output for LO swap results
    • python tables.py -v r05 -r False
  7. Metric results for the HI swap experiment will be in:
    • ${workdir}/recon_replication/metrics/output/r04/metrics_r04.xlsx
  8. Metric results for the LO swap experiment will be in:
    • ${workdir}/recon_replication/metrics/output/r05/metrics_r05.xlsx
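To inspect the two workbooks programmatically, a short pandas read is sufficient. The sketch below assumes pandas and openpyxl are installed and is run from the metrics directory; it makes no assumption about sheet names.

```python
# Optional sketch: load the metrics workbooks produced by tables.py for
# side-by-side inspection of the HI (r04) and LO (r05) swap experiments.
# Assumes pandas and openpyxl; run from the metrics directory.
import pandas as pd

hi = pd.read_excel("output/r04/metrics_r04.xlsx", sheet_name=None)
lo = pd.read_excel("output/r05/metrics_r05.xlsx", sheet_name=None)
print("HI sheets:", list(hi))
print("LO sheets:", list(lo))
```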

List of tables reproduced by or found in this replication package

This replication archive reproduces the tabular results listed in the [accompanying spreadsheet](<manuscript/hdsr/20231214-HDSR submission tables and figures.xlsx>).