[x] Under why census data?:
The census is a relatable dataset. Everyday people can understand census data because it's about them.
The census is an applicable dataset. The census has many use-cases from local and national government to market validation. (link to use-cases)
The census is a cultivated dataset. The census bureau goes to great lengths to ensure that the census is representative of the people it samples.
The census is a large dataset. It has many features, so it can be used to answer many questions. (footnote: I avoid the term "big data" here because, for this blog post, the data and all operations on it fit within RAM.)
The census is a time series dataset. As a time series, the data can be used to predict current and future quantities.
With the above points in mind, I recognize that using census data can be problematic. Census data has been used to harm people (for examples of privacy concerns, see the ACLU's FAQ section about census data: https://www.aclu.org/frequently-asked-questions-national-census), and census data categorizes people by loaded terms, such as "race" (for reasons why "race" is a problematic term, see the Wikipedia article "Race (human categorization)": https://en.wikipedia.org/wiki/Race_(human_categorization)).
For the sake of this exercise, I take the census data at face value. It's important to recognize that these aren't simply statistics; these are people.
[x] From handbook App 1, use the dataset centered on each year with the largest window. This includes as many blocks as possible while keeping the data as recent as possible. Note that 1-year data is included in the 3-year estimate, which is included in the 5-year estimate. The data for each block also has its own confidence interval.
Give example following Handbook Appendix 1
[x] Point-in-time estimates need to have a shared baseline (e.g. 5-year)
[x] Include CPS data for income (from handbook)
[x] What is the American Community Survey? (link to main page, and kaggle competition)
from handbooks """
The American Community Survey (ACS) is the new
source for the information previously collected through
the decennial census long form. This information
includes topics such as income, employment status,
housing costs, and housing conditions. Unlike the
decennial census, ACS data are collected on a continuous
basis.
"""
About 250k addresses each month (roughly 3 million per year) are asked to participate.
[x] """The advantage of going through
the ACS Web site is that the PUMS user verification files
are listed. User verification files provide estimates for
selected housing and population characteristics to help
data users determine that they are using the weights
correctly. """
[x] Download the data dictionary to understand the columns.
[x] The handbooks include statistical formulas for calculating standard error, margin of error and confidence intervals.
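A minimal sketch of those formulas, assuming the published margins of error are at the 90 percent confidence level (z = 1.645) and using the usual approximation for the margin of error of a sum of estimates. The function names and the numbers in the usage note are made up for illustration; they are not from the handbooks.

```python
import math

# z-score for the 90 percent confidence level (assumed level of published ACS margins of error).
Z_90 = 1.645


def standard_error(moe, z=Z_90):
    """Convert a published margin of error back to a standard error."""
    return moe / z


def margin_of_error(se, z=Z_90):
    """Convert a standard error to a margin of error at the given z-score."""
    return z * se


def confidence_interval(estimate, moe):
    """Return (lower, upper) bounds: estimate minus/plus the margin of error."""
    return (estimate - moe, estimate + moe)


def moe_of_sum(moes):
    """Approximate margin of error of a sum of estimates:
    the square root of the sum of the squared margins of error."""
    return math.sqrt(sum(m * m for m in moes))
```

For example, a hypothetical estimate of 1000 with a margin of error of 329 gives `confidence_interval(1000, 329)` as `(671, 1329)` and a standard error of about 200.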
[x] Use the replicate weights method for errors, since it doesn't rely on the distribution type.
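A rough sketch of the replicate weights idea, assuming the PUMS convention of one full-sample weight plus 80 replicate weights per record (PWGTP and PWGTP1 to PWGTP80 for persons, as I understand the data dictionary). The estimate is recomputed once per replicate weight, and the variance is 4/80 times the sum of squared differences from the full-sample estimate. The helper names are hypothetical.

```python
import math


def weighted_total(values, weights):
    """Weighted sum of a column, e.g. a 0/1 indicator times person weights."""
    return sum(v * w for v, w in zip(values, weights))


def replicate_standard_error(full_estimate, replicate_estimates):
    """Standard error from replicate estimates.

    variance = (4 / R) * sum((replicate - full)^2), with R = 80 for ACS PUMS (assumed).
    """
    n_reps = len(replicate_estimates)
    if n_reps == 0:
        raise ValueError("need at least one replicate estimate")
    squared_diffs = sum((r - full_estimate) ** 2 for r in replicate_estimates)
    return math.sqrt(4.0 / n_reps * squared_diffs)


def estimate_with_se(values, full_weights, replicate_weight_columns):
    """Return (estimate, standard error) for a weighted total.

    replicate_weight_columns is a list of weight columns (80 for PUMS),
    each the same length as values.
    """
    full = weighted_total(values, full_weights)
    reps = [weighted_total(values, w) for w in replicate_weight_columns]
    return full, replicate_standard_error(full, reps)
```

The same pattern applies to any statistic, not only totals: compute it with the full weight, recompute it with each replicate weight, then pass the results to `replicate_standard_error`.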
[x] Differentiate between Census Bureau data and census data
[x] Much of the information in the handbooks is also replicated throughout the site.
[x] The handbook is from 2008, but only some of the links are out-of-date. The information is very thorough, with many examples and use-cases.
[x] The margin of error calculations assume that the data is normally distributed, which often isn't the case for real-world data (an empirical, rank-based test for normally distributed data is z1,z2, astroML link).
[x] Consult "ACS Quality Measures" for nonsample error.
[x] DataProc as a managed Hadoop and Spark service.
[x] Use DataLab when deploying analytics.
[x] Access the AppEngine application that runs the DataLab from datalab.cloud.google.com and Developers Console > Products & services > App Engine > Versions > click the version to access (e.g. "main (default)")
[x] Both BigQuery and Cloud Storage integrate with Google Prediction API.
[x] Compare to Google Prediction API.
[x] Use BigQuery when deploying to app. Use Postgres for individual development.
[x] As part of setup, ensure that the instance has "Read Only" access to "Storage": when creating a new instance, under "Project Access" > "Access and security"
why acs:
About ACS:
ETL: