[ ] The purpose of this exercise is to provide actionable intelligence that enables a person to make a "data-driven" decision, not to be an absolute, authoritative measure of a quantity. The measure of "good enough" is making a testable prediction and then testing it (i.e., hypothesis testing, "A/B testing").
[ ] """
As noted earlier, to ensure the confidentiality of ACS respondents, PUMS files present data for a much more limited set of geographic areas than the pretabulated ACS products. PUMS files cannot be used to summarize data for individual counties, cities, or other small areas. It is possible to summarize data for the nation, each of the states, the District of Columbia, Puerto Rico, and areas known as Public Use Microdata Areas (PUMAs).
As part of Census 2000, PUMAs were defined as areas with 100,000 residents or more based on the populations reported in Census 2000. The ACS uses these same PUMAs.
"""
[ ] "To show where a particular PUMA is located, the
Census Bureau has provided both maps and geographic
equivalency files."
[ ] The Census 2000 PUMAs will be used until the
new PUMAs are delineated using 2010 Census counts.
[ ] """ A separate geographic equivalency
fi le exists for each state. They are found in the
main PUMS link from the Census Bureau Web site:
http://www.census.gov/main/www/pums.html
For instance, using the New York equivalency fi le and
sorting the data by summary level code, one can see
which census tracts are grouped together in a PUMA or
which PUMAs compose New York County [Manhattan],
(03801-03810). Geographic Equivalency Files can be
accessed at the following FTP site: http://www2.census.gov/census_2000/datasets/PUMS/FivePercent/
The Missouri State Data Center created a tool that
allows PUMA users to enter the geography that they
are interested in to identify PUMA codes and
equivalent geographies. For more information, go to
http://mcdc2.missouri.edu/websas/geocorr2k.html
"""
[x] Smallest geographic area identifiable in PUMS is the PUMA; per the excerpt above, PUMS cannot summarize smaller areas such as census block groups or tracts.
[x] Data from the Census Bureau balances representing areas and representing populations while also protecting respondents' privacy. Each record in the PUMS data is tagged only with its PUMA, a group of census blocks that represents an area of about 100,000 people (and weighting factors?). Also, block groups with sparser populations are not included every year and are only represented in the multi-year data sets.
[x] Under "Creating PUMS Tabulations"
[x] """For the 1-year PUMS the infl ation
adjustment variable (ADJUST) should be used to
produce income characteristics. The 3-year and 5-year
PUMS will carry two infl ation adjustment variables—
ADJINC for income applications and ADJHOUS for housing
cost applications. For additional information about
dollar-denominated data in the ACS, refer to Appendix 5."""
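A minimal sketch of applying the adjustment with pandas (assumption on my part: the adjustment factor is stored as an integer with six implied decimal places, as in recent ACS PUMS data dictionaries; verify against the dictionary for your PUMS vintage):
```python
import pandas as pd

# Hypothetical person-level PUMS extract. Column names follow the ACS PUMS
# data dictionary (PINCP = total person income, ADJINC = inflation
# adjustment factor); the values are made up for illustration.
pums = pd.DataFrame({"PINCP": [32000, 54000, 120000],
                     "ADJINC": [1042852, 1042852, 1042852]})

# Assumption: ADJINC carries six implied decimal places, so divide by 1e6
# before multiplying it into the income column.
pums["PINCP_adj"] = pums["PINCP"] * pums["ADJINC"] / 1_000_000
print(pums)
```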
[ ] Try:
* [ ] Use confidence intervals derived from numerical sampling instead of Z-scores. Numerical sampling is more extensible: it works for statistics that have no closed-form error formula.
* [ ] I choose not to weight by the coefficient of variation (weight = 1/(coef_of_var)**2) but instead to incorporate the uncertainty through numerical simulation, sampling the distributions to infer statistical significance (see Jake VanderPlas's "Statistics for Hackers": https://speakerdeck.com/jakevdp/statistics-for-hackers). A bootstrap sketch follows this list.
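A minimal bootstrap sketch of the numerical-sampling approach (assumptions: NumPy, and a made-up income array standing in for the PUMS household income column):
```python
import numpy as np

def bootstrap_ci(data, stat=np.median, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for an arbitrary statistic."""
    rng = np.random.default_rng(seed)
    # Resample with replacement and recompute the statistic each time.
    boot = np.array([stat(rng.choice(data, size=len(data), replace=True))
                     for _ in range(n_boot)])
    return np.quantile(boot, [alpha / 2, 1 - alpha / 2])

# Made-up, right-skewed incomes standing in for real PUMS data.
incomes = np.random.default_rng(42).lognormal(mean=11, sigma=0.8, size=500)
print(bootstrap_ci(incomes))  # e.g. a 95% CI for the median income
```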
[ ] Use the F1 score for classification and R² for regression; compute the probability of belonging to a class with logistic regression and/or by calibrating the predicted probabilities (sketch below).
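A minimal scikit-learn sketch of both points (assumption: synthetic data standing in for the PUMS feature matrix):
```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the PUMS features and a binary target.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Logistic regression outputs class probabilities directly.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("F1:", f1_score(y_te, clf.predict(X_te)))

# Any classifier's scores can instead be calibrated into probabilities
# (method="sigmoid" is Platt scaling; isotonic regression also works).
calibrated = CalibratedClassifierCV(LogisticRegression(max_iter=1000),
                                    method="isotonic", cv=5).fit(X_tr, y_tr)
print("P(class=1):", calibrated.predict_proba(X_te)[:3, 1])
```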
[ ] Test the distribution of the test (evaluation) set and compare it to a multinomial distribution (https://en.wikipedia.org/wiki/Multinomial_distribution); see the sketch below. Link to Jake's "Statistics for Hackers" in the footnotes and in "Helpful links".
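One concrete version of this check (my assumption about the intent: a chi-square goodness-of-fit test of the test-set class counts against the multinomial expectation implied by the training proportions):
```python
import numpy as np
from scipy.stats import chisquare

# Made-up class labels standing in for discretized income classes.
rng = np.random.default_rng(0)
train_labels = rng.integers(0, 5, size=4000)
test_labels = rng.integers(0, 5, size=1000)

# Expected test-set counts under a multinomial with training proportions.
train_props = np.bincount(train_labels, minlength=5) / len(train_labels)
observed = np.bincount(test_labels, minlength=5)
expected = train_props * len(test_labels)

# Small p-value -> the test-set class mix is unlikely under that multinomial.
stat, p = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2={stat:.2f}, p={p:.3f}")
```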
[ ] Because I'm predicting a quantity, I could use linear discriminant analysis (LDA) to create features that optimize separation of that quantity, rather than PCA. This would require discretizing the continuous household income variable; a sketch follows. Example: http://www-users.cs.umn.edu/~ludford/Stat_Guide/Linear_Discriminant_Analysis.htm
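A minimal sketch of the discretize-then-LDA idea (assumptions: scikit-learn and pandas, with synthetic data in place of the real features):
```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))                        # stand-in features
income = rng.lognormal(mean=11, sigma=0.8, size=1000)  # stand-in target

# Discretize continuous income into quintiles so LDA has class labels.
income_class = pd.qcut(income, q=5, labels=False)

# LDA yields at most (n_classes - 1) discriminant components.
lda = LinearDiscriminantAnalysis(n_components=4)
X_lda = lda.fit_transform(X, income_class)
print(X_lda.shape)  # (1000, 4)
```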
[ ] It's important to note that the distribution of your data may invalidate certain models. Many models (e.g., simple linear regression) assume a normally distributed dependent variable (strictly, normally distributed residuals), which a heavily skewed variable such as income can violate.
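A quick way to check that assumption (a sketch using SciPy's D'Agostino normality test on made-up, log-normal incomes):
```python
import numpy as np
from scipy.stats import normaltest

# Household income is typically right-skewed (roughly log-normal).
income = np.random.default_rng(0).lognormal(mean=11, sigma=0.8, size=1000)

# Small p-value -> reject normality; the log transform helps here.
for name, values in [("raw", income), ("log", np.log(income))]:
    stat, p = normaltest(values)
    print(f"{name}: p = {p:.3g}")
```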
[ ] A useful collection of best practices and rules of thumb for classical statistical methods, with references. It mentions bootstrap sampling and briefly covers Bayesian methods, but does not cover cross-validation or permutation tests. The code examples are dated, using R and Dataplot (a custom program written by NIST staff in Fortran and C).
[ ] Most useful are Section 1 (Exploratory Data Analysis), the case studies, and TODO
[ ] Explanation of linear least squares from an application point of view.
[ ] The appropriate loss function depends on the question you want to answer, the values you deem most important to your application, your model, and the distribution of your data. Example: consider the precision of a location estimator (mean, median, mode) for different distributions (http://www.itl.nist.gov/div898/handbook/eda/section3/eda351.htm); a simulation sketch follows.
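A small simulation of that point (assumption: NumPy; the mean is the more precise location estimator for normal data, while the median wins for a heavy-tailed distribution like the Cauchy):
```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 2000

# Compare the spread of the sample mean vs. the sample median as location
# estimators under two different underlying distributions.
for name, sampler in [("normal", rng.standard_normal),
                      ("cauchy", rng.standard_cauchy)]:
    samples = sampler((trials, n))
    print(f"{name}: std(mean) = {samples.mean(axis=1).std():.3f}, "
          f"std(median) = {np.median(samples, axis=1).std():.3f}")
```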
[x] Reducing the number of features helps the model be "parsimonious" and comply with Occam's razor.
[ ] You can also opt for the fewest parameters on the principle of Occam's razor (the "law of parsimony"). However, we are not trying to build an econometric theory of why some people have lower household incomes; we are only trying to predict what those incomes are.
[ ] See "Learning Spark" for setting up the cluster; "Advanced Analytics with Spark" for also setting up the cluster and Ch11 for pyspark with ipynbs
[ ] Motivation for reducing features: http://spark.apache.org/docs/latest/mllib-decision-tree.html "Computation scales approximately linearly in the number of training instances, in the number of features, and in the maxBins parameter. Communication scales approximately linearly in the number of features and in maxBins."
Useful sklearn sections:
Helpful links:
Other notes:
Questions:
About data:
About model:
https://www.quora.com/What-is-an-intuitive-explanation-of-the-relation-between-PCA-and-SVD
http://www3.cs.stonybrook.edu/~sael/teaching/cse549/Slides/CSE549_16.pdf
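A minimal numerical check of the PCA/SVD relation described in those links (assumptions: NumPy and scikit-learn; the principal axes are the right singular vectors of the mean-centered data matrix, up to sign):
```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Xc = X - X.mean(axis=0)          # PCA operates on mean-centered data

# SVD: Xc = U @ diag(S) @ Vt; the rows of Vt are the principal axes.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pca = PCA().fit(X)

print(np.allclose(np.abs(pca.components_), np.abs(Vt)))           # True
print(np.allclose(pca.explained_variance_, S**2 / (len(X) - 1)))  # True
```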
About presentation:
About motivation:
Embedding google maps:
APIs:
Intro with ML model: https://www.youtube.com/watch?v=s-i6nzXQF3g
PyCon 2014: https://www.youtube.com/watch?v=px_vg9Far1Y
Spark: