[ ] The purpose of this exercise is to provide actionable intelligence that enables a person to make a "data-driven" decision, not to be an absolute, authoritative measure of a quantity. The measure of "good enough" is making a testable prediction and then testing it (i.e., hypothesis testing, "A/B testing").
[ ] """
As noted earlier, to ensure the confidentiality of ACS respondents, PUMS files present data for a much more limited set of geographic areas than the pretabulated ACS products. PUMS files cannot be used to summarize data for individual counties, cities, or other small areas. It is possible to summarize data for the nation, each of the states, the District of Columbia, Puerto Rico, and areas known as Public Use Microdata Areas (PUMAs).
As part of Census 2000, PUMAs were defined as areas with 100,000 residents or more based on the populations reported in Census 2000. The ACS uses these same PUMAs.
"""
[ ] "To show where a particular PUMA is located, the
Census Bureau has provided both maps and geographic
equivalency files."
[ ] The Census 2000 PUMAs will be used until the
new PUMAs are delineated using 2010 Census counts.
[ ] """ A separate geographic equivalency
fi le exists for each state. They are found in the
main PUMS link from the Census Bureau Web site:
http://www.census.gov/main/www/pums.html
For instance, using the New York equivalency fi le and
sorting the data by summary level code, one can see
which census tracts are grouped together in a PUMA or
which PUMAs compose New York County [Manhattan],
(03801-03810). Geographic Equivalency Files can be
accessed at the following FTP site: http://www2.census.gov/census_2000/datasets/PUMS/FivePercent/
The Missouri State Data Center created a tool that
allows PUMA users to enter the geography that they
are interested in to identify PUMA codes and
equivalent geographies. For more information, go to
http://mcdc2.missouri.edu/websas/geocorr2k.html
"""
[x] Smallest geographic area identifiable in PUMS is the PUMA; per the excerpt above, PUMS cannot summarize smaller areas such as census block groups or tracts.
[x] Data from the Census Bureau balances representing areas and representing populations while also protecting respondents' privacy. Each record in the PUMS data is tagged only with its PUMA, a group of census blocks that represents an area of about 100,000 people (and weighting factors?). Also, block groups with sparser populations are not included every year and are only represented in the multi-year data sets.
[x] Under "Creating PUMS Tabulations"
[x] """For the 1-year PUMS the infl ation
adjustment variable (ADJUST) should be used to
produce income characteristics. The 3-year and 5-year
PUMS will carry two infl ation adjustment variables—
ADJINC for income applications and ADJHOUS for housing
cost applications. For additional information about
dollar-denominated data in the ACS, refer to Appendix 5."""
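A minimal sketch of applying the adjustment with pandas (assumption on my part: the adjustment factor is stored as an integer with six implied decimal places, as in recent ACS PUMS data dictionaries; verify against the dictionary for your PUMS vintage):
```python
import pandas as pd

# Hypothetical person-level PUMS extract. Column names follow the ACS PUMS
# data dictionary (PINCP = total person income, ADJINC = inflation
# adjustment factor); the values are made up for illustration.
pums = pd.DataFrame({"PINCP": [32000, 54000, 120000],
                     "ADJINC": [1042852, 1042852, 1042852]})

# Assumption: ADJINC carries six implied decimal places, so divide by 1e6
# before multiplying it into the income column.
pums["PINCP_adj"] = pums["PINCP"] * pums["ADJINC"] / 1_000_000
print(pums)
```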
[ ] Try:
* [ ] Use confidence intervals derived from numerical sampling instead of Z-scores. Numerical sampling is more extensible: it works for statistics that have no closed-form error formula.
* [ ] I choose not to weight by the coefficient of variation (weight = 1/(coef_of_var)**2) but instead to incorporate the uncertainty through numerical simulation, sampling the distributions to infer statistical significance (see Jake VanderPlas's "Statistics for Hackers": https://speakerdeck.com/jakevdp/statistics-for-hackers). A bootstrap sketch follows this list.
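A minimal bootstrap sketch of the numerical-sampling approach (assumptions: NumPy, and a made-up income array standing in for the PUMS household income column):
```python
import numpy as np

def bootstrap_ci(data, stat=np.median, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for an arbitrary statistic."""
    rng = np.random.default_rng(seed)
    # Resample with replacement and recompute the statistic each time.
    boot = np.array([stat(rng.choice(data, size=len(data), replace=True))
                     for _ in range(n_boot)])
    return np.quantile(boot, [alpha / 2, 1 - alpha / 2])

# Made-up, right-skewed incomes standing in for real PUMS data.
incomes = np.random.default_rng(42).lognormal(mean=11, sigma=0.8, size=500)
print(bootstrap_ci(incomes))  # e.g. a 95% CI for the median income
```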
[ ] Use the F1 score for classification and R² for regression; compute the probability of belonging to a class with logistic regression and/or by calibrating the predicted probabilities (sketch below).
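A minimal scikit-learn sketch of both points (assumption: synthetic data standing in for the PUMS feature matrix):
```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the PUMS features and a binary target.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Logistic regression outputs class probabilities directly.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("F1:", f1_score(y_te, clf.predict(X_te)))

# Any classifier's scores can instead be calibrated into probabilities
# (method="sigmoid" is Platt scaling; isotonic regression also works).
calibrated = CalibratedClassifierCV(LogisticRegression(max_iter=1000),
                                    method="isotonic", cv=5).fit(X_tr, y_tr)
print("P(class=1):", calibrated.predict_proba(X_te)[:3, 1])
```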
[ ] Test the distribution of the test (evaluation) set and compare it to a multinomial distribution (https://en.wikipedia.org/wiki/Multinomial_distribution); see the sketch below. Link to Jake's "Statistics for Hackers" in the footnotes and in "Helpful links".
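One concrete version of this check (my assumption about the intent: a chi-square goodness-of-fit test of the test-set class counts against the multinomial expectation implied by the training proportions):
```python
import numpy as np
from scipy.stats import chisquare

# Made-up class labels standing in for discretized income classes.
rng = np.random.default_rng(0)
train_labels = rng.integers(0, 5, size=4000)
test_labels = rng.integers(0, 5, size=1000)

# Expected test-set counts under a multinomial with training proportions.
train_props = np.bincount(train_labels, minlength=5) / len(train_labels)
observed = np.bincount(test_labels, minlength=5)
expected = train_props * len(test_labels)

# Small p-value -> the test-set class mix is unlikely under that multinomial.
stat, p = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2={stat:.2f}, p={p:.3f}")
```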
[ ] Because I'm predicting a quantity, I could use linear discriminant analysis (LDA) to create features that optimize separation of that quantity, rather than PCA. This would require discretizing the continuous household income variable; a sketch follows. Example: http://www-users.cs.umn.edu/~ludford/Stat_Guide/Linear_Discriminant_Analysis.htm
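A minimal sketch of the discretize-then-LDA idea (assumptions: scikit-learn and pandas, with synthetic data in place of the real features):
```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))                        # stand-in features
income = rng.lognormal(mean=11, sigma=0.8, size=1000)  # stand-in target

# Discretize continuous income into quintiles so LDA has class labels.
income_class = pd.qcut(income, q=5, labels=False)

# LDA yields at most (n_classes - 1) discriminant components.
lda = LinearDiscriminantAnalysis(n_components=4)
X_lda = lda.fit_transform(X, income_class)
print(X_lda.shape)  # (1000, 4)
```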
[ ] It's important to note that the distribution of your data may invalidate certain models. Many models (e.g., simple linear regression) assume a normally distributed dependent variable (strictly, normally distributed residuals), which a heavily skewed variable such as income can violate.
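A quick way to check that assumption (a sketch using SciPy's D'Agostino normality test on made-up, log-normal incomes):
```python
import numpy as np
from scipy.stats import normaltest

# Household income is typically right-skewed (roughly log-normal).
income = np.random.default_rng(0).lognormal(mean=11, sigma=0.8, size=1000)

# Small p-value -> reject normality; the log transform helps here.
for name, values in [("raw", income), ("log", np.log(income))]:
    stat, p = normaltest(values)
    print(f"{name}: p = {p:.3g}")
```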
[ ] A useful collection of best practices and rules of thumb for classical statistical methods, with references. It mentions bootstrap sampling and briefly covers Bayesian methods, but does not cover cross-validation or permutation tests. The code examples are dated, using R and Dataplot (a custom program written by NIST staff in Fortran and C).
[ ] Most useful are Section 1 (Exploratory Data Analysis), the case studies, and TODO
[ ] Explanation of linear least squares from an application point of view.
[ ] The appropriate loss function depends on the question you want to answer, the values you deem most important to your application, your model, and the distribution of your data. Example: consider the precision of a location estimator (mean, median, mode) for different distributions (http://www.itl.nist.gov/div898/handbook/eda/section3/eda351.htm); a simulation sketch follows.
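A small simulation of that point (assumption: NumPy; the mean is the more precise location estimator for normal data, while the median wins for a heavy-tailed distribution like the Cauchy):
```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 2000

# Compare the spread of the sample mean vs. the sample median as location
# estimators under two different underlying distributions.
for name, sampler in [("normal", rng.standard_normal),
                      ("cauchy", rng.standard_cauchy)]:
    samples = sampler((trials, n))
    print(f"{name}: std(mean) = {samples.mean(axis=1).std():.3f}, "
          f"std(median) = {np.median(samples, axis=1).std():.3f}")
```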
[x] Reducing the number of features helps the model be "parsimonious" and comply with Occam's razor.
[ ] You can also opt for the fewest parameters on the principle of Occam's razor (the "law of parsimony"). However, we are not trying to build an econometric theory of why some people have lower household incomes; we are only trying to predict what those incomes are.
[ ] See "Learning Spark" for setting up the cluster; "Advanced Analytics with Spark" for also setting up the cluster and Ch11 for pyspark with ipynbs
[ ] Motivation for reducing features: http://spark.apache.org/docs/latest/mllib-decision-tree.html "Computation scales approximately linearly in the number of training instances, in the number of features, and in the maxBins parameter. Communication scales approximately linearly in the number of features and in maxBins."
Useful sklearn sections:
Helpful links:
Other notes:
Questions:
About data:
About model:
https://www.quora.com/What-is-an-intuitive-explanation-of-the-relation-between-PCA-and-SVD
http://www3.cs.stonybrook.edu/~sael/teaching/cse549/Slides/CSE549_16.pdf
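A minimal numerical check of the PCA/SVD relation described in those links (assumptions: NumPy and scikit-learn; the principal axes are the right singular vectors of the mean-centered data matrix, up to sign):
```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Xc = X - X.mean(axis=0)          # PCA operates on mean-centered data

# SVD: Xc = U @ diag(S) @ Vt; the rows of Vt are the principal axes.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pca = PCA().fit(X)

print(np.allclose(np.abs(pca.components_), np.abs(Vt)))           # True
print(np.allclose(pca.explained_variance_, S**2 / (len(X) - 1)))  # True
```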
About presentation:
About motivation:
Embedding google maps:
APIs:
Intro with ML model: https://www.youtube.com/watch?v=s-i6nzXQF3g
PyCon 2014: https://www.youtube.com/watch?v=px_vg9Far1Y
Spark: