Data sets

We will eventually add a "Data sets" section to the intro chapter so that we don't have to introduce data sets (e.g., those used in examples) throughout the book. New data sets that only appear in case studies, exercises, and the like don't need to be introduced there.

Binary outcomes:

GSS data
- https://gss.norc.org/
- Used in Agresti's CDA books
- Probability samples, so inference is valid (Consumer Research survey sample would be better application-wise)
Default of credit card clients (UCI)
- https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients
- Good intro example using pay_0 as single feature (shows good probability spread from 0-1 across predictor space)
- Patrick Hall (formerly at H2O) has lots of examples with these data, for example: https://github.com/jphall663/GWU_rml
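
A minimal sketch of the single-feature idea above: fit a logistic regression with one predictor and look at how the fitted probabilities spread across its values. The x and y values below are invented toy numbers standing in for a pay_0-like feature and a default indicator; they are not taken from the UCI data. The helper names fit_logistic_1d and predict are hypothetical, and the fit uses plain Newton-Raphson rather than any particular library.

```python
import math


def fit_logistic_1d(x, y, n_iter=25, tol=1e-10):
    """Fit P(y = 1 | x) = 1 / (1 + exp(-(b0 + b1 * x))) by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        # Accumulate the gradient (g) and the information matrix (h)
        # of the Bernoulli log-likelihood at the current estimates.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        # Newton step: solve the 2x2 system h * delta = g in closed form.
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


def predict(b0, b1, xi):
    """Fitted probability at a single feature value."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))


# Invented toy data (NOT the UCI data): a small integer-valued feature
# and a binary outcome whose rate rises with the feature.
x = [-1, -1, -1, -1, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
y = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0]

b0, b1 = fit_logistic_1d(x, y)
for value in sorted(set(x)):
    print(value, round(predict(b0, b1, value), 3))
```

With the real data, the same loop over the unique values of the feature would show how far the fitted probabilities range between 0 and 1.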
Credit default data
- https://cran.rstudio.com/web/packages/ISLR2/ISLR2.pdf
The Insurance Company (TIC) Benchmark
- http://www.liacs.nl/~putten/library/cc2000/data.html
- Emphasis on explainability (see associated papers and competition questions)
Employee attrition
- https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset
Hotel bookings (could make a good case study or be used in the viz chapter)
- https://www.sciencedirect.com/science/article/pii/S2352340918315191
- Great candidate for visualization chapter. Can aggregate outcome to estimate/plot proportions within unique values of other features (e.g., lead_time and deposit_type).
- Counterintuitive effect of deposit_type (try it with a simple logistic regression model), which could make for a good discussion about the role of subject-matter experts (SMEs) in identifying potential issues with the ETL process. Some discussion on Kaggle at https://www.kaggle.com/code/marcuswingen/eda-of-bookings-and-ml-to-predict-cancelations.
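
The aggregation mentioned above (outcome proportions within each unique value of a feature) could be sketched as follows. This is only an illustration: proportion_by is a hypothetical helper, the rows are invented toy records rather than real bookings, and the column names deposit_type and is_canceled and the category labels are assumptions about how the data are coded.

```python
from collections import defaultdict


def proportion_by(rows, feature, outcome):
    """Proportion of positive (1) outcomes within each unique feature value."""
    tallies = defaultdict(lambda: [0, 0])  # value -> [n_positive, n_total]
    for row in rows:
        tally = tallies[row[feature]]
        tally[0] += row[outcome]
        tally[1] += 1
    return {value: pos / total for value, (pos, total) in tallies.items()}


# Invented toy records (NOT real bookings); column names are assumed.
toy_rows = [
    {"deposit_type": "No Deposit", "is_canceled": 0},
    {"deposit_type": "No Deposit", "is_canceled": 0},
    {"deposit_type": "No Deposit", "is_canceled": 1},
    {"deposit_type": "No Deposit", "is_canceled": 0},
    {"deposit_type": "Refundable", "is_canceled": 0},
    {"deposit_type": "Refundable", "is_canceled": 1},
]

proportions = proportion_by(toy_rows, "deposit_type", "is_canceled")
for value, prop in proportions.items():
    print(value, prop)
```

The resulting dictionary can be plotted directly (e.g., as a bar chart for deposit_type or as points against lead_time).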
Blood transfusion (direct marketing example)
- https://archive.ics.uci.edu/ml/datasets/Blood+Transfusion+Service+Center
- See resources/articles/yeh-2009-uciblood.pdf
In-vehicle coupon recommendation
- https://archive.ics.uci.edu/ml/datasets/in-vehicle+coupon+recommendation
- See "A Bayesian Framework for Learning Rule Sets for Interpretable Classification"

Multinomial (i.e., polytomous) outcomes:

GSS data

Ordinal outcomes:

GSS data
Car evaluation data set from UCI
- https://archive.ics.uci.edu/ml/datasets/Car+Evaluation

Counts:

Bike sharing (often treated as continuous)
- https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset

Inference:

The 2021 GSS data are based on a probability sample and could be used for analyzing contingency tables.
Employee attrition data seem hard to come by, so we could treat the IBM HR attrition data as a sample (or take a sample thereof) for analyses (e.g., ordinal association).
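
As one possible example of a contingency table analysis, here is a minimal sketch of Pearson's chi-square statistic for independence in a two-way table. The function name chi_square_stat is hypothetical and the counts are invented toy numbers, not GSS or attrition results.

```python
def chi_square_stat(table):
    """Pearson chi-square statistic and degrees of freedom for a two-way table.

    `table` is a list of rows of observed counts, all rows the same length.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df


# Invented 2x2 table of counts (NOT real survey results).
toy_table = [[10, 20], [20, 10]]
stat, df = chi_square_stat(toy_table)
print(stat, df)
```

The statistic would then be compared with a chi-square distribution on df degrees of freedom. For ordinal variables, a measure that uses the ordering would be a more natural choice than this general test.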

Missing values:

The 2021 GSS data contain lots of missing values.
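
A quick way to size up missingness is to count missing entries per column. The sketch below assumes missing values show up as None or an empty string, which may not match how the GSS files encode them. The helper missing_counts, the toy records, and their column names are all hypothetical.

```python
def missing_counts(rows, missing=(None, "")):
    """Count entries per column whose value is one of the `missing` markers."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            # A bool adds as 0 or 1, so every column gets an entry.
            counts[column] = counts.get(column, 0) + (value in missing)
    return counts


# Invented toy records (NOT GSS data); column names are placeholders.
toy_rows = [
    {"age": 34, "degree": "bachelor"},
    {"age": None, "degree": ""},
    {"age": 51, "degree": None},
]

counts = missing_counts(toy_rows)
print(counts)
```

Passing a different `missing` tuple adapts the count to whatever codes the data actually use.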

koalaverse / homlr-2ed

Data sets #8