Yes. Combine the education levels into two categories. Bachelor is a good threshold: Bachelor and above vs. below Bachelor. bachelorpl is already included and can be used directly.
emp_military_ta should be named emp_military_p_ta, since it is a percentage measure.
Yes, these are household income measures that use different income intervals.
Household growth measures: growth from the previous year to the current year, and expected growth in a five-year projection.
Concern with count variables: count measures will be highly correlated with population/density, so using them double counts those factors. If counts are used, they need to be integrated with density.
We should stick to the versions provided in the yaml file, for compatibility reasons
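The count-versus-percentage concern and the education grouping above can be illustrated with a minimal sketch. The columns `poi_count` and `population` are hypothetical names chosen for illustration, the values are made up, and treating `bachelorpl` as a percentage on a 0-100 scale is an assumption rather than something these notes confirm.

```python
import pandas as pd

# Toy data. poi_count and population are hypothetical column names;
# bachelorpl is assumed (not confirmed) to be a 0-100 percentage.
df = pd.DataFrame({
    "population": [1000, 5000, 20000],
    "poi_count": [10, 50, 200],
    "bachelorpl": [25.0, 40.0, 55.0],
})

# Normalize the count by population so it no longer scales with area size.
df["poi_per_1000"] = df["poi_count"] / df["population"] * 1000

# Two education categories: Bachelor and above vs. below Bachelor.
df["below_bachelor"] = 100.0 - df["bachelorpl"]

# In this toy data the raw count tracks population exactly, which is the
# double-counting concern; the normalized rate is the same in every row.
count_corr = df["poi_count"].corr(df["population"])
print(round(count_corr, 6))
print(df[["poi_per_1000", "below_bachelor"]])
```

With real data the correlation will not be exact, but a high value for a count column is a sign that it mostly restates population or density.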
Weekly progress tracker:
Proposal completed!
Data cleaning: count - percentage; sports venues; weather-related columns
Ensemble model: LR, RFC, etc.
Q: Feature engineering or model tuning at this stage for SK? A: Continue model tuning for now.
Q: Which ensemble model was used? A: The overlapping features from 4 feature selection models, plus a linear model, etc. Then rank the feature importances and take the top 20 features (positive and negative).
Q: How to debug? A: Split the code blocks into smaller chunks.
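The overlap-and-rank step described in the ensemble answer above could look roughly like the sketch below. The feature names, the four selected sets, and the signed importances are all made up for illustration, and reading "top 20 (positively and negatively)" as up to 20 features per sign is an assumption.

```python
# Hypothetical outputs of four feature selection methods (sets of names).
selected = [
    {"a", "b", "c", "d"},
    {"a", "b", "c", "e"},
    {"a", "b", "c", "d", "e"},
    {"a", "b", "c", "f"},
]

# Hypothetical signed importances, e.g. coefficients of a linear model.
coefs = {"a": 0.8, "b": -0.5, "c": 0.1, "d": 0.3, "e": -0.9, "f": 0.2}


def top_features(selected_sets, coefficients, k=20):
    """Keep features chosen by every selector, then rank them by sign."""
    overlap = set.intersection(*selected_sets)
    # Strongest positive effects first.
    positive = sorted(
        (f for f in overlap if coefficients[f] > 0),
        key=lambda f: coefficients[f],
        reverse=True,
    )
    # Strongest negative effects first (most negative coefficient).
    negative = sorted(
        (f for f in overlap if coefficients[f] < 0),
        key=lambda f: coefficients[f],
    )
    return positive[:k], negative[:k]


top_pos, top_neg = top_features(selected, coefs, k=20)
print(top_pos, top_neg)
```

Requiring a feature to appear in all four sets is the strictest reading of "overlapped"; a majority vote across selectors would be a looser alternative.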
Notes:
Data cleaning: get rid of -9999
Result with non-percentage variables (POIs and store only): the model seems to be overfitting a bit; max_depth needs to be tuned (we would like similar results from training and testing).
Parameters to check for the RFC model:
min_samples_leaf: default = 1, which is way too small. Start at around 10 at least and see how the results change as you increase it to maybe 50 (this can work for subways usa as the sample is large; use smaller values for the smaller models).
min_samples_split: default = 2, also too small. It should be larger than min_samples_leaf; check values around 20 and iterate up from there to see what works best.
The difference between train and test accuracy should be less than 5%; a stable model with lower accuracy is preferable to an unstable model with higher accuracy. The minimum acceptable accuracy is 60%.
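The parameter checks and the stability criterion above can be combined into a small sweep. This is only a sketch under stated assumptions: the data is synthetic (`make_classification` stands in for the real dataset), the rule `min_samples_split = 2 * min_samples_leaf` is one illustrative way to keep the split size above the leaf size, and the 5% limit is read as an absolute gap of 0.05 in accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption, for illustration only).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

MAX_GAP = 0.05       # train/test accuracy gap, read as absolute difference
MIN_ACCURACY = 0.60  # minimum acceptable test accuracy

results = []
for leaf in (10, 20, 50):
    split = 2 * leaf  # keeps min_samples_split larger than min_samples_leaf
    model = RandomForestClassifier(
        n_estimators=50,
        min_samples_leaf=leaf,
        min_samples_split=split,
        random_state=0,
    )
    model.fit(X_train, y_train)
    train_acc = float(model.score(X_train, y_train))
    test_acc = float(model.score(X_test, y_test))
    gap = train_acc - test_acc
    results.append({
        "min_samples_leaf": leaf,
        "min_samples_split": split,
        "train_acc": train_acc,
        "test_acc": test_acc,
        "gap": gap,
        "stable": bool(abs(gap) < MAX_GAP and test_acc > MIN_ACCURACY),
    })

for row in results:
    print(row)
```

Among the settings flagged as stable, the one with the highest test accuracy would be the natural candidate; max_depth can be added to the loop in the same way.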
We do have a typo on emp_military: it is actually a percentage variable. It's calculated by dividing by empcy (total employees in the current year).
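The two cleaning notes above (removing -9999 and confirming that emp_military is a percentage) can be sketched as follows. Treating -9999 as a missing-value sentinel and replacing it with NaN is an assumption, since the notes only say to get rid of it; the sample values are made up, and the 0-100 range check assumes the percentage is stored on that scale.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; column names follow the notes, values are made up.
df = pd.DataFrame({
    "emp_military_ta": [1.2, -9999, 0.4],
    "hhgrpycy_ta": [0.5, 1.1, -9999],
})

# Treat -9999 as a missing-value sentinel (assumption) and mark it as NaN.
cleaned = df.replace(-9999, np.nan)

# Count how many values were flagged in each column.
missing_per_column = cleaned.isna().sum()

# A percentage variable should fall within 0-100 once sentinels are gone.
in_range = cleaned["emp_military_ta"].dropna().between(0, 100).all()

print(missing_per_column)
print(in_range)
```

Whether the flagged rows are then dropped or imputed depends on how many there are in each column.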
Moderator: Morris
Notetaker: Xinru
Weekly check-in: Project Board
Questions:
Can we group the education levels into two categories (e.g. below and above a threshold)?
emp_military_ta: is this feature a percentage instead of a count?
Are these features related, just using different measurements?
Can you explain hhgrpycy_ta and hhgrfypy_ta?
Should we use percentage or count?
Which pandas version should we use?