2023-05-17 Official Partner Meeting Agenda

mozhao0331 commented 1 year ago

Moderator: Morris
Notetaker: Xinru

Weekly check-in: Project Board

Questions:

Can we group the education levels into two categories (e.g. below and above a threshold)?
emp_military_ta: is this feature a percentage rather than a count?
Are these features related, just using different measurements?
- hh_inc_gt_500k_p_ta
- hh_inc_gt_75k_p_ta
- hh_inc_lt_75k_p_ta (correlation with hh_inc_gt_75k_p_ta = -1)
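
A correlation of exactly -1 is what two percentage features produce when they are complements of each other, i.e. when they always sum to 100. The sketch below illustrates this with made-up numbers; the values are hypothetical, and whether the real hh_inc_gt_75k_p_ta and hh_inc_lt_75k_p_ta columns are exact complements is an assumption that should be checked against the data.

```python
import numpy as np

# Hypothetical shares of households above 75k in four trade areas (percent).
gt_75k = np.array([10.0, 25.0, 40.0, 55.0])

# Assumption: the "below 75k" share is the exact complement of the "above 75k" share.
lt_75k = 100.0 - gt_75k

# Pearson correlation between the two columns.
corr = np.corrcoef(gt_75k, lt_75k)[0, 1]
print(round(corr, 6))
```

If the real columns behave this way, keeping both adds no information to a model, so one of the pair could be dropped.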

Lorraine97 commented 1 year ago

Meeting notes:

Answer to questions:

Yes, combine them into two categories. A bachelor's degree is a good threshold: bachelor's and above vs. below bachelor's. bachelorpl is already included and can be used directly.
emp_military_ta should be emp_military_p_ta; it is a percentage measure.
The household income features cover different income intervals.
Household growth measures: previous year to current year; expected growth in a five-year projection
Concern about count variables: count measures will be highly correlated with population/density, so using them risks double counting the same factor; if they are used, they need to be integrated with density.
We should stick to the versions provided in the yaml file, for compatibility reasons

Weekly progress tracker:

Proposal completed!
Data cleaning: count - percentage; sports venues; weather-related columns
Ensemble model: LR, RFC, etc.
Q: feature engineering or model tuning at this stage for SK? A: continue model tuning now
Q: which ensemble model was used? A: the features that overlap across 4 feature selection models, plus a linear model, etc. => rank the feature importances and take the top 20 features (positive and negative)
Q: how to debug? A: break the code blocks into smaller chunks

Notes:

Data cleaning: get rid of -9999
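
A minimal sketch of the -9999 cleanup, assuming -9999 is a placeholder for missing values and that the data lives in a pandas DataFrame. The notes do not say whether affected rows should be dropped or imputed, so this only marks the values as missing. The helper name `replace_sentinel` and the toy values are hypothetical.

```python
import pandas as pd

MISSING_SENTINEL = -9999  # value named in the notes; treated here as "missing" (assumption)

def replace_sentinel(df, sentinel=MISSING_SENTINEL):
    """Return a copy of df in which every cell equal to sentinel becomes NaN."""
    return df.mask(df == sentinel)

# Hypothetical toy data; column names are borrowed from this thread for illustration only.
demo = pd.DataFrame({
    "hh_inc_gt_75k_p_ta": [12.5, -9999, 30.0],
    "emp_military_p_ta": [0.4, 1.1, -9999],
})
cleaned = replace_sentinel(demo)
print(cleaned.isna().sum())
```

Dropping rows (`cleaned.dropna()`) or imputing would be a follow-up decision that depends on how many values are affected.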
Results with non-percentage variables (POIs and store only): the model seems to be overfitting a bit; max_depth needs to be tuned (training and testing results should be similar)
Parameters to check for the RFC model:
- min_samples_leaf: default = 1, which is way too small. Start at around 10 at least and see how the results change as you increase it to maybe 50 (50 can work for subways usa as the sample is large; use smaller values for the smaller models)
- min_samples_split: default = 2, also too small. It should be larger than min_samples_leaf; check values around 20 and iterate up from there to see what works best.
The difference between train and test accuracy should be less than 5%. A stable model with lower accuracy is preferred over an unstable model with higher accuracy. Accuracy should be above 60% at minimum.
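
The tuning advice above can be sketched as a small grid over min_samples_leaf and min_samples_split that records the train/test gap for each setting. This is only an illustration under stated assumptions: the data is synthetic (the project data is not shown here), "5%" is read as 5 percentage points of accuracy, the 60% threshold is applied to test accuracy, and the grid values are examples rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): replace with the real feature matrix and target.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

MAX_GAP = 0.05       # train/test accuracy gap limit, read as 5 percentage points
MIN_ACCURACY = 0.60  # lowest acceptable test accuracy

results = []
for leaf in (10, 20, 50):
    for split in (20, 50, 100):
        if split <= leaf:
            continue  # keep min_samples_split larger than min_samples_leaf
        model = RandomForestClassifier(
            n_estimators=50,
            min_samples_leaf=leaf,
            min_samples_split=split,
            random_state=0,
        )
        model.fit(X_train, y_train)
        train_acc = model.score(X_train, y_train)  # mean accuracy on training data
        test_acc = model.score(X_test, y_test)     # mean accuracy on held-out data
        results.append({
            "min_samples_leaf": leaf,
            "min_samples_split": split,
            "train_acc": train_acc,
            "test_acc": test_acc,
            "gap": abs(train_acc - test_acc),
        })

# Prefer stable settings first, then pick the most accurate among them.
stable = [r for r in results if r["gap"] < MAX_GAP and r["test_acc"] > MIN_ACCURACY]
best = max(stable, key=lambda r: r["test_acc"]) if stable else None

for r in results:
    print(r)
print("best stable candidate:", best)
```

Filtering on the gap before ranking by accuracy mirrors the preference for a stable model over a more accurate but unstable one. Note that a node can only be split if each child keeps at least min_samples_leaf samples, so min_samples_split values below twice min_samples_leaf add no further constraint.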

CChCheChen commented 1 year ago

Meeting notes from Sitewise:

CY = 2022
We do have a typo on emp_military; it is actually a percentage variable. It's calculated by dividing by empcy (total employees in the current year).
On the random forest check:
- min_samples_leaf: default = 1, which is way too small. Start at around 10 at least and see how the results change as you increase it to maybe 50 (50 can work for subways usa as the sample is large; use smaller values for the smaller models)
- min_samples_split: default = 2, also too small. It should be larger than min_samples_leaf; check values around 20 and iterate up from there to see what works best.
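
Returning to the emp_military correction above, the count-to-percentage calculation can be sketched as follows. Only empcy comes from these notes; the count column name `emp_military_count`, the helper `count_to_percent`, the 0-100 scale, and the toy values are all assumptions for illustration.

```python
import pandas as pd

def count_to_percent(count, total):
    """Return count as a percentage of total; rows with a zero total become NaN."""
    safe_total = total.where(total != 0)  # avoid dividing by zero
    return 100 * count / safe_total

# Hypothetical toy data: military employee counts and total employees (empcy).
demo = pd.DataFrame({
    "emp_military_count": [5, 0, 12],
    "empcy": [200, 0, 300],
})
demo["emp_military_p_ta"] = count_to_percent(demo["emp_military_count"], demo["empcy"])
print(demo)
```

The same helper could be applied to other count variables if they are converted to percentages during data cleaning.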