mozhao0331 / Restaurant_Segmentation_Analysis

MIT License

2023-05-17 Official Partner Meeting Agenda #52

Closed mozhao0331 closed 1 year ago

mozhao0331 commented 1 year ago

Moderator: Morris Notetaker: Xinru

Weekly check-in: Project Board

Question:

  1. Can we group the education levels into two categories (e.g. below and above a threshold)?

  2. emp_military_ta: is the feature percentage instead of counts?

  3. Are these features related, just using different measurements?

    • hh_inc_gt_500k_p_ta
    • hh_inc_gt_75k_p_ta
    • hh_inc_lt_75k_p_ta (the above two features have a correlation of -1)
  4. Can you explain hhgrpycy_ta and hhgrfypy_ta?

  5. Should we use percentage or count?

  6. Which pandas version should we use?

Lorraine97 commented 1 year ago

Meeting notes:

Answer to questions:

  1. Yes, combine them into two categories. Bachelor is a good threshold: Bachelor and above vs. below Bachelor. The bachelorpl feature is already included and can be used directly.
  2. emp_military_ta should be named emp_military_p_ta; it is a percentage measure.
  3. These features cover different intervals of household income.
  4. Both are household growth measures: hhgrpycy_ta is growth from the previous year to the current year, and hhgrfypy_ta is expected growth in a five-year projection.
  5. Concern with count variables: counts will be highly correlated with population/density, so using them double counts those factors. If counts are used, they need to be integrated with density.
  6. We should stick to the versions provided in the yaml file, for compatibility reasons
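
To make answers 3 and 5 concrete, here is a minimal pandas sketch on made-up data. Only hh_inc_gt_75k_p_ta, hh_inc_lt_75k_p_ta and empcy come from this thread; pop_ta, hh_count_ta, emp_military_count and emp_military_p are hypothetical names, all values are invented, and a 0-100 percentage scale is assumed.

```python
import pandas as pd

# Hypothetical trade-area rows: every value below is invented for illustration.
df = pd.DataFrame({
    "pop_ta": [1000, 5000, 20000, 50000],             # hypothetical population count
    "hh_count_ta": [400, 2000, 8000, 20000],          # hypothetical household count
    "emp_military_count": [10, 25, 400, 250],         # hypothetical raw count
    "empcy": [500, 2500, 10000, 25000],               # total employees, current year
    "hh_inc_gt_75k_p_ta": [20.0, 35.0, 50.0, 65.0],   # percentage, assumed 0-100
})

# Complementary percentages carry the same information, so their correlation is -1.
df["hh_inc_lt_75k_p_ta"] = 100.0 - df["hh_inc_gt_75k_p_ta"]
corr_complement = df["hh_inc_gt_75k_p_ta"].corr(df["hh_inc_lt_75k_p_ta"])

# A raw count tends to track population, which is the double-counting concern.
corr_count_pop = df["hh_count_ta"].corr(df["pop_ta"])

# Turning a count into a percentage by dividing by total employees (empcy).
df["emp_military_p"] = 100.0 * df["emp_military_count"] / df["empcy"]

print(round(corr_complement, 3), round(corr_count_pop, 3))
print(df["emp_military_p"].tolist())
```

With a pair like this, keeping only one of the two complementary income features loses nothing, and the percentage form removes most of the dependence on trade-area size.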

Weekly progress tracker:

Notes:

CChCheChen commented 1 year ago

Meeting notes from Sitewise:

  1. CY = 2022
  2. We do have a typo on emp_military; it is actually a percentage variable. It's calculated by dividing by empcy (total employees in the current year).
  3. On the random forest check:

    • min_samples_leaf: the default of 1 is way too small. Start at 10 or more and see how results change as you increase it to maybe 50 (50 can work for Subway USA because the sample is large; use smaller values for the smaller models).
    • min_samples_split: the default of 2 is likewise too small. It should be larger than min_samples_leaf; check values around 20 and iterate up from there to see what works best.
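
One way to iterate over these two settings is a small grid search with scikit-learn. This is only a sketch under assumptions: it uses synthetic data from make_regression, assumes a regression target (swap in RandomForestClassifier for a classification target), and the candidate values are illustrative starting points taken from the ranges suggested above, not tuned recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real data (assumption: regression target).
X, y = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=0)

leaf_values = [10, 20, 50]      # min_samples_leaf candidates: start around 10, go up to 50
split_values = [20, 40, 100]    # min_samples_split candidates: start around 20, go up

# Keep only combinations where min_samples_split is larger than min_samples_leaf.
param_grid = [
    {
        "min_samples_leaf": [leaf],
        "min_samples_split": [s for s in split_values if s > leaf],
    }
    for leaf in leaf_values
]

search = GridSearchCV(
    RandomForestRegressor(n_estimators=30, random_state=0),
    param_grid=param_grid,
    cv=3,  # 3-fold cross-validation; default scoring for a regressor is R^2
)
search.fit(X, y)

print(search.best_params_)
```

For the smaller models, the same grid can be rerun with lower leaf and split values, since large minimums leave very few splits on a small sample.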