tom-hc-park / STAT550-450-for-Seniorworkers-from-Korea

First draft proposal is ready for review #9

Open Lindaaaaaa opened 6 years ago

Lindaaaaaa commented 6 years ago

It's in the documents folder. Please give us some feedback. Thanks!

KellyHu commented 6 years ago

Thanks, everyone, for your efforts on our proposal! Looking forward to any comments!

tom-hc-park commented 6 years ago

Added a link to the proposal in the README file. Thanks for your effort!

gcohenfr commented 6 years ago

Good start to both groups! The proposal needs some more work. Please use your own words when describing the project; do not copy-paste from the client's proposal. Also, include the size of the data to give some idea of how many variables the model may have.

For the research questions, are you going to study the "effect" of covariates on workers' skills? Do the demographics affect the skills or explain the skills? Be careful with the wording used. S550 should help with this.

You propose using stepwise AIC. Why? What is the goal?

KellyHu commented 6 years ago

Dear Professor Gabriela @gcohenfr,

Thank you for the feedback! There are 1247 observations of 25 variables in our original dataset. I will add that to our proposal.

Thanks!

Best, Jingyi

Lindaaaaaa commented 6 years ago

Using stepwise AIC could help with model selection (i.e., finding which variables are important). AIC is more robust to correlation between explanatory variables. @gcohenfr
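For readers unfamiliar with the procedure, here is a minimal sketch of forward stepwise selection by AIC. It is not the group's actual analysis: the function names (`ols_rss`, `gaussian_aic`, `forward_stepwise_aic`) and the synthetic data are made up for illustration, and the AIC is the Gaussian linear-model form, n·ln(RSS/n) + 2k, which is correct only up to an additive constant shared by all models.

```python
import math
import random

def ols_rss(columns, y):
    """Residual sum of squares of a least-squares fit of y on the given columns."""
    n, p = len(y), len(columns)
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    a = [[sum(columns[i][r] * columns[j][r] for r in range(n)) for j in range(p)]
         + [sum(columns[i][r] * y[r] for r in range(n))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(a[r][c]))
        a[c], a[pivot] = a[pivot], a[c]
        for r in range(p):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [x - f * z for x, z in zip(a[r], a[c])]
    beta = [a[i][p] / a[i][i] for i in range(p)]
    return sum((y[r] - sum(b * col[r] for b, col in zip(beta, columns))) ** 2
               for r in range(n))

def gaussian_aic(rss, n, k):
    """AIC of a Gaussian linear model, up to an additive constant."""
    return n * math.log(rss / n) + 2 * k

def forward_stepwise_aic(predictors, y):
    """Add one predictor at a time for as long as doing so lowers the AIC."""
    n = len(y)
    intercept = [1.0] * n
    selected = []
    best = gaussian_aic(ols_rss([intercept], y), n, 1)
    while True:
        trials = []
        for name in predictors:
            if name in selected:
                continue
            cols = [intercept] + [predictors[m] for m in selected + [name]]
            trials.append((gaussian_aic(ols_rss(cols, y), n, len(cols)), name))
        if not trials or min(trials)[0] >= best:
            return selected, best
        best, chosen = min(trials)
        selected.append(chosen)

# Synthetic data (an assumption for illustration): y depends on x1 and x2 only.
random.seed(0)
n_obs = 200
x1 = [random.gauss(0, 1) for _ in range(n_obs)]
x2 = [random.gauss(0, 1) for _ in range(n_obs)]
noise = [random.gauss(0, 1) for _ in range(n_obs)]
y = [2.0 * a - 1.5 * b + random.gauss(0, 0.5) for a, b in zip(x1, x2)]

selected, best_aic = forward_stepwise_aic({"x1": x1, "x2": x2, "noise": noise}, y)
print("selected:", selected, "AIC:", round(best_aic, 1))
```

Note that AIC can still admit an irrelevant variable by chance, so the selected set is a starting point for discussion rather than a final answer.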

gcohenfr commented 6 years ago

Sure, but this information is not in the proposal. My point is that you need to elaborate on it a bit more.

gcohenfr commented 6 years ago

I forgot to mention that S550 does not need to draft a proposal at this point. S550 can include their proposal together with that of S450 if they are ready, but only the S450 proposal is mandatory for Thursday, Feb 8th.

ekroc commented 6 years ago

Just a practical recommendation regarding data-driven model-selection: I'd recommend considering both the AIC and the BIC when selecting a model purely from the given data. The AIC is designed to choose the "best" predictive model (in some sense, under certain conditions), while the BIC is designed to choose the "truest" model for the data (again, in some sense under certain conditions). In particular, the BIC penalizes for overly-complex models that may nevertheless be good at predictions. Oftentimes, the "best" model identified by AIC will be the same as under the BIC, which will lend confidence to your model choice if this happens. If not, they will still often lead you to a natural small set of candidate models to choose from.
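To make the difference in penalties concrete, here is a small sketch using the standard definitions AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L). The two candidate models and their log-likelihoods are hypothetical numbers chosen for illustration, not results from the project data; only n = 1247 comes from this thread.

```python
import math

def aic(log_lik, k):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return k * math.log(n) - 2 * log_lik

n = 1247  # observations in the original dataset, as reported in this thread

# Hypothetical candidates: (maximized log-likelihood, number of parameters).
candidates = {
    "small": (-1500.0, 5),
    "large": (-1490.0, 10),
}

scores = {
    name: {"aic": aic(ll, k), "bic": bic(ll, k, n)}
    for name, (ll, k) in candidates.items()
}

# Lower is better for both criteria.
best_by_aic = min(scores, key=lambda m: scores[m]["aic"])
best_by_bic = min(scores, key=lambda m: scores[m]["bic"])

print(f"per-parameter penalty: AIC = 2, BIC = ln({n}) = {math.log(n):.2f}")
print("AIC picks:", best_by_aic)
print("BIC picks:", best_by_bic)
```

With a sample this size, each extra parameter costs about 7.1 under the BIC versus 2 under the AIC, which is why the two criteria can disagree in favor of the smaller model under the BIC.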

liuzhen529 commented 6 years ago

Yes, I think we could use AIC and BIC together to determine what models and factors we should use. Thanks for your suggestion.

Also, during our meeting we found one erroneous observation, so our number of observations should be 1246. As for variables, we decided not to use the two variables about managing people, since they have too many missing values. For private/public sector, we are still discussing whether we should treat it as a covariate. So the number of variables needs further discussion.

NSKrstic commented 6 years ago

Hey STAT 450,

You can find my personal feedback on the proposal within the Documents folder here.

Apologies for the delay. Also, as Gaby has mentioned and I previously addressed during our last meeting, please refrain from copying external material into your writing without proper citation. This includes the client's proposal.

Let me know if any of you have any more questions or need any of my feedback clarified.

NSKrstic commented 6 years ago

Regarding which variables to include/eliminate, as Zhen has brought up, I think eliminating the manager-related variables is necessary since almost half of the observations have missing information. If we decide not to eliminate them, another concern is that the variables are linked with one another. Being a manager (variable "Mgr") means that there is a non-zero number of people they manage (variable "Mgr_c"), and vice versa. So likely only the latter should be chosen.
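Both points can be checked directly on the data. A minimal sketch, assuming a hypothetical coding (`Mgr`: 1 = manager, 0 = not; `Mgr_c`: number of people managed; `None` = missing) and made-up records, since the real coding is not stated in this thread:

```python
# Hypothetical records for illustration only; None marks a missing value.
records = [
    {"Mgr": 1, "Mgr_c": 4},
    {"Mgr": 0, "Mgr_c": 0},
    {"Mgr": None, "Mgr_c": None},
    {"Mgr": 1, "Mgr_c": 12},
    {"Mgr": None, "Mgr_c": None},
    {"Mgr": 0, "Mgr_c": 0},
]

def missing_fraction(rows, column):
    """Share of rows in which `column` is missing."""
    return sum(row[column] is None for row in rows) / len(rows)

def mgr_is_redundant(rows):
    """True if, on complete rows, Mgr == 1 exactly when Mgr_c > 0."""
    complete = [r for r in rows
                if r["Mgr"] is not None and r["Mgr_c"] is not None]
    return all((r["Mgr"] == 1) == (r["Mgr_c"] > 0) for r in complete)

for column in ("Mgr", "Mgr_c"):
    print(column, "missing:", f"{missing_fraction(records, column):.0%}")
print("Mgr fully determined by Mgr_c:", mgr_is_redundant(records))
```

If the second check holds on the real data, `Mgr` carries no information beyond `Mgr_c`, which supports keeping at most one of the two.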

For the private/public sector, if it's true that the sample is representative of the workforce in South Korea, then I think it should be fine to consider including it as a predictor in your model. Although the classes are imbalanced, I don't think the imbalance is extreme, considering that the number of observations from the public sector is greater than 10% of the dataset. You may just need to consider potential limitations due to this imbalance. However, I think both a "public+private" analysis and a "private only" analysis are justifiable (you could even conduct both).
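A quick way to quantify the imbalance and set up both analyses is sketched below. The labels and the 85/15 split are hypothetical, made up for illustration; only the 10% rule of thumb comes from the comment above.

```python
from collections import Counter

def class_shares(labels):
    """Proportion of observations falling in each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Hypothetical sector labels, not the real data.
sector = ["private"] * 85 + ["public"] * 15

shares = class_shares(sector)
minority_label = min(shares, key=shares.get)
print(minority_label, "share:", f"{shares[minority_label]:.0%}")
print("above the 10% rule of thumb:", shares[minority_label] > 0.10)

# Index sets for the two candidate analyses.
all_rows = list(range(len(sector)))                                   # "public+private"
private_rows = [i for i, s in enumerate(sector) if s == "private"]    # "private only"
print(len(all_rows), len(private_rows))
```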