tom-hc-park / STAT550-450-for-Seniorworkers-from-Korea


Method part of our group proposal #6

Open KellyHu opened 6 years ago

KellyHu commented 6 years ago

Hi everyone!

I was writing the method part of our group proposal and here are some of my concerns:

  1. Can we use goodness of fit for model selection?
  2. I remember from the meeting that our client used a linear regression. I suggest using glm instead of lm in R, since glm can fit several types of regression rather than only linear regression: for example, logistic regression for a binary dependent variable, or Poisson regression for a count dependent variable. I understand that so far we only have continuous dependent variables, so linear regression may be the most suitable choice. However, we should stay open to other cases that might come up if our client requests them. So should I write glm in our method? glm can also fit a linear regression (with its default gaussian family), so don't worry.
KellyHu commented 6 years ago

Goodness of fit: for example, we can look at the R^2 and/or adjusted R^2 of the model

NSKrstic commented 6 years ago
  1. One of the primary concerns with using goodness of fit measures for model selection is that we may overlook overfitting. The more variables we add, the better the fit on our own data: R^2 never decreases when a predictor is added, even a useless one. Unfortunately, an extremely complex model (with many predictors) is going to fit the data "too well". This means we lose the ability to generalize the model to the population of interest (South Korean senior workers in the private sector). The model also ends up performing poorly outside of our dataset, whether on new data or when we attempt to make predictions.

We can potentially use Adjusted R^2 (since it penalises models with more predictors), but there are two other measures typically used for model selection. These are AIC and BIC. I've linked a resource that gives a brief overview of both measures:

http://onlinelibrary.wiley.com/doi/10.1002/9781118856406.app5/pdf
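
To make the overfitting point concrete, here is a minimal sketch in plain Python (standard library only). Everything in it is an assumption for illustration: the data are simulated, not our survey data; `ols_rss` is a hypothetical helper; and AIC/BIC are written in the Gaussian least-squares form, n*log(RSS/n) plus a penalty, which drops constants that are identical for every model fit to the same data. In R the corresponding values come from `summary(fit)$adj.r.squared`, `AIC(fit)` and `BIC(fit)`.

```python
import math
import random


def ols_rss(columns, y):
    """Fit y on an intercept plus the given columns by least squares; return the RSS."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented k x (k+1) matrix.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for p in range(k):
        best = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[best] = A[best], A[p]
        for r in range(k):
            if r != p:
                f = A[r][p] / A[p][p]
                A[r] = [a - f * b for a, b in zip(A[r], A[p])]
    beta = [A[r][k] / A[r][r] for r in range(k)]
    return sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2
               for i in range(n))


random.seed(1)
n = 60
x1 = [random.gauss(0, 1) for _ in range(n)]          # the one real predictor
noise = [[random.gauss(0, 1) for _ in range(n)]      # 8 predictors unrelated to y
         for _ in range(8)]
y = [2.0 + 1.5 * a + random.gauss(0, 1) for a in x1]

ybar = sum(y) / n
tss = sum((v - ybar) ** 2 for v in y)

results = []
for extra in range(9):
    cols = [x1] + noise[:extra]
    k = len(cols) + 1                                # coefficients incl. intercept
    rss = ols_rss(cols, y)
    r2 = 1 - rss / tss
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)
    aic = n * math.log(rss / n) + 2 * k
    bic = n * math.log(rss / n) + k * math.log(n)
    results.append((extra, r2, adj_r2, aic, bic))
    print(f"noise predictors={extra}  R2={r2:.4f}  adjR2={adj_r2:.4f}  "
          f"AIC={aic:.2f}  BIC={bic:.2f}")
```

R^2 can only go up as the noise predictors are added, while adjusted R^2, AIC and BIC charge for each extra coefficient. BIC's charge (log(n) per coefficient) exceeds AIC's (2 per coefficient) once n is larger than about 7, so BIC tends to prefer smaller models.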

  2. I don't believe GLM will be necessary. The proficiency variables are continuous and the skill use variables are continuous/ordinal. So we would use neither logistic regression nor Poisson regression. Anything beyond that may be a little too advanced for you.
KellyHu commented 6 years ago

Thank you so much for your advice!! I will revise that in our proposal.

gcohenfr commented 6 years ago

Perhaps this is a good place to discuss the characteristics of your response variable. If it is mostly continuous but one value (y=1 or y=0) gets many observations, then some assumptions of the inference used by linear regression may not be satisfied. We have been discussing this in class but I don't see any comments about it here or in the proposal.
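
As a rough illustration of that concern, the sketch below simulates a response that is continuous except for a pile-up at 0 and fits a straight line to it by least squares. This is only one assumed mechanism (a latent normal variable cut off at zero), with made-up numbers. It is not the project's data, and the variable names are hypothetical.

```python
import random

random.seed(7)
n = 2000
x = [random.uniform(-2, 2) for _ in range(n)]
# Assumed mechanism: latent value x + noise, recorded as 0 whenever it is negative.
y = [max(0.0, xi + random.gauss(0, 1)) for xi in x]

# Simple linear regression of y on x (closed-form least squares).
xbar = sum(x) / n
ybar = sum(y) / n
slope = (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
         / sum((a - xbar) ** 2 for a in x))
intercept = ybar - slope * xbar
fitted = [intercept + slope * a for a in x]
resid = [b - f for b, f in zip(y, fitted)]

share_at_zero = sum(1 for b in y if b == 0.0) / n

# Compare the residual spread where the pile-up dominates (x < 0) with the rest.
low = [r for a, r in zip(x, resid) if a < 0]
high = [r for a, r in zip(x, resid) if a >= 0]
ms_low = sum(r * r for r in low) / len(low)
ms_high = sum(r * r for r in high) / len(high)

print(f"share of observations at y=0: {share_at_zero:.2f}")
print(f"fitted slope: {slope:.2f} (latent slope used in the simulation: 1.00)")
print(f"smallest fitted value: {min(fitted):.2f} (y itself is never below 0)")
print(f"mean squared residual, x<0: {ms_low:.2f}   x>=0: {ms_high:.2f}")
```

In this simulated setting the residual spread is clearly different in the two halves of the data, and the line predicts negative values that the response can never take. So the constant-variance and normal-error assumptions behind the usual t-tests and confidence intervals are doubtful here. Whether our actual response behaves like this is something to check on the real data.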