correlation for categorical variable issue

Lindaaaaaa commented 6 years ago

As discussed in the meeting,

I used chi-square tests to check whether pairs of categorical variables are independent. I tested gender against all the other categorical variables, and education level against all the other categorical variables. Most of the tests give a p-value < 0.05, which indicates dependence. For education level in particular, all p-values are < 0.05.

I am not sure I am doing the right thing. If this is correct, then since education level was picked as an important variable, should we include all the other categorical variables too, since they all depend on education level? Any suggestions would be appreciated.

# Gender vs. each of the other categorical variables
# (the p-value in each comment is the one reported by chisq.test for that table)
tb <- table(newdata$GENDER_R, newdata$ED_Level)   # p-value = 1.507e-12, dependent
chisq.test(tb)
tb <- table(newdata$GENDER_R, newdata$age_cat)    # p-value = 0.5242, independent
chisq.test(tb)
tb <- table(newdata$GENDER_R, newdata$Full_part)  # p-value = 7.358e-13, dependent
chisq.test(tb)
tb <- table(newdata$GENDER_R, newdata$pub_priv)   # p-value = 0.2215, independent
chisq.test(tb)
tb <- table(newdata$GENDER_R, newdata$flex_cat)   # p-value = 0.01908, dependent
chisq.test(tb)
tb <- table(newdata$GENDER_R, newdata$NFE12)      # p-value = 0.009079, dependent
chisq.test(tb)
tb <- table(newdata$GENDER_R, newdata$FNFE12JR)   # p-value = 0.003244, dependent
chisq.test(tb)

# Education level vs. each of the other categorical variables
tb <- table(newdata$ED_Level, newdata$age_cat)    # p-value = 2.268e-08, dependent
chisq.test(tb)
tb <- table(newdata$ED_Level, newdata$Full_part)  # p-value = 0.0004919, dependent
chisq.test(tb)
tb <- table(newdata$ED_Level, newdata$pub_priv)   # p-value < 2.2e-16, dependent
chisq.test(tb)
tb <- table(newdata$ED_Level, newdata$flex_cat)   # p-value = 2.766e-10, dependent
chisq.test(tb)
tb <- table(newdata$ED_Level, newdata$NFE12)      # p-value < 2.2e-16, dependent
chisq.test(tb)
tb <- table(newdata$ED_Level, newdata$FNFE12JR)   # p-value < 2.2e-16, dependent
chisq.test(tb)

NSKrstic commented 6 years ago

So I believe your group previously said that you have narrowed down to 6 categorical variables. Among those 6 categorical variables, perform chi-square tests for different pairwise combinations of the variables. Therefore you should have 15 tests. These tests will tell us whether there are associations between pairs of variables (kind of like correlations between continuous variables). However, it's more of a "yes" or "no" test, rather than quantifying the association (although the p-value and contingency tables may give us some idea).

Once you discover which pairs of variables are significantly associated, you may want to make a table (similar to the correlation matrix, but instead it can indicate which ones are significantly associated with others, at the 5% significance level). This is more under the exploratory analysis, and thus shouldn't affect what variables you include in your model.
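For reference, here is a rough sketch of how the 15 pairwise tests and the summary table could be produced in one go. It runs on simulated toy data, not the project's `newdata`; the choice of six column names and all of the values are assumptions made purely for illustration.

```r
# Simulated stand-in for the six categorical variables (hypothetical values).
set.seed(1)
n  <- 500
ed <- sample(c("low", "mid", "high"), n, replace = TRUE)
toy <- data.frame(
  ED_Level = ed,
  age_cat  = sample(c("a", "b", "c"), n, replace = TRUE),
  # pub_priv is made to depend on ED_Level so that one pair is clearly associated
  pub_priv = ifelse(ed == "high",
                    sample(c("public", "private"), n, replace = TRUE, prob = c(0.8, 0.2)),
                    sample(c("public", "private"), n, replace = TRUE, prob = c(0.3, 0.7))),
  flex_cat = sample(c("low", "high"), n, replace = TRUE),
  NFE12    = sample(c("yes", "no"), n, replace = TRUE),
  FNFE12JR = sample(c("yes", "no"), n, replace = TRUE)
)

# All pairwise combinations: choose(6, 2) = 15 columns, one pair per column.
pairs <- combn(names(toy), 2)

# One chi-square test of independence per pair; keep only the p-value.
pvals <- apply(pairs, 2, function(p) {
  chisq.test(table(toy[[p[1]]], toy[[p[2]]]))$p.value
})

# Arrange the p-values in a symmetric matrix, like a correlation matrix.
pmat <- matrix(NA_real_, nrow = ncol(toy), ncol = ncol(toy),
               dimnames = list(names(toy), names(toy)))
idx <- t(pairs)              # 15 x 2 matrix of (row name, column name)
pmat[idx]          <- pvals
pmat[idx[, 2:1]]   <- pvals  # mirror into the other triangle

# TRUE where the pair is significantly associated at the 5% level.
sig <- pmat < 0.05
print(round(pmat, 4))
print(sig)
```

The diagonal is left as `NA` because a variable is not tested against itself.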

When you explore relationships between the response variable and explanatory variables, THEN, you can consider keeping them in your model, even if they get eliminated during variable selection. There should be evidence to suggest that the variable is a potentially important predictor.

Lindaaaaaa commented 6 years ago

Hi @NSKrstic

I understand what you are saying, and I think it is exactly what I am doing. But I am wondering: I already checked that education level is associated with all 5 other categorical variables. I don't even need to check the other associations (the other 10 cases), because education level is picked by model selection. So is the conclusion that all 5 other categorical variables are also associated with our response variable?

To make myself clear: AIC chooses ED_level, Age, Gender, work_flex .... When it comes to the conclusion, since ED_level is associated with all 5 other categorical variables (pri_pub, age, work_flex, NFE12, FNFE12JR), should we conclude that ED_level, Age, Gender, work_flex, pri_pub, NFE12, and FNFE12JR are all associated with num score?

KellyHu commented 6 years ago

So you checked the correlation between explanatory variables (ED_level and the other 5 categorical variables)? From my understanding, that should not affect which variables we include in our model. Let me know if I'm on the right track. Thanks!

NSKrstic commented 6 years ago

@Lindaaaaaa

No, because you're investigating associations between explanatory variables. Like I said, this is more exploratory analysis and should have nothing to do with which model you have or with model selection. What we can conclude is that many of the categorical explanatory variables seem to be associated with each other. That's likely one reason why, once you conducted modelling, several of these categorical variables were not included in the model (since ED_level likely already captures most of the information shared with those other variables). Also, just because ED_level is associated with two other variables doesn't necessarily mean that those two variables are also associated with each other.
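To illustrate that last point, here is a small constructed example (not the project data). The names `x1` and `x2` are hypothetical, and `ed` is built from both of them on purpose, so it is associated with each one while `x1` and `x2` are exactly balanced against each other.

```r
# Constructed example: 100 observations in each (x1, x2) cell,
# so x1 and x2 are independent by design.
x1 <- rep(c(0, 1), each = 200)
x2 <- rep(rep(c(0, 1), each = 100), times = 2)

# ed is defined from both variables, so it is associated with each of them.
ed <- c("low", "mid", "high")[x1 + x2 + 1]

p_x1_x2 <- chisq.test(table(x1, x2))$p.value  # no evidence of association
p_ed_x1 <- chisq.test(table(ed, x1))$p.value  # strong association
p_ed_x2 <- chisq.test(table(ed, x2))$p.value  # strong association

print(c(x1_x2 = p_x1_x2, ed_x1 = p_ed_x1, ed_x2 = p_ed_x2))
```

Here both tests involving `ed` are significant at the 5% level, while the test between `x1` and `x2` is not.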

If you want to do so for your response variable (which is likely what the client may be more interested in), then you would conduct t-tests or ANOVA (which I believe your group has done, correct?). Those results tell us whether or not a categorical variable is associated with the response.
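As a rough sketch of what those tests look like in R: a t-test for a two-level variable and a one-way ANOVA for a variable with more levels. The data below is simulated, the column name `num_score` is a placeholder for the response, and the effect sizes are made up for illustration only.

```r
# Simulated data (hypothetical values, not the project data).
set.seed(2)
n <- 300
toy <- data.frame(
  ED_Level = factor(sample(c("low", "mid", "high"), n, replace = TRUE),
                    levels = c("low", "mid", "high")),
  GENDER_R = factor(sample(c("male", "female"), n, replace = TRUE))
)
# Response built to increase with education level and to ignore gender.
toy$num_score <- 250 + 30 * (as.integer(toy$ED_Level) - 1) + rnorm(n, sd = 40)

# Two levels: Welch two-sample t-test (the default for t.test with a formula).
tt <- t.test(num_score ~ GENDER_R, data = toy)
print(tt)

# More than two levels: one-way ANOVA.
fit <- aov(num_score ~ ED_Level, data = toy)
print(summary(fit))

# Extract the ANOVA p-value for the ED_Level term.
anova_p <- summary(fit)[[1]][["Pr(>F)"]][1]
```

A small p-value from either test suggests that the categorical variable is associated with the response.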

NSKrstic commented 6 years ago

@KellyHu

That's correct. It may let you think about removing explanatory variables if you believe they are highly associated with each other, but also generally inform you about their relationships.

Lindaaaaaa commented 6 years ago

I did a bit more research on correlation between categorical variables. As @NSKrstic said, chi-square only gives a sort of 0/1 answer (whether two variables are associated with each other or not). From the link below, it looks like Cramér's V measures the association by giving a number between 0 and 1. I think it is a better way, since it provides the magnitude of the correlation. https://datascience.stackexchange.com/questions/893/how-to-get-correlation-between-two-categorical-variable-and-a-categorical-variab
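For reference, a minimal sketch of Cramér's V using the usual formula V = sqrt(chi2 / (n * (min(r, c) - 1))), where chi2 is the Pearson chi-square statistic without continuity correction, n is the number of observations, and r and c are the table dimensions. The helper name `cramers_v` is made up here, and this may not be exactly how the numbers in this thread were computed.

```r
# Cramér's V for two categorical vectors (hypothetical helper).
# Assumes both variables have at least two observed levels.
cramers_v <- function(x, y) {
  tb   <- table(x, y)
  # Pearson chi-square statistic, no Yates continuity correction
  chi2 <- suppressWarnings(chisq.test(tb, correct = FALSE)$statistic)
  n    <- sum(tb)
  k    <- min(dim(tb)) - 1
  unname(sqrt(chi2 / (n * k)))
}

# Sanity checks on constructed data:
x <- rep(c("a", "b"), times = 50)
y <- rep(c("u", "v"), each = 50)

v_same  <- cramers_v(x, x)  # a variable against itself: perfect association
v_indep <- cramers_v(x, y)  # perfectly balanced cells: no association
print(c(same = v_same, independent = v_indep))
```

Values near 0 mean little association and values near 1 mean a very strong one, which makes the results easier to lay out like a correlation matrix.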

So the final result for the categorical correlation is

It looks like only FNFAET12JR and NFE12 have a very strong correlation, at 0.83376057.

For the continuous variables part, the result is here. It looks like none of the correlations is very strong.

tom-hc-park / STAT550-450-for-Seniorworkers-from-Korea
