unhcr / koboloadeR

Deprecated: please use {kobocruncher} - https://edouard-legoupil.github.io/kobocruncher/
https://unhcr.github.io/koboloadeR/docs/

Add more correlation tests #19

Open Edouard-Legoupil opened 6 years ago

Edouard-Legoupil commented 6 years ago

Currently the data crunching report only handles the Chi-square test between select_one variables: https://github.com/unhcr/koboloadeR/blob/gh-pages/inst/script/3-generate-report.R#L500:L625

More correlation tests could be handled. [screenshots attached]

Edouard-Legoupil commented 5 years ago

numeric to categorical: --> ANOVA

  • if ordinal variable: Kruskal-Wallis Test > kruskal.test()
  • Mann-Whitney (this is the same as a two-sample Wilcoxon test) > wilcox.test()
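
A minimal sketch of the three calls, assuming a data frame df with a numeric column num and a grouping factor grp (placeholder names, not from the package):

# num: numeric variable, grp: categorical factor (placeholder names)
summary(aov(num ~ grp, data = df))   # ANOVA: normal numeric vs. categorical factor
kruskal.test(num ~ grp, data = df)   # Kruskal-Wallis: ordinal / non-normal case
wilcox.test(num ~ grp, data = df)    # Mann-Whitney / two-sample Wilcoxon: grp must have exactly two levels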

Edouard-Legoupil commented 5 years ago

Maybe the tests could be accompanied by better graphs from https://indrajeetpatil.github.io/ggstatsplot/
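
A minimal sketch of what that could look like, assuming a data frame df with a numeric column num and a grouping factor grp (placeholder names):

# ggbetweenstats() draws a group-comparison plot annotated with the test statistics
ggstatsplot::ggbetweenstats(data = df, x = grp, y = num)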

maherdaoud commented 5 years ago

What I am trying now is to cover all cases of correlation tests between the dependent variable (target) and the independent variables. Let's start.

1. Normal Distribution Numerical variable VS Normal Distribution Numerical variable
We need to test whether there is a Monotonic relationship, a Linear relationship, or a non-Linear relationship.

Linear relationships are monotonic, but not all monotonic relationships are linear.

  • Monotonic variables increase (or decrease) in the same direction, but not always at the same rate.
  • Linear variables increase (or decrease) in the same direction at the same rate.

First we need to define the dependent and independent variables in R with the code below (a sketch of the Pearson test for this case follows the code).
    tar <- df[,target] #Dependent variable
    field <- df[,col] #Independent variable
    #df is a dataframe that contains your dataset
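
For the test itself in this case, a Pearson correlation could be run on tar and field, mirroring the confidence scale used in the other cases below (a sketch, not yet in the package):

#tar :Normal Distribution Numerical variable
#field :Normal Distribution Numerical variable
pearsonTest <- stats::cor.test(tar, field, method = "pearson")
p_value <- pearsonTest$p.value
if(p_value > 0.05){
  levelOfConfidence <- "very weak"
  isPredictor <- "No"
}else if(p_value >= 0.01){
  levelOfConfidence <- "moderate"
  isPredictor <- "Yes"
}else if(p_value >= 0.005){
  levelOfConfidence <- "strong"
  isPredictor <- "Yes"
}else{
  levelOfConfidence <- "very strong"
  isPredictor <- "Yes"
}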

2. Normal Distribution Numerical variable VS Non Normal Distribution Numerical variable
In this case we can't use the Pearson correlation test, because the assumption behind that test is:

For the Pearson r correlation, both variables should be normally distributed (normally distributed variables have a bell-shaped curve)

Here, we have to apply the Spearman correlation test and compute the Mutual Information to check whether there is a Monotonic relationship or a non-Linear relationship; a sketch follows.
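
A sketch for this case; the Mutual Information part assumes the infotheo package (not currently a koboloadeR dependency), which needs the variables to be discretised first:

#tar :Normal Distribution Numerical variable
#field :Non Normal Distribution Numerical variable
spearmanTest <- stats::cor.test(tar, field, method = "spearman")
p_value <- spearmanTest$p.value

# mutual information on discretised copies of the two variables (infotheo is an assumption here)
disc <- infotheo::discretize(data.frame(tar, field))
mi <- infotheo::mutinformation(disc$tar, disc$field)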

3. Normal Distribution Numerical variable VS Binary nominal factor
In this case, you can apply a Two independent samples t-test to check whether there is a relationship between the two variables; use the R code below to run this test.

#tar :Normal Distribution Numerical variable
#field :Binary nominal factor
tTest <- t.test(tar~field )

p_value <- tTest$p.value
if(p_value > 0.05){
 levelOfConfidence <- "very weak"
}else if(p_value <= 0.05 & p_value >= 0.01 ){
 levelOfConfidence <- "moderate"
 isPredictor <- "Yes"
}else if(p_value <= 0.01 & p_value >= 0.005 ){
 levelOfConfidence <- "strong"
 isPredictor <- "Yes"
}else{
 levelOfConfidence <- "very strong"
 isPredictor <- "Yes"
}

4. Normal Distribution Numerical variable VS Multinomial factor
For a Multinomial variable we have to apply the ANOVA test, as in the R code below.

#tar :Normal Distribution Numerical variable
#field :Multinomial factor
test <- aov(tar~field) # the numeric variable is the response, the factor the predictor

p_value <- summary(test)[[1]][["Pr(>F)"]][1]
if(p_value > 0.05){
 levelOfConfidence <- "very weak"
}else if(p_value <= 0.05 & p_value >= 0.01 ){
 levelOfConfidence <- "moderate"
 isPredictor <- "Yes"
}else if(p_value <= 0.01 & p_value >= 0.005 ){
 levelOfConfidence <- "strong"
 isPredictor <- "Yes"
}else{
 levelOfConfidence <- "very strong"
 isPredictor <- "Yes"
}

5. Non Normal Distribution Numerical variable VS Binary nominal factor
For a Non Normal Distribution Numerical variable we should use the Wilcoxon-Mann-Whitney test:

#tar :Non Normal Distribution Numerical variable
#field :Binary nominal factor
tTest <- stats::wilcox.test(tar~field)
p_value <- tTest$p.value
if(p_value > 0.05){
  levelOfConfidence <- "very weak"
}else if(p_value <= 0.05 & p_value >= 0.01 ){
  levelOfConfidence <- "moderate"
  isPredictor <- "Yes"
}else if(p_value <= 0.01 & p_value >= 0.005 ){
  levelOfConfidence <- "strong"
  isPredictor <- "Yes"
}else{
  levelOfConfidence <- "very strong"
  isPredictor <- "Yes"
}

6. Non Normal Distribution Numerical variable VS Multinomial factor
To find out whether there is a relationship between these variables, you need to apply the Kruskal-Wallis Test using the R code below.

#tar :Non Normal Distribution Numerical variable
#field :Multinomial factor
tTest <- stats::kruskal.test(tar~field )
p_value <- tTest$p.value
if(p_value > 0.05){
  levelOfConfidence <- "very weak"
}else if(p_value <= 0.05 & p_value >= 0.01 ){
  levelOfConfidence <- "moderate"
  isPredictor <- "Yes"
}else if(p_value <= 0.01 & p_value >= 0.005 ){
  levelOfConfidence <- "strong"
  isPredictor <- "Yes"
}else{
  levelOfConfidence <- "very strong"
  isPredictor <- "Yes"
}

To be continued ...

maherdaoud commented 5 years ago

In the previous comment, we talked about the Nominal factor and which tests are suitable to check whether there is a relationship with a numerical variable. Now, let's talk about the Ordinal factor, which is the second type of variable measurement scale.

7. Ordinal factor VS Numerical variable
In this case, we need to apply Spearman's Rank-Order Correlation test using the R code below.

#tar :Numerical variable
#field :Ordinal factor
cor(rank(tar),rank(field), method = "spearman")
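
Note that method = "spearman" already works on ranks internally, so the explicit rank() calls are optional. Using cor.test() additionally returns a p-value that fits the confidence scale used above (a sketch):

#tar :Numerical variable
#field :Ordinal factor
spearmanTest <- stats::cor.test(tar, as.numeric(field), method = "spearman")
p_value <- spearmanTest$p.value
rho <- spearmanTest$estimate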

8. Nominal factor VS Nominal factor
To check if there is a relationship between two Nominal variables, you first need to build a contingency table and then run a Chi-squared Test; check the following R code.

We apply Cramer's V test to measure the association between the two variables.

#tar :Nominal factor
#field :Nominal factor
contingency_table <- table( tar, field ) 
test <- chisq.test(contingency_table , correct=F)
p_value <- test$p.value
cramerTest <- rcompanion::cramerV(contingency_table, digits = 4)
statisticalTests <- "Chi-squared Test | Cramer's V test"
if(p_value > 0.05 | cramerTest < 0.29){
  levelOfConfidence <- "very weak"
}else if(p_value <= 0.05 & p_value >= 0.01 ){
  levelOfConfidence <- "moderate"
  isPredictor <- "Yes"
}else if(p_value <= 0.01 & p_value >= 0.005 ){
  levelOfConfidence <- "strong"
  isPredictor <- "Yes"
}else{
  levelOfConfidence <- "very strong"
  isPredictor <- "Yes"
}

9. Ordinal factor VS Ordinal factor
For this case, we need to apply the Linear-by-Linear Association Test.

The null hypothesis for the linear-by-linear test is that there is no association among the variables in the table. A significant p-value suggests that there is an association. This is similar to a chi-square test, except that the categories are ordered in nature.

You can use the following R code

#tar :Ordinal factor
#field :Ordinal factor
contingency_table <- table(tar, field)
test <- coin::lbl_test(contingency_table)
p_value <- coin::pvalue(test)
statisticalTests <- "Linear-by-Linear Association Test"
if(p_value > 0.05){
  levelOfConfidence <- "very weak"
}else if(p_value <= 0.05 & p_value >= 0.01 ){
  levelOfConfidence <- "moderate"
  isPredictor <- "Yes"
}else if(p_value <= 0.01 & p_value >= 0.005 ){
  levelOfConfidence <- "strong"
  isPredictor <- "Yes"
}else{
  levelOfConfidence <- "very strong"
  isPredictor <- "Yes"
}

10. Ordinal factor VS Nominal factor
If the Nominal factor has only two categories and the Ordinal factor has k categories, we need to use the Cochran–Armitage Test, using the R code below.

#field: Nominal factor
#tar:  Ordinal factor
contingency_table <- table(field, as.numeric(tar), dnn = c("independent", "dependent"))
test <- coin::chisq_test(contingency_table, scores = list("dependent" = seq(1, length(levels(tar)))))
p_value <- coin::pvalue(test)
statisticalTests <- "Cochran–Armitage Test"
if(p_value > 0.05){
  levelOfConfidence <- "very weak"
}else if(p_value <= 0.05 & p_value >= 0.01 ){
  levelOfConfidence <- "moderate"
  isPredictor <- "Yes"
}else if(p_value <= 0.01 & p_value >= 0.005 ){
  levelOfConfidence <- "strong"
  isPredictor <- "Yes"
}else{
  levelOfConfidence <- "very strong"
  isPredictor <- "Yes"
}

I think now we need to cover other cases, such as Date variable VS Numerical variable, Date variable VS Ordinal factor, and a Nominal factor with more than two categories VS an Ordinal factor.