laresbernardo / lares

Analytics & Machine Learning R Sidekick
https://laresbernardo.github.io/lares/

corr_var for categorical variables? #35

Closed michellekee closed 3 years ago

michellekee commented 3 years ago

Hi Lares,

Thank you for solving my p-value problem earlier on! I love using the corr_var function, for quick and easy analyses just to know how my data is.

I tried using corr_var with a factor variable, but it required me to list the factor specifically (e.g: gender_male). This caused the error message of not having enough observations to plot, since I wanted to plot only the max_pvalue = 0.05. However, if I used corr_cross, I would see that the gender is specifically correlated with my list of 60 variables.

Is there a similar way for corr_var to work with categorical variables like how it works with continuous variables? We don't need to know which level of the categorical variable is correlated; we just need to know which variables are correlated with the categorical variable of interest. :)

Hope to hear from you soon!

Thank you!

laresbernardo commented 3 years ago

Hi Michelle. I’m glad that worked out for you. Have you tried using the corr_cross function defining the contains parameter? That may work out for you.


michellekee commented 3 years ago

Hi Lares,

I tried corr_cross, but was hoping to get an overview-like plot as in corr_var. I'll try corr_cross again. :)

I'm sorry to hijack this thread, but I'm now getting an error message with corr_var. I could plot without problems before:

Demo_Parenting %>% corr_var(GA, ignore = c("a","b","c","d"), method = "spearman", plot = T, pvalue =T, max_pvalue = 0.05)

but it is giving me an error msg now.

`[1] variables corr pvalue
<0 rows> (or 0-length row.names)
Warning message:
In corr_var(., GA, ignore = c("a", "b", "c", :
  There are not enough observations to plot. Check your 'max_pvalue' input`

However, when plot = F, it shows me the columns with p-values that are < 0.05. I've saved my previous images, but I'm getting the new errors now as I'm knitting the .rmd. Please kindly advise. Thank you!

laresbernardo commented 3 years ago

I've just updated the corr_* functions to be a bit more efficient, but I'm not sure whether that fixed this issue. Could you install the latest dev version and re-check? If it still fails, could you share the dataset with me so I can dig deeper?

michellekee commented 3 years ago

I've installed the latest dev version and tried with the command below:

df %>% corr_var(V4, ignore=c("V5","V6","V7","V8"), method = "spearman", plot = T, pvalue =T, max_pvalue = 0.05)

It still gives an error.

`[1] variables corr pvalue
<0 rows> (or 0-length row.names)
Warning message:
In corr_var(., V4, ignore = c("V5", "V6", "V7", "V8"), method = "spearman", :
  There are not enough observations to plot. Check your 'max_pvalue' input`

But when plot = F, it shows me the columns with p-values < .05. Attached is the dataset. :) Thank you so much!

[df.csv](https://github.com/laresbernardo/lares/files/7326515/df.csv)

P/S: corr_cross doesn't seem to have this problem, though for some variables the cross-correlations get repeated in the plot (but in the reverse order).

laresbernardo commented 3 years ago

Thanks for sharing your dataset. It makes a huge difference when debugging.

As you might have discovered already, the function transforms categorical variables with one hot encoding (ohse()) so it can calculate correlations between numerical values. What happened here is that V4 is actually categorical and you must select a transformed variable: if you run colnames(ohse(df)) you'll notice you now have V4_Female and V4_Male. Additionally you can pass the redundant = TRUE parameter to also get V4_NAs as the third option if required explicitly.
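To illustrate the one-hot step described above without depending on lares, here is a minimal base R sketch. It uses `model.matrix()` as a stand-in for `ohse()`, so the column naming is only mimicked, and the toy data (including the `V9` column) is made up for illustration rather than taken from the attached dataset.

```r
# Toy data: V4 is categorical, V9 is numeric (values invented for illustration)
df <- data.frame(
  V4 = factor(c("Female", "Male", "Female", "Male", "Female", "Male")),
  V9 = c(1.2, 3.4, 1.1, 3.9, 0.8, 4.2)
)

# One 0/1 column per level; "- 1" removes the intercept so no level is dropped
dummies <- model.matrix(~ V4 - 1, data = df)

# Rename V4Female / V4Male to the V4_Female / V4_Male style used in the thread
colnames(dummies) <- sub("^V4", "V4_", colnames(dummies))
print(colnames(dummies))

# A correlation can now be computed against one specific level
r_female <- cor(dummies[, "V4_Female"], df$V9, method = "spearman")

# Dummy columns of the same variable are strongly correlated with each other
r_siblings <- cor(dummies[, "V4_Female"], dummies[, "V4_Male"])
```

With only two levels and no missing values, the two dummy columns are exact complements, which is why sibling columns such as V4_Male tend to dominate any correlation ranking for V4_Female.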

Knowing that, this works:

corr_var(df, V4_Female,
         ignore = c("V5","V6","V7","V8"),
         method = "spearman",
         plot = TRUE, pvalue = TRUE,
         max_pvalue = 1)

Now, if you reduce max_pvalue to 0.05, you'll only get the correlations with the other one-hot columns of the same categorical variable:

        variables      corr        pvalue
V4_Male   V4_Male -0.689684 1.014694e-210
V4_NAs     V4_NAs -0.377415  1.313741e-51

The actual bug here is that, for computing reasons, I ignored the pvalue = TRUE case when printing the plot because I do not show that information on the plot. BUT, if you use the max_pvalue filter, you obviously need those values. Fixed in the latest dev version. Could you please retry and let me know if that worked out for you?
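The dependency described above (a `max_pvalue` filter can only work if p-values were actually computed) can be sketched in base R. This is not the lares implementation; the data, the column names `target`, `strong` and `weak`, and the result layout are assumptions chosen to mirror the `variables / corr / pvalue` output shown in the thread.

```r
# Toy data, invented for illustration
df <- data.frame(
  target = c(1, 2, 3, 4, 5, 6, 7, 8),
  strong = c(2, 1, 4, 3, 6, 5, 8, 7),  # closely tracks target
  weak   = c(3, 8, 1, 6, 2, 7, 4, 5)   # little relation to target
)

others <- setdiff(names(df), "target")

# Compute the coefficient AND its p-value for every other column;
# the filter below has nothing to work with if the p-values are skipped
res <- do.call(rbind, lapply(others, function(v) {
  ct <- cor.test(df[[v]], df$target, method = "spearman", exact = TRUE)
  data.frame(variables = v,
             corr = unname(ct$estimate),
             pvalue = ct$p.value,
             stringsAsFactors = FALSE)
}))

# Keep only the rows that pass the significance threshold
max_pvalue <- 0.05
significant <- res[res$pvalue <= max_pvalue, ]
print(significant)
```

If the p-value column were never filled in, the same filter would leave zero rows, which is consistent with the "not enough observations to plot" warning reported earlier.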

michellekee commented 3 years ago

Hi,

I used install_github("laresbernardo/lares") to install the new dev version and restarted R. (Hope this is correct!)

It works now for V4. However, I then tried again with the following, which worked previously:

corr_var(df, V7, ignore = c("V5","V8", "V9", "V10", "V11", "V12", "V13"), method = "spearman", plot = F, pvalue = T, max_pvalue = 0.05)

I had initially gotten the correlations with p < .05. Now I do not get the values, nor the plot.

Could you please kindly advise again? Thank you!

laresbernardo commented 3 years ago

Hi @michellekee. If you are looking variable by variable, I strongly suggest running corr_cross() instead, which does that for you. On the other hand, after ignoring all those variables in the latest example, only 9% of rows have no missing data. This might be part of the problem. Let me look into other possible issues on my side to check what's happening and why it is not giving you anything. Will get back to you. Thanks for reporting this problem!
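To check how many rows survive once missing values are taken into account, a quick base R sketch like the one below may help. The data is invented, and how lares handles missing values internally is not shown in this thread, so treat the listwise versus pairwise comparison as a general illustration only.

```r
# Toy data with scattered missing values (invented for illustration)
df <- data.frame(
  V7  = c(1, 2, NA, 4, 5, 6, 7, 8, 9, 10),
  V9  = c(2, NA, 3, 5, NA, 7, 8, NA, 10, 11),
  V10 = c(NA, 1, 2, NA, 4, 5, NA, 7, 8, 9)
)

# Share of rows with no missing value in ANY column
complete_share <- mean(complete.cases(df))
print(complete_share)

# Rows usable for a single pair of columns (V7 and V9 only)
pair_n <- sum(complete.cases(df[, c("V7", "V9")]))
print(pair_n)

# Listwise deletion uses only the fully complete rows ...
cor_listwise <- cor(df, method = "spearman", use = "complete.obs")

# ... while pairwise deletion uses every row available for each pair
cor_pairwise <- cor(df, method = "spearman", use = "pairwise.complete.obs")
```

Here only 3 of 10 rows are fully complete, while 6 rows are available for the V7 and V9 pair alone, so a few sparsely filled columns can shrink the usable sample considerably.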

laresbernardo commented 3 years ago

Ok, I've fixed this issue. For some reason that, I must admit, I don't quite understand, setting the default value of cor's exact parameter to FALSE did not return the values as it should. I've changed the default to TRUE, and it should now work as seen in this screenshot. You can always set it back to FALSE manually, but I've kept TRUE as the default for now.
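For reference, here is a small base R sketch of what an `exact` switch changes. In base R the argument of that name belongs to `stats::cor.test()` (`stats::cor()` has none), so that is what this sketch uses; how lares forwards the argument is not shown in this thread, and the data below is invented.

```r
# Toy data without ties (invented for illustration)
x <- 1:10
y <- c(2, 1, 4, 3, 6, 5, 8, 7, 10, 9)

# exact = TRUE requests an exact p-value for Spearman's rho
exact_test <- cor.test(x, y, method = "spearman", exact = TRUE)

# exact = FALSE falls back to an asymptotic approximation
approx_test <- cor.test(x, y, method = "spearman", exact = FALSE)

# The coefficient is the same either way; only the p-value computation differs
print(c(rho = unname(exact_test$estimate),
        p_exact = exact_test$p.value,
        p_approx = approx_test$p.value))
```

Note that with tied values `cor.test()` cannot compute an exact p-value and warns about it, which is common with one-hot encoded columns.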

Screen Shot 2021-10-13 at 9 02 34 AM

Let me know if it works out for you and if you encounter any other issues.

michellekee commented 3 years ago

It works perfectly now. Thank you so much! You are fast and efficient as always. Truly appreciate this! :)