laresbernardo / lares

Analytics & Machine Learning R Sidekick
https://laresbernardo.github.io/lares/
233 stars 49 forks

How to handle dataset with NA #30

Closed abbaslab closed 3 years ago

abbaslab commented 3 years ago

Hello, really nice package. I am having issues when my matrix has NA values, which prevents me from using any of the lares functionality. How can I overcome the "not enough finite observations" error? I tried the pvalue = FALSE option, but it generated an error indicating that the value is matched by multiple others. I would appreciate a workaround for this. I think the error is coming from cor.test. Ideally, I'd like to still calculate p-values while accounting only for those paired samples that have non-NA values, similar to using cor(use = "pairwise.complete.obs"). Thank you.

laresbernardo commented 3 years ago

Hi @cBioPort Could you please share a reproducible example? I think I know where the issue might be, but I need a reproducible example to fix it and then share the solution. I'd gladly check on that and fix it ASAP. Sorry for the delay, I was on vacation! :)

laresbernardo commented 3 years ago

@cBioPort still having this issue?

abbaslab commented 3 years ago

Thanks for your response. No, the issue has not been resolved. I am not sure if the example below is clear. Briefly, it's a data frame (df) of values for which I would like to calculate correlations, focusing on Test2. However, there are several NAs in the dataset. Of course, I have many more rows and columns; below is just an excerpt. Since there are NAs/incomplete values, I guess that is what causes the error listed at the bottom of this response. Please let me know if this is clear.

I think it has to do with the NA values, and it's not producing correlations for the full data.

corr_var(df,       # name of dataset
         Test2,    # name of variable to focus on
         top = 20) # display top 20 correlations

          Test1     Test2     Test3      Test4  Test5  Test6
Sample 1  186.5790  230.8611  104.28005  NA     290    NA     239.4563
Sample2   169.9229  251.8448  143.89394  180    NA     NA     179.1268
Sample3   118.0477  193.7258  63.59918   NA     300    NA     115.6406
Sample4   205.6939  248.9553  271.52190  282.1907

Couldn't calculate p-values: Error in cor.test.default(x, y, method = method, conf.level = 0.95): not enough finite observations

To continue, try 'pvalue' = FALSE and/or check your data.

Error in data.frame(variables = colnames(rs$cor), corr = rs$cor[, c(var)], : arguments imply differing number of rows: 230, 0

laresbernardo commented 3 years ago

Ok, @cBioPort, thanks for letting me know. Could you please share a reproducible example as I do not have your data to replicate your case?

laresbernardo commented 3 years ago

Hi @cBioPort There is no Test2 column in your data.frame, and there is an unnamed column as well (the last one). How exactly do you import that txt file, so I can replicate your example? A csv is friendlier.

abbaslab commented 3 years ago

Can you try this one? test.txt

laresbernardo commented 3 years ago

OK! Every row has at least one missing value. If your vectors do not contain enough non-NA values (fewer than 3), the function will return that error. So there is no bug here, just not enough non-NA rows! I'll add a friendlier message that will help users debug these cases.
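A minimal sketch of why the error appears, using made-up toy vectors rather than the attached test.txt. It relies only on base R (`complete.cases`, `cor.test`, `crossprod`) and is not the lares implementation: `cor.test` drops incomplete pairs first, and the default Pearson test stops when fewer than 3 pairs remain.

```r
# Toy vectors (hypothetical data): only positions 1 and 4 have both values.
x <- c(1.0, NA, 3.0, 4.0, NA)
y <- c(2.0, 5.0, NA, 8.5, NA)

# Number of pairs where neither x nor y is NA
n_complete <- sum(complete.cases(x, y))

# cor.test() raises an error here; capture its message instead of stopping
msg <- tryCatch(
  {
    cor.test(x, y)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
print(n_complete)
print(msg)

# For a whole matrix, count the complete observations for every column pair
m <- cbind(x = x, y = y)
present <- !is.na(m)
storage.mode(present) <- "numeric"
pair_counts <- crossprod(present)
print(pair_counts)
```

Any off-diagonal entry of `pair_counts` below 3 marks a pair of columns for which a Pearson p-value cannot be computed.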

abbaslab commented 3 years ago

In each row, there are NA values. However, the majority of the values are there. When using the cor function, we can use cor(use = "pairwise.complete.obs"), which allows us to correct for this. Does this mean that we can't overcome this in your code?

laresbernardo commented 3 years ago

> In each row, there are NA values. However, the majority of the values are there. When using the cor function, we can use cor(use = "pairwise.complete.obs"), which allows us to correct for this. Does this mean that we can't overcome this in your code?

Actually, that's exactly how we calculate correlations. Check the code here. Do you propose any other way we could calculate cross-correlations with your dataset? Feel free to use the existing code, adapt it so it runs for your case, and then we can merge the changes. Note that we CAN calculate the corr(df, top = 20) values, but not the p-values.
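To illustrate the point about pairwise-complete correlations, here is a small sketch with a made-up data frame (not the attached test.txt) in which every row has an NA, yet base R's `cor` still returns a coefficient for each pair of columns. It only demonstrates the `use = "pairwise.complete.obs"` behaviour of `stats::cor`; how lares calls it internally is not shown here.

```r
# Hypothetical data: each row is missing exactly one of the three columns.
df <- data.frame(
  a = c(NA, 2, 3, NA, 5, 6, NA, 8, 9),
  b = c(2.2, NA, 6.5, 8.1, NA, 11, 14.3, NA, 18.2),
  c = c(9.1, 8, NA, 6.2, 5, NA, 3.3, 2.5, NA)
)

# No row is fully observed, so listwise deletion would leave nothing
n_full_rows <- sum(complete.cases(df))
print(n_full_rows)

# Pairwise deletion: each coefficient uses the rows where both columns exist
r <- cor(df, use = "pairwise.complete.obs")
print(round(r, 3))
```

Each pair of columns here shares three observed rows, which is enough for a coefficient; the number of shared rows can differ from pair to pair in real data.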

laresbernardo commented 3 years ago

Reproducible example:

df <- read.table("test.txt")
corr_cross(df, top = 20, pvalue = FALSE)
corr_var(df, Test2, top = 20, pvalue = FALSE)

Gives you these outputs:

[Two screenshots, taken 2021-04-07: the corr_cross and corr_var outputs]

If pvalue = FALSE is not passed, you'll get (for this dataset):

Error in corr(df, ignore = ignore, limit = limit, ...) : 
  Can't calculate pvalues: There are not enough rows (>2) without missing observations.
Try adding 'pvalue = FALSE' or fixing your dataset.
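For readers who still want p-values wherever the data allow, one possible workaround outside lares is sketched below. `pairwise_pvalues` and its `min_n` argument are hypothetical names introduced for illustration and are not part of the package; the sketch uses only base R's `cor.test` on the complete pairs of each column combination and leaves `NA` where fewer than `min_n` pairs exist. The data are made up.

```r
# Hypothetical helper (not a lares function): p-value matrix with pairwise deletion.
pairwise_pvalues <- function(df, min_n = 3) {
  vars <- names(df)
  p <- matrix(NA_real_, length(vars), length(vars),
              dimnames = list(vars, vars))
  for (i in seq_along(vars)) {
    for (j in seq_along(vars)) {
      if (i >= j) next  # fill each pair once; the diagonal stays NA
      ok <- complete.cases(df[[i]], df[[j]])
      if (sum(ok) >= min_n) {
        pv <- cor.test(df[[i]][ok], df[[j]][ok])$p.value
        p[i, j] <- pv
        p[j, i] <- pv
      }
    }
  }
  p
}

# Toy data: x and y share 5 observed rows, x and z only 2, y and z only 1.
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2.1, 3.9, 6.2, 8.1, NA, 12.3),
  z = c(NA, NA, NA, 1, 5, NA)
)
p <- pairwise_pvalues(toy)
print(p)
```

Pairs with too few shared observations come back as `NA` instead of stopping the whole computation, which mirrors what `cor(use = "pairwise.complete.obs")` does for the coefficients.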