Some issues and questions

tengfei-emory commented 3 weeks ago

Thanks for the continuous development of this package! I am trying to apply it to some real data and here are some issues/questions I'd like to share:

Singular matrix t(X) %*% X in qr.solve() function under the emuFit_micro() function: when there are many categorical variables in the formula, it is possible to encounter singular t(X) %*% X matrix. Is it ok to add a small number to the diagonal of t(X) %*% X as an approximation of the inverse matrix?
zero_comparison column in the coef element of the output: apart from TRUE and FALSE entries, there are also NA entries which I'm not sure how to interpret. Looks like on the line 9 of zero_comparison_check.R, columns with col_shared > 1 are excluded, which will exclude not only the intercept term but also categorical variables with only two levels.
Because there are NAs in the zero_comparison column, the assignment of ind on the line #691 of emuFit.R can also get NAs, which causes error when trying to modify the corresponding p-values. I modified line #693 as coefficients[**which(ind)**, col] <- NA to avoid the error.

svteichman commented 3 weeks ago

Hi, thanks for these really useful questions/comments!

I'll have to think more about this. What error exactly are you encountering? Is your matrix exactly singular, or computationally singular?
The zero-comparison situation can only happen for categorical covariates with more than 2 levels, because when there are only 2 levels, we filter out any categories that have 0 counts for all samples, which means that we will never be comparing two levels of all zeros for a category. However, I see why the NA is confusing. I'll update this so that this is recorded as FALSE instead. I really appreciate you pointing out this output and difficulty with its interpretation!
I'll look into this and modify the source code to avoid it. Once again, thanks for letting us know about it.

svteichman commented 3 weeks ago

Points 2 and 3 are now addressed by PR #79, by changing the value of coef$zero_comparison to FALSE in cases in which it was previously NA.

tengfei-emory commented 3 weeks ago

The singular model matrix issue turns out to be a glitch from my side (there were many categorical variables and some of them completely confounded with each other). The new version addressed the problems and now the function runs without errors.

Many thanks!

svteichman commented 3 weeks ago

That's great, thanks for following up with us!

adw96 commented 2 weeks ago

Thanks for using the package and especially for your helpful feedback, @tengfei-emory! Keep it coming -- we welcome it! 👍

Thanks also to @svteichman for the efficient fix 🙏

@MariaAVC I think it would be great to run a check that the user hasn't input a rank-deficient X. Teng knew exactly how to isolate the problem and why this would cause an issue (he's an expert statistician! 😻 ) but most users wouldn't! I think corncob has such a check -- though I vaguely remember that I didn't love the specific message that was given* -- perhaps we could have one for radEmu? Would you please be able to implement and have @svteichman and @gthopkins co-review?

Thank you all!!!

*once we settle on an error message we're happy with, let's add it to corncob, too. Happy to iterate on it (the 4 of us) over Slack.

statdivlab / radEmu

Some issues and questions #78