variables are collinear

waldronlab / lefser

R implementation of the LEfSe method

https://waldronlab.io/lefser/

48 stars 5 forks source link

variables are collinear #41

Open lwaldron opened 5 months ago

lwaldron commented 5 months ago

This is a new warning being generated by the tests, and I don't think there are actually should be collinear variables here so I'm not sure what's happening.

suppressPackageStartupMessages(library(lefser))
data("zeller14")
zellersub <- zeller14[1:150, zeller14$study_condition != "adenoma"]
## subsetting a DataFrame with NULL
zellersub <- relativeAb(zellersub)
results <- lefser(zellersub, groupCol = "study_condition", blockCol = NULL)
#> The outcome variable is specified as 'study_condition' and the reference category is 'control'.
#>  See `?factor` or `?relevel` to change the reference category.
#> Warning in lda.default(x, grouping, ...): variables are collinear

^{Created on 2024-06-15 with reprex v2.1.0}

lwaldron commented 5 months ago

Likely cause - the relab_sub_t_dfin the call raw_lda_scores <- ldaFunction(relab_sub_t_df, lgroupf) within lefser() looks like this, when these should be relative abundances:

lwaldron commented 5 months ago

Actually these relative abundances are correct, I forgot that they are scaled to add to 1e6. I think this is real collinearity, caused by the presence of synthetic clades (e.g. "Bacteria") where a parent node has only one child, or two children but one is dominant. Two TODOs:

modify tests and examples to do the analysis at species level only.
make warning more helpful, to explain this likely cause of collinearity and recommend analyzing a single taxonomic level only
modify documentation to recommend analyzing a single taxonomic level only

shbrief commented 5 months ago

It seems like you are solving this with the get_terminal_nodes function. Is this something you'd like to add to the lefser function?

shbrief commented 5 months ago

I also want to clarify points 1 & 3: if the terminal node of an input data is a mix of strain, species, and genus, for example, what will be the recommendation?

shbrief commented 5 months ago

9e1afa161f148ce32eb7a370bf46b8eaff9ac574

sdgamboa commented 2 months ago

One of the features of lefse-conda is determining which clades are differentially abundant. For example this result. This clade was created by the lefse program.

I think the abundance of synthetic clades (e.g., Bacteria, etc.) is necessary for the cladogram if the internal nodes are of interest. Otherwise, the cladogram could just depict which taxa in the terminal nodes are differentially abundant.

If the taxonomy is included in the rowData instead of the rownames and we restrict (recommend) the input to only terminal nodes at the same taxonomic level we could use the mia package:

library(mia)
data("GlobalPatterns")
l <- splitByRanks(GlobalPatterns)
l$features <- GlobalPatterns
mergedSE <- mergeSEs(l, collapse.cols = TRUE)
## Instead of merging, the lefse analysis could be run at all taxonomic levels individually