Open lwaldron opened 5 months ago
Likely cause: the relab_sub_t_df in the call raw_lda_scores <- ldaFunction(relab_sub_t_df, lgroupf) within lefser() looks like this, when these should be relative abundances:
Actually these relative abundances are correct, I forgot that they are scaled to add to 1e6. I think this is real collinearity, caused by the presence of synthetic clades (e.g. "Bacteria") where a parent node has only one child, or two children but one is dominant. Two TODOs:
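As an aside on the collinearity point above, here is a minimal base-R sketch of how a parent clade with a single child ends up collinear with that child. All names and values (species_a, genus_a, family_ab) are made up for illustration and are not taken from the lefser test data.

```r
# Toy abundances for two hypothetical species across 10 samples.
set.seed(1)
n <- 10
species_a <- runif(n, 100, 1000)
species_b <- runif(n, 100, 1000)

# Hypothetical single-child parent: its abundance is a copy of its only child.
genus_a <- species_a
# Hypothetical parent with two children: its abundance is their sum.
family_ab <- species_a + species_b

x <- cbind(species_a, species_b, genus_a, family_ab)

# Four columns, but the parent columns add no new information,
# so the numerical rank of the matrix is lower than its column count.
rank_x <- qr(x)$rank
cor_parent_child <- cor(genus_a, species_a)

print(rank_x)
print(cor_parent_child)
```

If the assumption holds that parent abundances are sums of their children, any table that includes both internal and terminal nodes is rank-deficient by construction, which is one reason a discriminant analysis might report collinear variables.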
It seems like you are solving this with the get_terminal_nodes function. Is this something you'd like to add to the lefser function?
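For readers following along, the idea of keeping only terminal nodes can be sketched roughly as below. This is not the implementation of get_terminal_nodes; terminal_nodes_sketch is a hypothetical helper, and it assumes the feature names are full lineage strings separated by "|", which may not match the actual input.

```r
# Hypothetical helper (NOT the package's get_terminal_nodes()):
# keep only lineage strings that are not a proper prefix of another lineage.
# Assumes names are full lineages joined by `sep`.
terminal_nodes_sketch <- function(lineages, sep = "|") {
  is_parent <- vapply(
    lineages,
    # A lineage is a parent if some other lineage starts with it plus `sep`.
    function(x) any(startsWith(lineages, paste0(x, sep))),
    logical(1),
    USE.NAMES = FALSE
  )
  lineages[!is_parent]
}

# Made-up example lineages.
lineages <- c(
  "k__Bacteria",
  "k__Bacteria|p__Firmicutes",
  "k__Bacteria|p__Firmicutes|c__Bacilli",
  "k__Bacteria|p__Proteobacteria"
)
terminal <- terminal_nodes_sketch(lineages)
print(terminal)
```

Note that in this toy input the two surviving nodes sit at different ranks (class and phylum), which is exactly the mixed-level situation the next question asks about.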
I also want to clarify points 1 & 3: if the terminal nodes of an input dataset are a mix of strain, species, and genus, for example, what would the recommendation be?
9e1afa161f148ce32eb7a370bf46b8eaff9ac574
One of the features of lefse-conda is determining which clades are differentially abundant. For example, see this result. This clade was created by the lefse program.
I think the abundance of synthetic clades (e.g., Bacteria, etc.) is necessary for the cladogram if the internal nodes are of interest. Otherwise, the cladogram could just depict which taxa in the terminal nodes are differentially abundant.
If the taxonomy is included in the rowData instead of the rownames, and we restrict (or at least recommend restricting) the input to terminal nodes at the same taxonomic level, we could use the mia package:
library(mia)
data("GlobalPatterns")
l <- splitByRanks(GlobalPatterns)
l$features <- GlobalPatterns
mergedSE <- mergeSEs(l, collapse.cols = TRUE)
## Instead of merging, the lefse analysis could be run at all taxonomic levels individually
This is a new warning being generated by the tests. I don't think there should actually be collinear variables here, so I'm not sure what's happening.
Created on 2024-06-15 with reprex v2.1.0