Open maartenciers opened 1 month ago
Hello!
Thanks for the issue. We have noticed this thing, and we are fixing this so that the function works more intuitively.
When na.rm = FALSE
in agglomerateByRank
, the function does not remove row that does not have any value in certain rank (in your case, row that do not have genus level ranking). Instead, it finds the lowest available rank (in your case family). The default value for na.rm
was FALSE
, but we thought that it is better to change to TRUE
.
onRankOnly
should be used with caution. When FALSE
, the function check the whole taxonomy when agglomerating rows. When TRUE
, only specified level (in your case genus) will be considered when agglomeration is done. For example, some two bacteria with same genus level might still differ in family level. If onRankOnly=TRUE
, these rows will be agglomerated to one row. If onRankOnly=FALSE
, this would lead to two unique rows.
So, your code should work if you do (notice that you had typo):
temp <- agglomerateByRank(tse, rank = "Genus", na.rm = TRUE, onRankOnly = FALSE)
altExp(tse, "Genus") <- temp
In future version, we will modify the function so that it defaults to these settings. Hopefully this helps. We are finalizing mia* and OMA this year for formal publication, and these issues and feedback help a lot.
-Tuomas
Thinking about the onRankOnly=TRUE behavior. The behavior, where two genera with same name but different identities are agglomerated, is imo misleading and potentially harmful.
Shouldn't the method check whether two bacteria with same genus level name might have different family level names? In my opinion, the aggregation to genus level in such case should not be based only on the genus name but on the genus identity (there are two different genus identities if their family information differs). If it was done like this, the function could still a) throw a warning as the situation in unexpected; and b) replace the genus names with another, shortest unique name specifying the genus (this could include the family name).
Hey, thank you very much for the explanation. I Agree with @antagomir it would be more intuitive to not have the onRankOnly option but rather the function automatically checking if two genera with same name have different identities. Because this is the whole point of agglomerating by rank imo. I can't think of a situation where combining two genera with different families could be usefull
I would also suggest to change default behavior for the na.rm argument to TRUE as this would be more inline with what you want to investigate as you will be forced to exclude these rows anyways when you make visualizations. Thanks for the fast responses!
onRankOnly=TRUE works like agglomerateByVariable.
> agglomerateByRank(tse, rank = "Genus", onRankOnly = TRUE)
class: TreeSummarizedExperiment
dim: 984 26
metadata(1): agglomerated_by_rank
assays(1): counts
rownames(984): Class:Thermoprotei Genus:4-29 ... Genus:Zplanct13 Genus:Zymomonas
rowData names(7): Kingdom Phylum ... Genus Species
colnames(26): CL3 CC1 ... Even2 Even3
colData names(7): X.SampleID Primer ... SampleType Description
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
rowLinks: a LinkDataFrame (984 rows)
rowTree: 1 phylo tree(s) (19216 leaves)
colLinks: NULL
colTree: NULL
> agglomerateByVariable(tse, MARGIN = 1, f = "Genus")
class: TreeSummarizedExperiment
dim: 984 26
metadata(0):
assays(1): counts
rownames(984): 4-29 4041AA30 ... Zplanct13 Zymomonas
rowData names(7): Kingdom Phylum ... Genus Species
colnames(26): CL3 CC1 ... Even2 Even3
colData names(7): X.SampleID Primer ... SampleType Description
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
rowLinks: a LinkDataFrame (984 rows)
rowTree: 1 phylo tree(s) (19216 leaves)
colLinks: NULL
colTree: NULL
onRankOnly could be removed, since it does not bring any extra value.
a) throw a warning as the situation in unexpected; and b) replace the genus names with another, shortest unique name specifying the genus (this could include the family name).
The function does the b) but not give warning when onRankOnly=FALSE
@antagomir should we remove onRankOnly?
Ok
Describe the bug I noticed some weird behavior after I agglomerate on a Rank where you always get one additional row which makes no sense to me.
To Reproduce
# Family:Lachnospiraceae 15753
This should not be present in the in the assay to begin with or is this usefull for something else that I'm not understanding correctly?I can't share my data as it is confidential but I took the time to investigate it further an replicated how phyloseq functions agglomerates their object and applied it to TreeSummarizedExperiment objects in a very similar matter. Because I had doubts about the agglomeration in general + it would be cool to understand what happens in the back-end.
Run the functions with: (I know it's not the proper way to just create a new tse object and only agglom one assay but it's good enough for a quick test)
I get the exact same matrix in a new object for that assay apart from that one row. To me this proves that agglomeration function works correctly but I don't understand why that row is there so it would be nice if you could elaborate on that part.
Side note: I like how this package is progressing since last time I made an issue, the OMA book is improving!