Unclear matrix row after agglomerateByRank

maartenciers commented 1 month ago

Describe the bug I noticed some weird behavior after I agglomerate on a Rank where you always get one additional row which makes no sense to me.

To Reproduce

# Agglomerate by Genus
tse_genus = agglomerateByRank(tseObj, rank = "Genus", NArm = TRUE, onRankOnly = TRUE)
head(assay(tse_genus,"counts"))
# Family:Lachnospiraceae                                   15753   <----- Why is this here? This is not Genus level information?
# Genus:[Clostridium] innocuum group                 253
# Genus:[Eubacterium] brachy group                         0
# Genus:[Eubacterium] fissicatena group                   0
# Genus:[Eubacterium] nodatum group                     0
# Genus:[Eubacterium] siraeum group                      54
....

# Family:Lachnospiraceae 15753 This should not be present in the in the assay to begin with or is this usefull for something else that I'm not understanding correctly?

I can't share my data as it is confidential but I took the time to investigate it further an replicated how phyloseq functions agglomerates their object and applied it to TreeSummarizedExperiment objects in a very similar matter. Because I had doubts about the agglomeration in general + it would be cool to understand what happens in the back-end.

##### CUSTOM AGGLOMERATION Functions ######
aggregateMatrixRows = function(matrix = assay(tseObj,"counts"), aggregation_list = spCliques){
    # Initialize an empty list to store aggregated rows
    aggregated_data = list()

    for (key in names(aggregation_list)) {
        # Get the row names to aggregate
        row_names = aggregation_list[[key]]

        # Check if all row names exist in the matrix
        if (all(row_names %in% rownames(matrix))) {
        # Subset the matrix to get the rows to aggregate
        rows_to_aggregate = matrix[row_names, , drop = FALSE]

        # Sum the rows
        aggregated_row = colSums(rows_to_aggregate)

        ## Change the key to last ASV in row_names
        ##key = row_names[length(row_names)]
        # Change the key to ASV with the highest count total (like merge_taxa -> phyloseq)
        # Find the index of the maximum value
        rowsum = rowSums(rows_to_aggregate)
        max_index = which.max(rowsum)
        key = names(max_index)

        # Store the result in the list
        aggregated_data[[key]] = aggregated_row
        } else {
        warning(paste("Some row names in", key, "do not exist in the matrix"))
        }
    }

    # Convert the list to a matrix
    aggregated_matrix = do.call(rbind, aggregated_data)
    # order the aggregated matrix like the original matrix 
    order_indices = na.omit(match(rownames(matrix), rownames(aggregated_matrix)))
    aggregated_matrix =aggregated_matrix[order_indices,]

    return(aggregated_matrix)
}
agglom = function(tseObj = tseObj,
                  rank = "Genus",
                  NArm = TRUE,
                  bad_empty = c(NA, "", " ", "\t"),
                  assay = "counts" # only works for counts at the moment other transformations need to be redone
){
    # Check inputs
    if( !rank %in% taxonomyRanks(tseObj)){
        stop("Bad taxrank argument. Must be among the values of rank_names(physeq)")        
    }

    # Make a vector from the taxonomic data.
    RankIndex = which(taxonomyRanks(tseObj) %in% rank) # I use incase there are multiple elements
    tax_data = as.character(rowData(tseObj)[,RankIndex])
    names(tax_data) = rownames(rowData(tseObj))
    length(tax_data)

    # if NArm is TRUE, remove the empty, white-space, NA values from 
    if (NArm){
        keep_asv = names(tax_data[!(tax_data %in% bad_empty)])
        # subset tseObj
        tseObj = tseObj[rownames(rowData(tseObj)) %in% keep_asv,]
    }

    # Concatenate data up to the taxrank column, use this for agglomeration
    tax_data = rowData(tseObj)[,1:RankIndex,drop = FALSE]
    tax_data = apply(tax_data, 1, function(i){paste(i, sep=";_;", collapse=";_;")})
    head(tax_data)

    # Define the ASV cliques to loop through
    spCliques = tapply(names(tax_data), factor(tax_data), list)

    # Successively merge taxa in tseObj for selected assay similar to merge_taxa function phyloseq
    glom_assay = aggregateMatrixRows(assay(tseObj,assay),spCliques)

    # Create new TreeSE based on the aggloeration
    new_assay = list()
    new_assay[[assay]] = glom_assay
    # create new taxonomy
    row_data = rowData(tseObj)[rownames(glom_assay),]
    # "Empty" the values to the right of the rank, using NA_character_.
    if( RankIndex < length(taxonomyRanks(tseObj)) ){
        badcolumns = taxonomyRanks(tseObj)[(RankIndex+1):length(taxonomyRanks(tseObj))]
        row_data[, badcolumns] = NA_character_
    }
    col_data = colData(tseObj)
    tse = TreeSummarizedExperiment(assay = new_assay,
                                   colData = col_data,
                                   rowData = row_data
                                   )

    return(tse)
}

Run the functions with: (I know it's not the proper way to just create a new tse object and only agglom one assay but it's good enough for a quick test)

tse_genus_test = agglom(tseObj,rank = "Genus", assay = "counts")

I get the exact same matrix in a new object for that assay apart from that one row. To me this proves that agglomeration function works correctly but I don't understand why that row is there so it would be nice if you could elaborate on that part.

Side note: I like how this package is progressing since last time I made an issue, the OMA book is improving!

TuomasBorman commented 1 month ago

Hello!

Thanks for the issue. We have noticed this thing, and we are fixing this so that the function works more intuitively.

When na.rm = FALSE in agglomerateByRank, the function does not remove row that does not have any value in certain rank (in your case, row that do not have genus level ranking). Instead, it finds the lowest available rank (in your case family). The default value for na.rm was FALSE, but we thought that it is better to change to TRUE.

onRankOnly should be used with caution. When FALSE, the function check the whole taxonomy when agglomerating rows. When TRUE, only specified level (in your case genus) will be considered when agglomeration is done. For example, some two bacteria with same genus level might still differ in family level. If onRankOnly=TRUE, these rows will be agglomerated to one row. If onRankOnly=FALSE, this would lead to two unique rows.

So, your code should work if you do (notice that you had typo):

temp <- agglomerateByRank(tse, rank = "Genus", na.rm = TRUE, onRankOnly = FALSE)
altExp(tse, "Genus") <- temp

In future version, we will modify the function so that it defaults to these settings. Hopefully this helps. We are finalizing mia* and OMA this year for formal publication, and these issues and feedback help a lot.

-Tuomas

antagomir commented 1 month ago

Thinking about the onRankOnly=TRUE behavior. The behavior, where two genera with same name but different identities are agglomerated, is imo misleading and potentially harmful.

Shouldn't the method check whether two bacteria with same genus level name might have different family level names? In my opinion, the aggregation to genus level in such case should not be based only on the genus name but on the genus identity (there are two different genus identities if their family information differs). If it was done like this, the function could still a) throw a warning as the situation in unexpected; and b) replace the genus names with another, shortest unique name specifying the genus (this could include the family name).

maartenciers commented 1 month ago

Hey, thank you very much for the explanation. I Agree with @antagomir it would be more intuitive to not have the onRankOnly option but rather the function automatically checking if two genera with same name have different identities. Because this is the whole point of agglomerating by rank imo. I can't think of a situation where combining two genera with different families could be usefull

I would also suggest to change default behavior for the na.rm argument to TRUE as this would be more inline with what you want to investigate as you will be forced to exclude these rows anyways when you make visualizations. Thanks for the fast responses!

TuomasBorman commented 1 month ago

onRankOnly=TRUE works like agglomerateByVariable.


> agglomerateByRank(tse, rank = "Genus", onRankOnly = TRUE)
class: TreeSummarizedExperiment 
dim: 984 26 
metadata(1): agglomerated_by_rank
assays(1): counts
rownames(984): Class:Thermoprotei Genus:4-29 ... Genus:Zplanct13 Genus:Zymomonas
rowData names(7): Kingdom Phylum ... Genus Species
colnames(26): CL3 CC1 ... Even2 Even3
colData names(7): X.SampleID Primer ... SampleType Description
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
rowLinks: a LinkDataFrame (984 rows)
rowTree: 1 phylo tree(s) (19216 leaves)
colLinks: NULL
colTree: NULL
> agglomerateByVariable(tse, MARGIN = 1, f = "Genus")
class: TreeSummarizedExperiment 
dim: 984 26 
metadata(0):
assays(1): counts
rownames(984): 4-29 4041AA30 ... Zplanct13 Zymomonas
rowData names(7): Kingdom Phylum ... Genus Species
colnames(26): CL3 CC1 ... Even2 Even3
colData names(7): X.SampleID Primer ... SampleType Description
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
rowLinks: a LinkDataFrame (984 rows)
rowTree: 1 phylo tree(s) (19216 leaves)
colLinks: NULL
colTree: NULL

onRankOnly could be removed, since it does not bring any extra value.

a) throw a warning as the situation in unexpected; and b) replace the genus names with another, shortest unique name specifying the genus (this could include the family name).

The function does the b) but not give warning when onRankOnly=FALSE

TuomasBorman commented 4 weeks ago

@antagomir should we remove onRankOnly?

antagomir commented 4 weeks ago

Ok

microbiome / mia

Unclear matrix row after agglomerateByRank #559