microbiome / mia

Microbiome analysis
https://microbiome.github.io/mia/
Artistic License 2.0

Critical issue with mergeFeaturesByPrevalence #419

Closed antagomir closed 1 month ago

antagomir commented 1 year ago

Unexpected and potentially critical behavior with mergeFeaturesByPrevalence.

Let's first prepare example data.

library(mia)
data(GlobalPatterns, package="mia")
tse <- GlobalPatterns
tse <- transformAssay(tse, assay.type="counts", method="relabundance")

Here we merge features by prevalence, and return the result at the family level.

nrow(mergeFeaturesByPrevalence(tse, rank="Family", assay.type="relabundance", detection  = 0.5/100, prevalence  = 20/100))
#> 35

Here we merge features first by Family level grouping, then by prevalence.

altExp(tse, "Family") <- mergeFeaturesByRank(tse, rank="Family")
nrow(mergeFeaturesByPrevalence(altExp(tse, "Family"), assay.type="relabundance", detection  = 0.5/100, prevalence  = 20/100))
#> 2

The same happens when we treat Family as a group rather than a rank:

altExp(tse, "Family2") <- mergeFeatures(tse, f="Family")
nrow(mergeFeaturesByPrevalence(altExp(tse, "Family2"), assay.type="relabundance", detection  = 0.5/100, prevalence  = 20/100))
#> 2

Interestingly, checking the prevalences manually yields yet another set of numbers... :-D

sum(rowMeans(assay(altExp(tse, "Family"), "relabundance") > 0.5/100) > 0.2)
#> 26

sum(rowMeans(assay(altExp(tse, "Family2"), "relabundance") > 0.5/100) > 0.2)
#> 19

All these approaches should be safely expected to yield the same result, I think. Failing to do so may lead to critical bugs in analyses.

Perhaps the reason is obvious but at first this seems very confusing.

To be investigated and fixed as a top priority.

antagomir commented 1 year ago

I suggest that @ake123 checks and @TuomasBorman can review?

antagomir commented 1 year ago

Good to first diagnose the issue and discuss the solution.

In both cases, the low-level features should be aggregated to the higher level (here, Family) before agglomerating the rare groups. This is one possible source for the behavior.
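A minimal reference computation along these lines (a sketch, assuming the relabundance assay and the thresholds used above; the exact comparison operators and the "Other" collapsing may differ from the package internals):

# Aggregate to Family first, then check which groups pass the detection/prevalence filter
fam <- mergeFeaturesByRank(tse, rank = "Family")
keep <- rowMeans(assay(fam, "relabundance") > 0.5/100) > 20/100
# Rare groups would be collapsed into a single "Other" feature, so the expected
# row count is sum(keep) plus one if any rare groups exist
sum(keep) + as.integer(any(!keep))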

antagomir commented 1 year ago

Also some additional unit tests will be warranted here.

ake123 commented 1 year ago

One extra thing I noticed about mergeFeaturesByPrevalence is that it reports the total row number rather than the actual values. In the above example it shows 35, but it should only be 34.

nrow(mergeFeaturesByPrevalence(tse, rank="Family", assay.type="relabundance", detection  = 0.5/100, prevalence  = 20/100))
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
[1] 35

Also, I checked other ranks such as "Class" and "Phylum", and all the numbers are equal except for the one I mentioned above. The numbers differ for "Family" and "Order"; I still haven't found out why and keep checking possible reasons.

In the helper function .agg_for_prevalence I see the following

args[["na.rm"]] <- TRUE

This means that any missing values are ignored during the aggregation. I tested it by setting it to FALSE, and the number went down from 35 to 27.

antagomir commented 1 year ago

Not sure where the NAs come from, since the assay itself does not seem to have any NAs:

sum(is.na(assay(tse, "relabundance")))
#> 0

Usually a good strategy is to calculate manually what is needed and then see where the differences are in each deviating case.
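For example, something along these lines could locate the deviating features (a sketch using the objects defined above; manual_keep and merged are just illustrative names):

# Manual reference: Family-level groups passing the detection/prevalence thresholds
manual_keep <- names(which(rowMeans(assay(altExp(tse, "Family"), "relabundance") > 0.5/100) > 0.2))
# What the function returns at the Family level
merged <- mergeFeaturesByPrevalence(tse, rank="Family", assay.type="relabundance", detection = 0.5/100, prevalence = 20/100)
# Features kept by only one of the two approaches
setdiff(rownames(merged), manual_keep)
setdiff(manual_keep, rownames(merged))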

ake123 commented 1 year ago

Here are the cases where the numbers fluctuate. The first case is when the user passes na.rm = TRUE to mergeFeaturesByRank or na.rm = FALSE to mergeFeaturesByPrevalence:

nrow(mergeFeaturesByPrevalence(tse, rank="Family", assay.type="relabundance", detection = 0.5/100, prevalence = 20/100, na.rm=FALSE))
[1] 27

altExp(tse, "Family") <- mergeFeaturesByRank(tse, rank="Family", na.rm = TRUE)
sum(rowMeans(assay(altExp(tse, "Family"), "relabundance") > 0.5/100) > 0.2)
[1] 19

altExp(tse, "Family2") <- mergeFeatures(tse, f="Family", na.rm = FALSE)
sum(rowMeans(assay(altExp(tse, "Family2"), "relabundance") > 0.5/100) > 0.2)
[1] 19

The second case is when NA is removed from empty.fields in mergeFeaturesByRank with signature c(x = "SummarizedExperiment"):

empty.fields = c(NA, "", " ", "\t", "-", "_")
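As a rough illustration (a sketch; this just reuses the default vector shown above), one can count how many Family annotations would be treated as empty:

empty.fields <- c(NA, "", " ", "\t", "-", "_")
# %in% also matches NA, so this counts both NA and placeholder annotations
sum(rowData(tse)$Family %in% empty.fields)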

@TuomasBorman, what are your ideas and suggestions?

antagomir commented 1 year ago

This is a critical, high-importance issue to sort out. I suggest that @TuomasBorman prioritizes reviewing this one with Jeba asap.

TuomasBorman commented 1 year ago

Great news, mergeFeaturesByPrevalence is working as expected.

Prepare data

library(mia)
data(GlobalPatterns, package="mia")
tse <- GlobalPatterns
tse <- transformAssay(tse, assay.type="counts", method="relabundance")

Calculate number of rows after merging with mergeFeaturesByPrevalence

tse1 <- mergeFeaturesByPrevalence(tse, rank="Family", assay.type="relabundance", detection  = 0.5/100, prevalence  = 20/100)
nrow(tse1)
#> 35

When we used mergeFeaturesByRank + mergeFeaturesByPrevalence, we noticed that the number of rows does not match the original one (35 vs 2). However, mergeFeaturesByPrevalence also has a rank parameter whose default value is the highest rank, in this case Kingdom.

altExp(tse, "Family") <- mergeFeaturesByRank(tse, rank="Family")
tse2 <- mergeFeaturesByPrevalence(altExp(tse, "Family"), assay.type="relabundance", detection  = 0.5/100, prevalence  = 20/100, rank = "Family")
nrow(tse2)
#> 35

We can also see that assays match between these two objects.

# Check assays between tse1 and tse2 --> should be the same
all( assay(tse1) == assay(tse2) )

--> mergeFeaturesByRank + mergeFeaturesByPrevalence is working as expected. The result is the same.

But then we also noticed that there is a problem with mergeFeatures + mergeFeaturesByPrevalence.

altExp(tse, "Family2") <- mergeFeatures(tse, f="Family")

tse3 <- mergeFeaturesByPrevalence(altExp(tse, "Family2"), assay.type="relabundance", detection  = 0.5/100, prevalence  = 20/100, rank = "Family")
nrow(tse3)
#> 36

If we do the same trick as we did to solve the first issue, the number of rows is 36. There is one additional feature. Why?

mergeFeatures works differently from mergeFeaturesByRank. The latter finds the first available higher rank to agglomerate a feature, whereas mergeFeatures just finds all the unique groups at the Family level and merges into these groups. Moreover, it completely drops the NA group (features that have an NA value at the Family level).
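A quick way to quantify this (a sketch using the objects defined above):

# How many features lack a Family annotation, i.e. would end up in the NA group
sum(is.na(rowData(tse)$Family))
# Rank-based agglomeration keeps these under higher-rank labels while group-based
# merging drops them, which shows up as a difference in the row name sets
head(setdiff(rownames(altExp(tse, "Family")), rownames(altExp(tse, "Family2"))))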

# The number of rows differs. If a feature is NA at the Family level, it is completely dropped.
altExp(tse, "Family")
altExp(tse, "Family2")

This is why these sums differ:

sum(rowMeans(assay(altExp(tse, "Family"), "relabundance") > 0.5/100) > 0.2)
#> 26

sum(rowMeans(assay(altExp(tse, "Family2"), "relabundance") > 0.5/100) > 0.2)
#> 19

But when we agglomerate to a level that has no NA values, such as Phylum in this case, the result is the same.

sum(rowMeans(assay(agglomerateByRank(tse, rank="Phylum"), "relabundance") > 0.5/100) > 0.2)
#> 10

sum(rowMeans(assay(mergeFeatures(tse, f="Phylum"), "relabundance") > 0.5/100) > 0.2)
#> 10

Things to discuss:

  1. Should the default value of rank in mergeFeaturesByPrevalence be the highest rank? I think it should be no taxonomy rank agglomeration, i.e. NULL.
  2. Should mergeFeatures also return the NA group (the features that have no grouping value)? I think there should be an na.rm parameter, because the user might want to exclude features with no group. The default could be that the NA group is included (see the sketch below).
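A rough sketch of the behavior proposed in point 2 (na.rm is not an existing mergeFeatures argument here, and FamilyNA / "Unknown_Family" are just illustrative names): recode the missing annotations into an explicit group so that it survives the merge.

# Keep features with a missing Family annotation as their own explicit group
rowData(tse)$FamilyNA <- ifelse(is.na(rowData(tse)$Family),
                                "Unknown_Family",
                                as.character(rowData(tse)$Family))
altExp(tse, "Family_na_kept") <- mergeFeatures(tse, f = "FamilyNA")
# Expected: one more row than altExp(tse, "Family2")
nrow(altExp(tse, "Family_na_kept"))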
antagomir commented 1 year ago

  1. I agree that the default should be no agglomeration. Otherwise this can lead to confusing results.

  2. Agreed.

Note: mergeFeaturesByRank also has the argument onRankOnly, which limits agglomeration to the specified rank only (not looking at higher-level ranks in the case of NAs). If this is FALSE, then the results won't be comparable to mergeFeatures in the general case, even if the NA group is added to mergeFeatures:

altExp(tse, "Family3") <- mergeFeaturesByRank(tse, rank="Family", onRankOnly=TRUE)
altExp(tse, "Family3b") <- mergeFeaturesByRank(tse, rank="Family", onRankOnly=FALSE)
c(nrow(altExp(tse, "Family2")),
  nrow(altExp(tse, "Family3")),
  nrow(altExp(tse, "Family3b")))
# [1] 334 334 603

By default, onRankOnly=FALSE. This is potentially useful since it keeps potentially valuable data in, and the user can make a conscious decision to discard it by setting this to TRUE. On the other hand, the default behavior is then something other than what the function name says (because it also returns features other than those at the specified rank), and it leads to results that differ from those obtained by other similar means (e.g. mergeFeatures).
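For instance, the rows that appear only with onRankOnly=FALSE can be listed like this (a sketch; presumably these are the higher-rank labels used for features without a Family annotation):

extra <- setdiff(rownames(altExp(tse, "Family3b")), rownames(altExp(tse, "Family3")))
length(extra)
head(extra)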

Therefore, one more thing to discuss:

  1. Should we change the default to onRankOnly=TRUE? I tend to think that this would be useful for consistency reasons. The examples could be developed to make clear that both options are available.
antagomir commented 1 year ago

Yet another observation.

The order of rows is returned differently in mergeFeatures (Family2) and mergeFeaturesByRank (Family3), although the data is otherwise the same:

Here, rownames do not match but the same rownames are in both:

> all(rownames(altExp(tse, "Family2")) == rownames(altExp(tse, "Family3")))
[1] FALSE
> all(rownames(altExp(tse, "Family2")) %in% rownames(altExp(tse, "Family3")))
[1] TRUE
> all(rownames(altExp(tse, "Family3")) %in% rownames(altExp(tse, "Family2")))
[1] TRUE

Also the assay values match when we make sure the rows are in the same order:

> all(assay(altExp(tse, "Family2")) == assay(altExp(tse, "Family3")))
[1] FALSE

> o <- rownames(altExp(tse, "Family2"))  # a common row order for both objects
> all(assay(altExp(tse, "Family2")[o,]) == assay(altExp(tse, "Family3")[o,]))
[1] TRUE

The results from mergeFeatures are in alphabetical order:

all(rownames(altExp(tse, "Family2")) == sort(rownames(altExp(tse, "Family2"))))
[1] TRUE

The results from mergeFeaturesByRank are in the same order as in the original rowData, when the NA group is removed:

all(rownames(altExp(tse, "Family3")) == unique(rowData(tse)$Family)[-1])
[1] TRUE

It would be beneficial to harmonize the sorting logic.

The additional thing to discuss:

  1. Shall we order alphabetically or respect the rowData order? The latter would feel more natural, but it might have some issues when onRankOnly=FALSE, for instance. On the other hand, since we do not have clear examples of any real issues with that, it could be possible to just use the rowData order in these cases and reconsider later should the need arise.
  2. Shall we review whether some other functions have similar sorting issues? I tend to think it would be good to check them in the same go, to maintain internal consistency.
antagomir commented 10 months ago

Up

Daenarys8 commented 9 months ago

If I gather correctly, there are about four indicated changes to be made.

TuomasBorman commented 9 months ago

Yes, and then please also check whether other functions have similar sorting issues.

antagomir commented 1 month ago

Status of this @Daenarys8 ?

TuomasBorman commented 1 month ago

This was fixed