tobiasgf / lulu

r package for post-clustering curation of amplicon next generation sequencing data (metabarcoding)
GNU Lesser General Public License v3.0
63 stars 17 forks source link

potential parents with a relative cooccurence < 1.0 are never accepted #8

Open frederic-mahe opened 4 years ago

frederic-mahe commented 4 years ago

Dear @tobiasgf

I think there is a undesired behavior with minimum_ratio when minimum_ratio_type == "min", and to some degree when minimum_ratio_type == "avg". It happens in the code below:

          if (relative_cooccurence >= minimum_relative_cooccurence) {
            cat(paste0(" which is sufficient!"), file = log_con)
            if (minimum_ratio_type == "avg") {
              relative_abundance <-
                mean(otutable[line2, ][daughter_samples > 0]/
                       daughter_samples[daughter_samples > 0])
              cat(paste0("\n", "------mean avg abundance: ",
                         relative_abundance), file = log_con)
            } else {
              relative_abundance <-
                min(otutable[line2, ][daughter_samples > 0]/
                      daughter_samples[daughter_samples > 0])
              cat(paste0("\n", "------min avg abundance: ",
                         relative_abundance), file = log_con)
            }

When minimum_relative_cooccurence is smaller than 1.0 (default is 0.95), the potential parent OTU can be absent from some samples where the daughter OTU is present. A missing parent means that the ratio for that particular sample will be zero. This is were the issue is, the way minimum_ratio is computed, it will identify the samples where the father is missing as the ones with the smallest ratio (zero). This is below minimum_ratio and the father OTU is rejected, despite using minimum_relative_cooccurence.

To solve that, the minimum_ratio should be searched among the non-null ratio values, not all ratio values.

When minimum_ratio_type == "avg", I think the average should be calculated on non-zeros ratio values too.

frederic-mahe commented 1 year ago

From the README:

If the number of samples where the 'potential parent' has a positive presence is below the minimum_relative_cooccurence (default 95%, meaning that 1 in 20 samples are allowed to have no parent presence), the 'potential parent' is rejected.

Contrary to what is indicated in the README, it seems that when using the default minimum_ratio_type = min, cases when the parent is missing in one or more samples are always rejected by lulu (no merging). A parent missing in one sample means that the minimum_ratio value is null, which leads to rejection. In consequence, changing the minimum_relative_cooccurence` value has no effect.

I wrote a toy-example to demonstrate this behavior:

library(lulu)
library(dplyr)  # lulu requires dplyr

matchlist_name <- data.frame(x = "B" , y = "A", z = 99.0)
otutable_name <- data.frame(
    row.names = c("A", "B"),
    s01 = c(0, 1), # <= 'B' present, 'A' absent
    s02 = c(9, 1),
    s03 = c(9, 1),
    s04 = c(9, 1),
    s05 = c(9, 1),
    s06 = c(9, 1),
    s07 = c(9, 1),
    s08 = c(9, 1),
    s09 = c(9, 1),
    s10 = c(9, 1),
    s11 = c(9, 1),
    s12 = c(9, 1),
    s13 = c(9, 1),
    s14 = c(9, 1),
    s15 = c(9, 1),
    s16 = c(9, 1),
    s17 = c(9, 1),
    s18 = c(9, 1),
    s19 = c(9, 1),
    s20 = c(9, 1),
    s21 = c(9, 1),
    s22 = c(9, 0)) # <= 'A' present, 'B' absent

## Note that parent 'A' is absent in one sample where 'B is present',
## and present in one sample where 'B' is absent (same total
## spread). The relative co-occurence is 20 / 21 = 0.95238, which is
## greater than 0.95, the default threshold value.

## bug: no merging with default parameters
## (minimum_ratio = 0)
lulu::lulu(otutable_name, matchlist_name)

## with minimum_ratio_type = "avg": merging as expected
## (minimum_ratio (average) is 8.57)
lulu::lulu(otutable_name, matchlist_name, minimum_ratio_type = "avg")

quit(save = "no")