ntdyjack / fasthplus

Creative Commons Attribution 4.0 International
3 stars 0 forks source link

Problems in hpb() when one or more clusters are very small #5

Open lcolladotor opened 2 years ago

lcolladotor commented 2 years ago

Hi,

We ran into an scenario where a cluster is small such that https://github.com/ntdyjack/fasthplus/blob/main/R/bsf.R#L20 fails. That's because the small cluster is smaller than the t divided by the number of clusters. That is, t / length(q).

We can avoid this by choosing t with code like this:

find_t <- function(L, proportion = 0.05) {
    initial_t <- floor(length(L) * proportion)
    smallest_cluster_size <- min(table(L))
    n_labels <- length(unique(L))
    ifelse(smallest_cluster_size > (initial_t / n_labels), initial_t, smallest_cluster_size * n_labels)
}

Otherwise, when t / length(q) is larger than the smallest cluster size, then you run into an error with sample() because replace = FALSE by default. Alternatively, https://github.com/ntdyjack/fasthplus/blob/736494f1534ca04db14775210fd1bcbfdfe7864e/R/bsf.R#L20 could be changed to have sample(replace = TRUE) but I don't know if this would break your bootstrap assumptions.

find_t <- function(L, proportion = 0.05) {
    initial_t <- floor(length(L) * proportion)
    smallest_cluster_size <- min(table(L))
    n_labels <- length(unique(L))
    ifelse(smallest_cluster_size > (initial_t / n_labels), initial_t, smallest_cluster_size * n_labels)
}

set.seed(20220308)
n_labels <- 7
L <- sample(letters[seq_len(n_labels)], 1e4, replace = TRUE)
min(table(L))
#> [1] 1382
find_t(L, 1)
#> [1] 9674
find_t(L, (find_t(L, 1) + n_labels)/1e4)
#> [1] 9674
find_t(L, (find_t(L, 1) - n_labels)/1e4)
#> [1] 9667
find_t(L, 0.05)
#> [1] 500

## Small reprex to force the error
label_list <- split(seq_len(length(L)), L)
q <- label_list
t <- find_t(L, 1) + n_labels
p <- as.vector(sapply(q, function(x) sample(x,round(t/length(q)))))
#> Error in sample.int(length(x), size, replace, prob): cannot take a sample larger than the population when 'replace = FALSE'
traceback()
#> No traceback available

Created on 2022-03-08 by the reprex package (v2.0.1)

EDIT start

Looks like reprex doesn't include the traceback() output. Here it is:

Error in sample.int(length(x), size, replace, prob) : 
  cannot take a sample larger than the population when 'replace = FALSE'
> traceback()
6: sample.int(length(x), size, replace, prob)
5: sample(x, round(t/length(q))) at #1
4: FUN(X[[i]], ...)
3: lapply(X = X, FUN = FUN, ...)
2: sapply(q, function(x) sample(x, round(t/length(q))))
1: as.vector(sapply(q, function(x) sample(x, round(t/length(q)))))

EDIT end

The problem with the above solution, is that it could result in very low values of t when for example you have clusters that are very small compared to others. Abby @abspangler13 has a case where we have 15 clusters and the smallest one has a size of 8. So the smallest t would then be 15 * 8 = 120 as shown below, which would be very small.

> find_t(L=colData(spe)[[paste0("bayesSpace_harmony_",k)]],proportion = 0.01)
[1] 120
> sort(table(colData(spe)[[paste0("bayesSpace_harmony_",k)]]))

    6    12    14     1    11    15    13     2    10     9     5     8     3 
    8   167   519  1315  3501  3579  4104  4851  4998  9333  9492 11687 13310 
    7     4 
14535 18175 

Best, Leo

lcolladotor commented 2 years ago

In the above case, we are considering simply dropping that tiny cluster so we can use a larger t without any changes to fasthplus.