plger / scDblFinder

Methods for detecting doublets in single-cell sequencing data
https://plger.github.io/scDblFinder/
GNU General Public License v3.0

scDblFinder.sample is different from the sample column specified in the `samples` argument of the `scDblFinder` function? #66

Closed Yunuuuu closed 1 year ago

Yunuuuu commented 1 year ago

Hi, I ran scDblFinder in "split" sample mode to detect doublets with the following code (since the data is large, I only provide the code):

set.seed(221113L)
sce_qc <- scDblFinder::scDblFinder(
    sce_raw[, !sce_raw$low_lib_size],
    clusters = TRUE, dims = 50L, 
    samples = "Sample", multiSampleMode = "split",
    returnType = "sce"
)

When I check the results, the scDblFinder.sample column seems strange:

data.frame(colData(sce_qc)) %>%
    dplyr::select(Sample, scDblFinder.sample) %>% 
    dplyr::filter(Sample != scDblFinder.sample)
# here is some output
                   Sample scDblFinder.sample
AAACCCAAGCCTCTCT-1    B4T              B16T2
AAACCCAAGTGTAGAT-1    B4T                B1T
AAACGCTGTGTATTGC-1    B4T              B14T2
AAAGTGAGTAGATCGG-1    B4T               B16U
AACAAAGGTGGATCGA-1    B4T                B1U
AACAAGAGTCTACATG-1    B4T              B14T1
AACCAACAGGTAAACT-1    B4T                B1T
AACGGGAGTGAGATCG-1    B4T              B14T2
AAGAACATCTCTCGCA-1    B4T               B12T
AAGATAGAGCCTCATA-1    B4T                B1U
AAGATAGAGTAAGACT-1    B4T                B1T
AAGATAGCAAATGGCG-1    B4T               B16U
AAGGAATGTTGAATCC-1    B4T               B12U

I don't understand why they differ when I used "split" mode. According to the scDblFinder help page, "split" mode runs the whole process separately for each sample, so I think they should be the same. Is that right?

plger commented 1 year ago

Thanks a lot for reporting this, yes they should be the same. Fortunately the error was only in the reporting, and shouldn't have affected the doublet scores.

The error should be fixed now in the GitHub version (I would be happy if you could confirm with your dataset), and I'll push it to Bioc devel once the checks have passed.

plger commented 1 year ago

Hi @Yunuuuu , could you confirm that this solved your problem? Will close the issue if there's no answer. Pierre-Luc

Yunuuuu commented 1 year ago

Hi, I installed the latest plger/scDblFinder using pak::pkg_install and restarted R, but the problem remains:

image

Yunuuuu commented 1 year ago

I checked the source code of the scDblFinder function, which indicates that it has been modified:

image

Yunuuuu commented 1 year ago

I tried to understand the code, but I'm not familiar with the internal function: image

When samples is not NULL and returnType is "sce" or "full", the following code won't run in the scDblFinder function:

        if (returnType == "counts") {
            for (s in names(d)) d[[s]]$sample <- s
            return(do.call(cbind, d))
        }

plger commented 1 year ago

You're absolutely right, I did this too quickly... should hopefully be fixed for real in the latest push :)

plger commented 1 year ago

@Yunuuuu , hopefully everything is as expected now?

Yunuuuu commented 1 year ago

I'll try this again @plger

Yunuuuu commented 1 year ago

It remains here: image

the package GithubSHA1 is here: image

Yunuuuu commented 1 year ago

Thanks for developing this package, @plger. I'll do more tests this weekend; I can't find what's wrong for now.

plger commented 1 year ago

Hi @Yunuuuu , okay now I don't get why you're having this problem, as I can't reproduce it with my toy data. Could you share a minimal example, e.g. SCE with only count matrix and sample id, only 2-300 genes, perhaps subsampling the cells? (you can rename genes & remove other cell metadata if you're worried about the data)

Yunuuuu commented 1 year ago

Is there a way to share the RDS data?

plger commented 1 year ago

You can email it to pierre-luc.germain@hest.ethz.ch if it's <20mb, otherwise if you don't have a platform for sharing of larger files you can write me an email and I'll send you some details. Thanks!

Yunuuuu commented 1 year ago

Hi, I have uploaded it to Google Drive, and the link has been emailed to pierre-luc.germain@hest.ethz.ch. I can confirm that this data reproduces the problem. Thanks!

[R]> set.seed(221113L)
[R]> anyDuplicated(colnames(test_data))
[1] 3466
[R]> sce_qc <- scDblFinder::scDblFinder(
         test_data,
         clusters = TRUE, dims = 50L,
         nfeatures = 2000L,
         samples = "Sample",
         multiSampleMode = "split",
         returnType = "sce"
     )
There were 26 warnings (use warnings() to see them)

[R]> data.frame(colData(sce_qc)) %>%
         dplyr::select(Sample, scDblFinder.sample) %>%
         dplyr::filter(Sample != scDblFinder.sample) %>%
         head()
                      Sample scDblFinder.sample
TTTCCTCTCAACTCTT-1   sample3            sample2
GTCAAACTCCACGAAT-1   sample3            sample1
GGTTAACCAGCGCTTG-1   sample3            sample2
AGCATCATCGGCTTGG-1.1 sample3            sample1
TGGAACTGTGACAGCA-1.1 sample3            sample1

Yunuuuu commented 1 year ago

It seems the cell (column) names matter, since I have some duplicated column names. After changing the column names with colnames(test_data) <- paste0("cell_", seq_len(ncol(test_data))), the problem disappears.


[R]> colnames(test_data) <- paste0("cell_", seq_len(ncol(test_data)))
[R]> anyDuplicated(colnames(test_data)) 
[1] 0
[R]> set.seed(221113L)

[R]> sce_qc <- scDblFinder::scDblFinder(
         test_data,
         clusters = TRUE, dims = 50L,
         nfeatures = 2000L,
         samples = "Sample",
         multiSampleMode = "split",
         returnType = "sce"
     )
There were 28 warnings (use warnings() to see them)

[R]> # logNormCounts
     data.frame(colData(sce_qc)) %>%
         dplyr::select(Sample, scDblFinder.sample) %>%
         dplyr::filter(Sample != scDblFinder.sample) %>%
         head()
[1] Sample             scDblFinder.sample
<0 rows> (or 0-length row.names)

plger commented 1 year ago

Ok, thanks @Yunuuuu , that explains a lot. I'm afraid I'm going to have to throw an error msg on duplicated colnames, because I need to match the cells with the original object (to provide the full original object with added slots).
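
To illustrate why duplicated column names can scramble the reported sample, here is a minimal base-R sketch. It assumes only what the comment above describes, namely that per-cell results are mapped back to the original object by cell name. The toy vectors and the use of `match()` are illustrative assumptions and are not taken from the scDblFinder source.

```r
# Toy data: barcode "AAAC-1" occurs in both sample1 and sample3.
cells  <- c("AAAC-1", "TTTG-1", "AAAC-1", "GGGA-1")
sample <- c("sample1", "sample1", "sample3", "sample3")

# Per-cell results keyed by cell name, as they might come back
# from separate per-sample runs (assumption for illustration).
res_sample <- sample
names(res_sample) <- cells

# Name-based lookup: match() returns the FIRST position of each name,
# so the second "AAAC-1" picks up the label of the first one.
reported <- unname(res_sample[match(cells, names(res_sample))])

# Cells whose reported sample differs from their true sample.
mismatch <- which(sample != reported)
print(data.frame(cell = cells, Sample = sample, reported = reported)[mismatch, ])

# A simple guard that could be run before calling scDblFinder.
has_dup_names <- anyDuplicated(cells) > 0
```

In this toy case only the second copy of the duplicated barcode is mislabelled, which resembles the pattern in the output above, where cells from one sample are reported under another.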

Yunuuuu commented 1 year ago

@plger Thanks a lot, enforcing unique colnames has already solved this.
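
For anyone hitting the same issue who would rather keep the original barcodes recognizable than rename cells to `cell_1`, `cell_2`, and so on, one option is base R's `make.unique()`. The sketch below is only an illustration: the helper name `ensure_unique_colnames` is hypothetical, and a plain matrix stands in for the SingleCellExperiment object.

```r
# Hypothetical helper: make column names unique while keeping the barcodes.
ensure_unique_colnames <- function(x) {
  cn <- colnames(x)
  if (is.null(cn)) {
    stop("object has no column names")
  }
  n_dup <- sum(duplicated(cn))
  if (n_dup > 0) {
    message(n_dup, " duplicated column name(s) found; appending suffixes")
    # make.unique() appends ".1", ".2", ... to repeated names only.
    colnames(x) <- make.unique(cn)
  }
  x
}

# Toy count matrix standing in for the real object.
m <- matrix(1:8, nrow = 2,
            dimnames = list(c("g1", "g2"),
                            c("AAAC-1", "TTTG-1", "AAAC-1", "GGGA-1")))
m <- ensure_unique_colnames(m)
colnames(m)
```

The helper relies on the same `colnames(x) <- ...` assignment that the workaround earlier in this thread applies to the SCE object, so it should carry over, though that has not been tested here.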