Suggestion: Tweaked or extended comparison to `DropletUtils::hashedDrops()`, perhaps in OSCA?

PeteHaitch commented 2 years ago

Thank you for adding the comparison to DropletUtils::hashedDrops() in the vignette. Something I've found with applying hashedDrops() to a number of datasets now is that, unsurprisingly, tweaking the parameters can substantially improve the results, especially for 'low quality' HTOs. In particular, reducing the confident.min parameter seems to often help for these datasets.

This can also be seen in a tweaked version of what you have in the vignette (I also added a comparison of each method's results to the ground truth). Upon changing min.confident = 2 (default) to min.confident = 1, the 'uncertain' droplets from hashedDrops() are all almost all correctly re-assigned as HTO_3.

suppressPackageStartupMessages(library(demuxmix))
suppressPackageStartupMessages(library(DropletUtils))
suppressPackageStartupMessages(library(reshape2))
suppressPackageStartupMessages(library(ggplot2))

# Simulate data
set.seed(5636)
class <- rbind(
  c(rep( TRUE, 200), rep(FALSE, 200), rep(FALSE, 200), rep( TRUE, 50)),
  c(rep(FALSE, 200), rep( TRUE, 200), rep(FALSE, 200), rep( TRUE, 50)),
  c(rep(FALSE, 200), rep(FALSE, 200), rep( TRUE, 200), rep(FALSE, 50))
)
simdata <- dmmSimulateHto(
  class = class,
  mu = c(600, 400, 200),
  theta = c(25, 15, 25),
  muAmbient = c(30, 30, 60),
  thetaAmbient = c(20, 10, 5),
  muRna = 3000,
  thetaRna = 30
)
hto <- simdata$hto
rna <- simdata$rna

# Default demuxmix
dmm <- demuxmix(hto, rna = rna)
classLabels <- dmmClassify(dmm)
dmux <- classLabels$HTO
dmux[classLabels$Type == "multiplet"] <- "multiplet"

# Default hashedDrops()
hd <- hashedDrops(hto)
hdrops <- rownames(hto)[hd$Best]
hdrops[!hd$Confident] <- "uncertain"
hdrops[hd$Doublet] <- "multiplet"

# Tweaked hashedDrops()
hd_tweaked <- hashedDrops(hto, confident.min = 1)
hdrops_tweaked <- rownames(hto)[hd_tweaked$Best]
hdrops_tweaked[!hd_tweaked$Confident] <- "uncertain"
hdrops_tweaked[hd_tweaked$Doublet] <- "multiplet"

# Comparisons
comp <- melt(as.matrix(table(dmux, hdrops)))
colnames(comp) <- c("demuxmix", "hashedDrops", "Count")
comp$color <- ifelse(comp$Count > 100, "black", "white")
ggplot(comp, aes(x = demuxmix, y = hashedDrops, fill = Count)) +
  geom_tile() +
  scale_fill_viridis_c() +
  geom_text(aes(label = Count), col = comp$color, size = 5)


comp <- melt(as.matrix(table(dmux, hdrops_tweaked)))
colnames(comp) <- c("demuxmix", "hashedDropsTweaked", "Count")
comp$color <- ifelse(comp$Count > 100, "black", "white")
ggplot(comp, aes(x = demuxmix, y = hashedDropsTweaked, fill = Count)) +
  geom_tile() +
  scale_fill_viridis_c() +
  geom_text(aes(label = Count), col = comp$color, size = 5)


comp <- melt(as.matrix(table(dmux, simdata$groundTruth)))
colnames(comp) <- c("demuxmix", "groundTruth", "Count")
comp$color <- ifelse(comp$Count > 100, "black", "white")
ggplot(comp, aes(x = demuxmix, y = groundTruth, fill = Count)) +
  geom_tile() +
  scale_fill_viridis_c() +
  geom_text(aes(label = Count), col = comp$color, size = 5)


comp <- melt(as.matrix(table(hdrops, simdata$groundTruth)))
colnames(comp) <- c("hashedDrops", "groundTruth", "Count")
comp$color <- ifelse(comp$Count > 100, "black", "white")
ggplot(comp, aes(x = hashedDrops, y = groundTruth, fill = Count)) +
  geom_tile() +
  scale_fill_viridis_c() +
  geom_text(aes(label = Count), col = comp$color, size = 5)


comp <- melt(as.matrix(table(hdrops_tweaked, simdata$groundTruth)))
colnames(comp) <- c("hashedDropsTweaked", "groundTruth", "Count")
comp$color <- ifelse(comp$Count > 100, "black", "white")
ggplot(comp, aes(x = hashedDropsTweaked, y = groundTruth, fill = Count)) +
  geom_tile() +
  scale_fill_viridis_c() +
  geom_text(aes(label = Count), col = comp$color, size = 5)

^{Created on 2022-09-05 with reprex v2.0.2}

Session info

``` r sessioninfo::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 4.2.1 (2022-06-23) #> os macOS Big Sur ... 10.16 #> system x86_64, darwin17.0 #> ui X11 #> language (EN) #> collate en_AU.UTF-8 #> ctype en_AU.UTF-8 #> tz Australia/Melbourne #> date 2022-09-05 #> pandoc 2.18 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/tools/ (via rmarkdown) #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date (UTC) lib source #> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.2.0) #> beachmat 2.13.4 2022-06-21 [1] Bioconductor #> Biobase * 2.57.1 2022-05-19 [1] Bioconductor #> BiocGenerics * 0.43.1 2022-07-26 [1] Bioconductor #> BiocParallel 1.31.12 2022-08-03 [1] Bioconductor #> bitops 1.0-7 2021-04-24 [1] CRAN (R 4.2.0) #> cli 3.3.0 2022-04-25 [1] CRAN (R 4.2.0) #> codetools 0.2-18 2020-11-04 [1] CRAN (R 4.2.1) #> colorspace 2.0-3 2022-02-21 [1] CRAN (R 4.2.0) #> curl 4.3.2 2021-06-23 [1] CRAN (R 4.2.0) #> DBI 1.1.3 2022-06-18 [1] CRAN (R 4.2.0) #> DelayedArray 0.23.1 2022-07-28 [1] Bioconductor #> DelayedMatrixStats 1.19.0 2022-04-26 [1] Bioconductor #> demuxmix * 0.99.5 2022-09-04 [1] Bioconductor #> digest 0.6.29 2021-12-01 [1] CRAN (R 4.2.0) #> dplyr 1.0.10 2022-09-01 [1] CRAN (R 4.2.0) #> dqrng 0.3.0 2021-05-01 [1] CRAN (R 4.2.0) #> DropletUtils * 1.17.1 2022-08-12 [1] Bioconductor #> edgeR 3.39.6 2022-08-08 [1] Bioconductor #> evaluate 0.16 2022-08-09 [1] CRAN (R 4.2.0) #> fansi 1.0.3 2022-03-24 [1] CRAN (R 4.2.0) #> farver 2.1.1 2022-07-06 [1] CRAN (R 4.2.1) #> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.2.0) #> fs 1.5.2 2021-12-08 [1] CRAN (R 4.2.0) #> generics 0.1.3 2022-07-05 [1] CRAN (R 4.2.0) #> GenomeInfoDb * 1.33.5 2022-08-10 [1] Bioconductor #> GenomeInfoDbData 1.2.8 2022-04-13 [1] Bioconductor #> GenomicRanges * 1.49.1 2022-08-18 [1] Bioconductor #> ggplot2 * 3.3.6 2022-05-03 [1] CRAN (R 4.2.0) #> glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.0) #> gridExtra 2.3 2017-09-09 [1] CRAN (R 4.2.0) #> gtable 0.3.1 2022-09-01 [1] CRAN (R 4.2.0) #> HDF5Array 1.25.2 2022-08-03 [1] Bioconductor #> highr 0.9 2021-04-16 [1] CRAN (R 4.2.0) #> htmltools 0.5.3 2022-07-18 [1] CRAN (R 4.2.0) #> httr 1.4.4 2022-08-17 [1] CRAN (R 4.2.0) #> IRanges * 2.31.2 2022-08-18 [1] Bioconductor #> knitr 1.40 2022-08-24 [1] CRAN (R 4.2.0) #> labeling 0.4.2 2020-10-20 [1] CRAN (R 4.2.0) #> lattice 0.20-45 2021-09-22 [1] CRAN (R 4.2.1) #> lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.2.0) #> limma 3.53.6 2022-08-21 [1] Bioconductor #> locfit 1.5-9.6 2022-07-11 [1] CRAN (R 4.2.0) #> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.0) #> MASS 7.3-58.1 2022-08-03 [1] CRAN (R 4.2.0) #> Matrix 1.4-1 2022-03-23 [1] CRAN (R 4.2.1) #> MatrixGenerics * 1.9.1 2022-06-24 [1] Bioconductor #> matrixStats * 0.62.0 2022-04-19 [1] CRAN (R 4.2.0) #> mime 0.12 2021-09-28 [1] CRAN (R 4.2.0) #> munsell 0.5.0 2018-06-12 [1] CRAN (R 4.2.0) #> pillar 1.8.1 2022-08-19 [1] CRAN (R 4.2.0) #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.0) #> plyr 1.8.7 2022-03-24 [1] CRAN (R 4.2.0) #> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.2.0) #> R.cache 0.16.0 2022-07-21 [1] CRAN (R 4.2.0) #> R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.2.0) #> R.oo 1.25.0 2022-06-12 [1] CRAN (R 4.2.0) #> R.utils 2.12.0 2022-06-28 [1] CRAN (R 4.2.0) #> R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.0) #> Rcpp 1.0.9 2022-07-08 [1] CRAN (R 4.2.0) #> RCurl 1.98-1.8 2022-07-30 [1] CRAN (R 4.2.0) #> reprex 2.0.2 2022-08-17 [1] CRAN (R 4.2.0) #> reshape2 * 1.4.4 2020-04-09 [1] CRAN (R 4.2.0) #> rhdf5 2.41.1 2022-06-21 [1] Bioconductor #> rhdf5filters 1.9.0 2022-04-26 [1] Bioconductor #> Rhdf5lib 1.19.2 2022-05-13 [1] Bioconductor #> rlang 1.0.5 2022-08-31 [1] CRAN (R 4.2.0) #> rmarkdown 2.16 2022-08-24 [1] CRAN (R 4.2.0) #> rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.2.0) #> S4Vectors * 0.35.3 2022-09-02 [1] Bioconductor #> scales 1.2.1 2022-08-20 [1] CRAN (R 4.2.0) #> scuttle 1.7.4 2022-08-23 [1] Bioconductor #> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.0) #> SingleCellExperiment * 1.19.0 2022-04-26 [1] Bioconductor #> sparseMatrixStats 1.9.0 2022-04-26 [1] Bioconductor #> stringi 1.7.8 2022-07-11 [1] CRAN (R 4.2.0) #> stringr 1.4.1 2022-08-20 [1] CRAN (R 4.2.0) #> styler 1.7.0 2022-03-13 [1] CRAN (R 4.2.0) #> SummarizedExperiment * 1.27.2 2022-08-23 [1] Bioconductor #> tibble 3.1.8 2022-07-22 [1] CRAN (R 4.2.0) #> tidyselect 1.1.2 2022-02-21 [1] CRAN (R 4.2.0) #> utf8 1.2.2 2021-07-24 [1] CRAN (R 4.2.0) #> vctrs 0.4.1 2022-04-13 [1] CRAN (R 4.2.0) #> viridisLite 0.4.1 2022-08-22 [1] CRAN (R 4.2.0) #> withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.0) #> xfun 0.32 2022-08-10 [1] CRAN (R 4.2.0) #> xml2 1.3.3 2021-11-30 [1] CRAN (R 4.2.0) #> XVector 0.37.1 2022-08-25 [1] Bioconductor #> yaml 2.3.5 2022-02-21 [1] CRAN (R 4.2.0) #> zlibbioc 1.43.0 2022-04-26 [1] Bioconductor #> #> [1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library #> #> ────────────────────────────────────────────────────────────────────────────── ```

I think using the default value, as you've done in the vignette, is reasonable and fair. I guess you could add a small note to the vignette, but I can see how comparisons to other methods soon get beyond the scope of a vignette. But it could be a great fit for the 'Orchestrating Single Cell Analysis with Bioconductor' (OSCA), in the 'Demultiplexing cell hashes' section. It would also be a good way to promote demuxmix to new users. I'm one of the maintainer's of the OSCA book (although not that particular section) and could probably help facilitate this if it's of interest.

huklein commented 2 years ago

Interesting! I have not explored the different parameters of hashedDrops(). A few general comments on the comparison in the vignette: (i) Obviously, it cannot be concluded that one method works better than the other based on one simulated toy dataset. I wrote at the end that the dataset was simulated according to demuxmix's probabilistic model, and it is not surprising that demuxmix performs better on this dataset. The primary purpose of this section is to mention hashedDrops() as an available alternative and to show that the results from both methods are mainly consistent. (ii) I agree that sticking with the default parameters is probably the best approach when comparing methods. Tuning parameters can be challenging without ground truth, and many users will simply use the default parameters. In my opinion, it is a critical feature of a method that it works well on many datasets with the default settings. (iii) I was curious and ran a benchmark on two real datasets with ground truth, and hashedDrops() achieved a low error rate (only slightly more errors than demuxmix) but removed more droplets as uncertain to achieve the low error rate. When the HTO quality was very good, the differences between the methods were small. I hope I will have time to write this up in more detail soon.

Regarding the OSCA online book, I think that would be great! The chapter could feature both methods on the real cell line mixture dataset, which I left out in the demuxmix vignette. If the other book co-authors agree that adding demuxmix makes sense, I am happy to help if there is anything I can contribute.

PeteHaitch commented 2 years ago

Regarding the OSCA online book, I think that would be great! The chapter could feature both methods on the real cell line mixture dataset, which I left out in the demuxmix vignette. If the other book co-authors agree that adding demuxmix makes sense, I am happy to help if there is anything I can contribute.

That would be great! I suggest opening an issue at https://github.com/OSCA-source/OSCA.advanced where we can draft out a plan and bring in the other authors (mainly Aaron Lun who has written most of the book and developed DropletUtils::hashedDrops()) for consultation.

huklein / demuxmix

Suggestion: Tweaked or extended comparison to `DropletUtils::hashedDrops()`, perhaps in OSCA? #5