hansenlab / bsseq

Devel repository for bsseq
35 stars 25 forks source link

BSseq constructor extremely slow #71

Open scottgigante opened 6 years ago

scottgigante commented 6 years ago

I am trying to construct a BSseq object from four data frames with 18519004, 15324125, 17354678 and 17382658 rows respectively (mouse, whole genome CpG methylation.) So far the constructor has been running for 30 hours on a CentOS 7 machine with 8 cores and 64GB of memory. Is this expected?

I construct the BSseq object with the following code:

library(tidyverse)
library(dmrseq)
library(bsseq)

make_bs <- function(df, sampleName) {
  M <- matrix(df$M, ncol = 1)
  colnames(M) <- sampleName
  Cov <- matrix(df$Cov, ncol = 1)
  colnames(Cov) <- sampleName
  BSseq(chr = df$chr, 
        pos = df$start,
        Cov = Cov,
        M = M,
        sampleNames = sampleName)
}

bs <- make_bs(df1, "1") %>%
  bsseq::combine(make_bs(df2, "2")) %>%
  bsseq::combine(make_bs(df3, "3")) %>%
  bsseq::combine(make_bs(df4, "4"))

Here's my sessionInfo:

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /stornext/System/data/apps/R/R-3.5.0/lib64/R/lib/libRblas.so
LAPACK: /stornext/System/data/apps/R/R-3.5.0/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] dmrseq_1.0.0                bsseq_1.16.0
 [3] SummarizedExperiment_1.10.1 DelayedArray_0.6.0
 [5] BiocParallel_1.14.1         matrixStats_0.53.1
 [7] Biobase_2.40.0              GenomicRanges_1.32.3
 [9] GenomeInfoDb_1.16.0         IRanges_2.14.10
[11] S4Vectors_0.18.2            BiocGenerics_0.26.0
[13] forcats_0.3.0               stringr_1.3.1
[15] dplyr_0.7.5                 purrr_0.2.5
[17] readr_1.1.1                 tidyr_0.8.1
[19] tibble_1.4.2                ggplot2_2.2.1
[21] tidyverse_1.2.1

loaded via a namespace (and not attached):
  [1] colorspace_1.4-0              XVector_0.20.0
  [3] rstudioapi_0.7                bit64_0.9-8
  [5] interactiveDisplayBase_1.18.0 AnnotationDbi_1.42.1
  [7] lubridate_1.7.4               xml2_1.2.0
  [9] splines_3.5.0                 codetools_0.2-15
 [11] R.methodsS3_1.7.1             mnormt_1.5-5
 [13] jsonlite_1.5                  Rsamtools_1.32.0
 [15] broom_0.4.4                   R.oo_1.22.0
 [17] shiny_1.1.0                   HDF5Array_1.8.0
 [19] compiler_3.5.0                httr_1.3.1
 [21] assertthat_0.2.0              Matrix_1.2-14
 [23] lazyeval_0.2.1                limma_3.36.1
 [25] cli_1.0.0                     later_0.7.3
 [27] htmltools_0.3.6               prettyunits_1.0.2
 [29] tools_3.5.0                   bindrcpp_0.2.2
 [31] gtable_0.2.0                  glue_1.2.0
 [33] GenomeInfoDbData_1.1.0        annotatr_1.6.0
 [35] reshape2_1.4.3                doRNG_1.6.6
 [37] Rcpp_0.12.17                  bumphunter_1.22.0
 [39] cellranger_1.1.0              Biostrings_2.48.0
 [41] nlme_3.1-137                  rtracklayer_1.40.2
 [43] iterators_1.0.9               DelayedMatrixStats_1.2.0
 [45] psych_1.8.4                   rvest_0.3.2
 [47] mime_0.5                      rngtools_1.3.1
 [49] gtools_3.5.0                  XML_3.98-1.11
 [51] AnnotationHub_2.12.0          zlibbioc_1.26.0
 [53] scales_0.5.0                  BSgenome_1.48.0
 [55] BiocInstaller_1.30.0          hms_0.4.2
 [57] promises_1.0.1                rhdf5_2.24.0
 [59] RColorBrewer_1.1-2            yaml_2.1.19
 [61] memoise_1.1.0                 pkgmaker_0.27
 [63] biomaRt_2.36.1                stringi_1.2.2
 [65] RSQLite_2.1.1                 foreach_1.4.6
 [67] permute_0.9-4                 GenomicFeatures_1.32.0
 [69] bibtex_0.4.2                  rlang_0.2.1
 [71] pkgconfig_2.0.1               bitops_1.0-6
 [73] lattice_0.20-35               Rhdf5lib_1.2.1
 [75] bindr_0.1.1                   GenomicAlignments_1.16.0
 [77] bit_1.1-14                    tidyselect_0.2.4
 [79] plyr_1.8.4                    magrittr_1.5
 [81] R6_2.2.2                      DBI_1.0.0
 [83] withr_2.1.2                   pillar_1.2.3
 [85] haven_1.1.1                   foreign_0.8-70
 [87] RCurl_1.95-4.10               modelr_0.1.2
 [89] crayon_1.3.4                  progress_1.1.2
 [91] locfit_1.5-9.1                grid_3.5.0
 [93] readxl_1.1.0                  data.table_1.11.4
 [95] blob_1.1.1                    digest_0.6.15
 [97] xtable_1.8-3                  httpuv_1.4.3
 [99] regioneR_1.12.0               outliers_0.14
[101] R.utils_2.6.0                 munsell_0.4.3
[103] registry_0.5
PeteHaitch commented 6 years ago

Hi Scott,

It's bsseq::combine(), not bsseq::BSseq(), that's ridiculously slow. Still, it shouldn't be taking that long.

I've recently fixed bsseq::combine(), but it's not yet available in the release or devel branch. I'll see if I can push out a fix this week.

If it's urgent, do the following:

  1. Construct a GRanges of loci in each file (i.e. gr1, gr2, gr3, gr4)
  2. Construct a GRanges with all unique loci across your 4 samples (i.e. gr <- unique(c(gr1, gr2, gr3, gr4)).
  3. Pre-allocate M and Cov with nrow equal to the length(gr) and ncol = 4. Fill with zeros.
  4. Fill M and Cov using row indices given by findOverlaps(gr1, gr) for column 1, findOverlaps(gr2, gr) for column 2, etc.
  5. Pass gr, M, Cov to bsseq::BSseq()

That should be very fast (and is basically what the fixed method does).

scottgigante commented 6 years ago

Thanks Pete, I appreciate it!

PeteHaitch commented 6 years ago

Minor thing: as noted in help("BSseq-class", package = "bsseq"), combineList() is preferred to multiple calls to combine(). We have to re-allocate and re-fill M and Cov with each call to combine() which can be kind of sucky when the objects have different loci (whereas we can basically cbind() if all objects have same loci).

scottgigante commented 6 years ago

Thanks for that note. Because I'm scared of GRanges and more familiar with data_frame, I've written the following solution in case anyone wants it. The join operations are probably not as memory efficient as preallocating matrices, but in practice I'm just going to run it overnight anyway.

library(tidyverse)
library(bsseq)
tidy_df <- function(df, sampleName) {
  # new colnames: Cov.sampleName, M.sampleName
  Cov <- rlang::sym(paste0("Cov.",sampleName))
  M <- rlang::sym(paste0("M.",sampleName))
  df %>%
    select(chr, start, end, 
           !!Cov:=Cov, 
           !!M:=M)
}

make_bs <- function(df, sampleName) {
  # select M columns then remove prefix
  M <- df %>% 
    select(starts_with("M.")) %>%
    rename_all(function(x) gsub("M.", "", x)) %>%
    mutate_all(function(x) ifelse(is.na(x), 0, x)) %>%
    as.matrix()
  # select Cov columns then remove prefix
  Cov <- df %>% 
    select(starts_with("Cov.")) %>%
    rename_all(function(x) gsub("Cov.", "", x)) %>%
    mutate_all(function(x) ifelse(is.na(x), 0, x)) %>%
    as.matrix()
  BSseq(chr = df$chr, 
        pos = df$start,
        Cov = Cov,
        M = M,
        sampleNames = colnames(M))
}

bs <- tidy_df(df1, "1") %>%
  full_join(tidy_df(df2, "2"),
            by=c("chr", "start", "end")) %>%
  full_join(tidy_df(df3, "3"),
            by=c("chr", "start", "end")) %>%
  full_join(tidy_df(df4, "4"),
             by=c("chr", "start", "end")) %>%
  make_bs()
PeteHaitch commented 6 years ago

I've back-ported the change to combine() and combineList() to the release branch (8494740e5440759cebd4a23be0e20b5d311b2169) . Note that this won't be the exact same patch as what will eventually be available in the next BioC release owing to some other fundamental changes to the internals of bsseq.

This patched version will be available in a day or two (providing all the checks pass) as v1.16.1 in the release branch of BioC, available via BiocInstaller::biocLite("bsseq") (you could also install from GitHub using BiocInstaller::biocLite("hansenlab/bsseq@RELEASE_3_7"), although remember that GitHub != BioC). If you get the chance, would you mind trying it out to see if it helps much for your example and posting back here.

I'd also be interested to know how with the updated code your original example now performs compared to bsseq::combineList(list(make_bs(df1, "1"), make_bs(df2, "2"), make_bs(df3, "3"), make_bs(df4, "4")).

scottgigante commented 6 years ago

Thanks so much for that quick work! I've now got five tests running: my original code, the original code running with combineList, the same two on the GitHub release branch, and my dplyr implementation above. The dplyr implementation has been going for two days now so don't expect to hear from me any time soon!

scottgigante commented 6 years ago

Update: all five implementations hit walltime (four days) and were killed. I'm starting to suspect there is something else at play here.. Any ideas?

PeteHaitch commented 6 years ago

Hmm, doesn't sound right. Any chance you could share the data?

scottgigante commented 6 years ago

I'll send you an email 👍

PeteHaitch commented 6 years ago

Thanks for sharing your data, Scott.

Using bsseq v1.16.1 (the current release version) I can confirm that with your data the following takes forever (I got bored and killed it after running for 10+ hours):

bs <- BSseq(chr = df$chr,
            pos = df$start,
            Cov = Cov,
            M = M,
            sampleNames = colnames(M))

This is badly broken, thanks for the report and sorry for the inconvenience.

I think the major slowdown is because your df contain duplicated loci. The BSseq() constructor collapses these loci by summing the counts at the duplicated loci. However, the method to do this is in v1.16.x is stupidly slow/broken. I've fixed this in the devel version by re-writing this method in R and the performance is much improved:

# Development version (refactor branch)
system.time({
#>    bs <- BSseq(chr = df$chr,
#>                pos = df$start,
#>                Cov = Cov,
#>                M = M,
#>                sampleNames = colnames(M))
#>})
#> Warning message:
#> In BSseq(chr = df$chr, pos = df$start, Cov = Cov, M = M, sampleNames = colnames(M)) :
#>   Detected duplicate loci. Collapsing counts in 'M' and 'Cov' at these positions.
#>   user  system elapsed
#> 33.272  12.885  54.948

The next step will be to re-write it in C/C++, which will avoid some unnecessary memory allocations and should further speed things up.

I'm trying to figure out how easy/hard this fix is to port to the release branch and to make it available as bsseq v1.16.2. It's a little complicated by this code being part of the broader refactor of bsseq.

I'll also note that the revamped BSseq() constructor will still perform a check that the M and Cov matrices are valid, specifically that 0 <= M <= Cov < Inf. It doesn't require that either are integers, so your Nanopore data will be fine. This validity method is written in C++ and isn't going to get much faster; it will remain a deliberately unavoidable (and "slow") part of calling BSseq(). In the development version I'm adding an internal .BSseq() constructor (note the leading .) that will construct a BSseq object without performing this validity check. It's designed for when you already know your data are valid and, as such, it should be used with caution.

kasperdanielhansen commented 6 years ago

I would not add an internal function. Instead I would add an argument like checkArgs=TRUE.

Best, Kasper (Sent from my phone.)

On Jun 20, 2018, at 17:34, Peter Hickey notifications@github.com wrote:

Thanks for sharing your data, Scott.

Using bsseq v1.16.1 (the current release version) I can confirm that with your data the following takes forever (I got bored and killed it after running for 10+ hours):

bs <- BSseq(chr = df$chr, pos = df$start, Cov = Cov, M = M, sampleNames = colnames(M)) This is badly broken, thanks for the report and sorry for the inconvenience.

I think the major slowdown is because your df contain duplicated loci. The BSseq() constructor collapses these loci by summing the counts at the duplicated loci. However, the method to do this is in v1.16.x is stupidly slow/broken. I've fixed this in the devel version by re-writing this method in R and the performance is much improved:

Development version (refactor branch)

system.time({

> bs <- BSseq(chr = df$chr,

> pos = df$start,

> Cov = Cov,

> M = M,

> sampleNames = colnames(M))

>})

> Warning message:

> In BSseq(chr = df$chr, pos = df$start, Cov = Cov, M = M, sampleNames = colnames(M)) :

> Detected duplicate loci. Collapsing counts in 'M' and 'Cov' at these positions.

> user system elapsed

> 33.272 12.885 54.948

The next step will be to re-write it in C/C++, which will avoid some unnecessary memory allocations and should further speed things up.

I'm trying to figure out how easy/hard this fix is to port to the release branch and to make it available as bsseq v1.16.2. It's a little complicated by this code being part of the broader refactor of bsseq.

I'll also note that the revamped BSseq() constructor will still perform a check that the M and Cov matrices are valid, specifically that 0 <= M <= Cov < Inf. It doesn't require that either are integers, so your Nanopore data will be fine. This validity method is written in C++ and isn't going to get much faster; it will remain a deliberately unavoidable (and "slow") part of calling BSseq(). In the development version I'm adding an internal .BSseq() constructor (note the leading .) that will construct a BSseq object without performing this validity check. It's designed for when you already know your data are valid and, as such, it should be used with caution.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

PeteHaitch commented 6 years ago

I opted for the internal function .BSseq() following advice in https://bioconductor.org/packages/devel/bioc/vignettes/SummarizedExperiment/inst/doc/Extensions.html#defining-the-class-and-its-constructor. The internal constructor is useful for me when I've already got everything in the right format for a BSseq object, e.g., in read.bismark() (I'm not actually using it there yet because I still need to figure out how to disable validity check upon construction except by using S4Vectors::new2(check = FALSE)).

BSseq() is already quite complex because of the need to preserve a lot of historical behaviour. It does a lot of marshalling of its inputs before creating the BSseq object. I've managed to simplify the internals of BSseq() while preserving (the documented) historical behaviour. BSseq() is also arguably not a good constructor because calling it without any arguments does not return a valid BSseq object (see recommendation in https://bioconductor.org/packages/devel/bioc/vignettes/SummarizedExperiment/inst/doc/Extensions.html#deriving-a-class-with-custom-slots).

kasperdanielhansen commented 6 years ago

We could change the output of BSseq() to work.

On Wed, Jun 20, 2018 at 7:48 PM, Peter Hickey notifications@github.com wrote:

I opted for the internal function .BSseq() following advice in https://bioconductor.org/packages/devel/bioc/vignettes/ SummarizedExperiment/inst/doc/Extensions.html#defining-the- class-and-its-constructor). The internal constructor is useful for me when I know that I'm constructing a valid BSseq object, e.g., in read.bismark().

BSseq() is already quite complex because of the need to preserve a lot of historical behaviour. It does a lot of marshalling of its inputs before creating the BSseq object. I've managed to simplify the internals of BSseq() while preserving (the documented) historical behaviour. BSseq() is also arguably not a good constructor because calling it without any arguments does not return a valid BSseq object (see recommendation in https://bioconductor.org/packages/devel/bioc/vignettes/ SummarizedExperiment/inst/doc/Extensions.html#deriving-a- class-with-custom-slots).

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/hansenlab/bsseq/issues/71#issuecomment-398838358, or mute the thread https://github.com/notifications/unsubscribe-auth/AEuhn7mLThp5_z7G0E3bHRlV4cQIUkJwks5t-orTgaJpZM4UlOm- .

scottgigante commented 6 years ago

Thanks very much @PeteHaitch , I appreciate your help! Would it be advisable to wait until the next release for this, or should I try installing from the refactor branch?

scottgigante commented 6 years ago

Actually, never mind - I can just remove duplicates from my data frame and work from the current release.

PeteHaitch commented 6 years ago

You'll want to aggregate the duplicate loci rather than remove them (unless they really are erroneous duplicates). I suspect something went awry in your join, which led to the duplicated loci.

Installing the refactor branch is a bit of a pain in the arse. I'm swamped this week, but will take a look this weekend at porting the change to the BioC version (no promises).