kdkorthauer / dmrseq

R package for Inference of differentially methylated regions (DMRs) from bisulfite sequencing
MIT License
54 stars 14 forks source link

read.bismark doesn't work. #65

Closed desmodus1984 closed 2 months ago

desmodus1984 commented 3 months ago

Hi, I generated some input data with Bismark, with the following code: bismark_methylation_extractor -p --scaffolds --bedGraph --buffer_size 40G --genome_folder /home/juaguila/Bismark/ --cytosine_report --parallel 10 ${base}.deduplicated.bam and then, I tested if the input data could be read by bsseq.

`files <- c("V00001.deduplicated.bismark.cov.gz", "V00750.deduplicated.bismark.cov.gz", "V00796.deduplicated.bismark.cov.gz") samples <- c("V00001", "V00750", "V00796")'

'bismarkBSseq <- read.bismark(files = files, rmZeroCov = TRUE, strandCollapse = TRUE, verbose = TRUE, nThread = 10L) bismarkBSseq saveRDS(bismarkBSseq, file = "dmrseq_LH.meth.rds")`

I was told to save every output, since the code can fail so I used saveRDS, which has worked before when running MethylKit. This time, I ran the previous code, it was supposedly working , but no output was generated. Everything loaded well, but then, it didn't finish generating the matrices

[read.bismark] Parsing files and constructing valid loci ... Done in 64.9 secs [read.bismark] Parsing files and constructing 'M' and 'Cov' matrices ...

Any ideas why this is not working? I am worried because now I am just testing with 6 (six) files, when my total sample size is 44. Is my code okay? I would really appreciate if you could help figure out what is going on.

The files look okay: gzip -cd V00864.deduplicated.bismark.cov.gz | head NW_022882922.1 4856 4856 0 0 4 NW_022882922.1 4866 4866 0 0 4 NW_022882922.1 4889 4889 0 0 5 NW_022882922.1 4892 4892 0 0 5 NW_022882922.1 4904 4904 0 0 5 NW_022882922.1 4953 4953 0 0 6 NW_022882922.1 4962 4962 0 0 7 NW_022882922.1 4966 4966 0 0 6 NW_022882922.1 4976 4976 0 0 5 NW_022882922.1 4977 4977 0 0 1

I am analyzing methylation data in bumblebees and each file has about 20.M lines, though not all the sites are present in all individuals.

I would highly appreciate and will be very thankful if you could give my any hints and suggestions on how to analyze my data. I was planning once the input files were accepted to filter sites with less than 10 reads.

Thank you very much;

kdkorthauer commented 3 months ago

Hi,

As read.bismark() is not a function from dmrseq (it is from the bsseq package), I'd suggest you refer to the documentation from that package, and reach out to the respective maintainers if you're still stuck. Thanks!

desmodus1984 commented 2 months ago

Hi, I was able to figure out the issue with read.bismark().

I did a small test. When I ran 3 samples, I had not problem, but then when I made a bigger dataset, with 9 samples, I got the following error:

ccBSMK <- read.bismark(files = files,

                 rmZeroCov = TRUE,
                 strandCollapse = TRUE,
                 verbose = TRUE,
                 nThread = 1L)

[read.bismark] Parsing files and constructing valid loci ... Done in 39 secs [read.bismark] Parsing files and constructing 'M' and 'Cov' matrices ... Done in 25.1 secs [read.bismark] Constructing BSseq object ... Warning message: In .merge_two_Seqinfo_objects(x, y) : Each of the 2 combined objects has sequence levels not in the other:

in 'x': NW_022883409.1 in 'y': NW_022883010.1 Make sure to always combine/compare objects based on the same reference genome (use suppressWarnings() to suppress this warning). show(ccBSMK) An object of type 'BSseq' with 28393268 methylation loci 9 samples has not been smoothed All assays are in-memory dim(ccBSMK) [1] 28393268 9

Is this error very important/ significant?

I was able to then add the covariate

pData(ccBSMK) DataFrame with 9 rows and 1 column elev

V00001 L V00750 L V00796 L V00861 L V01425 L V01712 H V01926 H V01947 H V02055 H

testCovariate <- "elev" regions <- dmrseq(bs=ccBSMK,

          cutoff = 0.05,
          testCovariate=testCovariate)

Assuming the test covariate elev is a factor. Condition: L vs H Error in dmrseq(bs = ccBSMK, cutoff = 0.05, testCovariate = testCovariate) : 418276 loci have zero coverage in all samples of at least one condition. Please remove these loci before running dmrseq

I filtered those loci

Filter loci

loci.idx <- which(DelayedMatrixStats::rowSums2(getCoverage(ccBSMK, type="Cov")==0) == 0) bs.filtered <- ccBSMK[loci.idx,] dim(loci.idx) NULL length(loci.idx) [1] 26622731 bs.filtered An object of type 'BSseq' with 26622731 methylation loci 9 samples has not been smoothed All assays are in-memory

And then when I tried running it again, I got a locfit error.

regions <- dmrseq(bs=bs.filtered,

          cutoff = 0.05,
          testCovariate=testCovariate)

Assuming the test covariate elev is a factor. Condition: L vs H Parallelizing using 8 workers/cores (backend: BiocParallel:MulticoreParam). Computing on 1 chromosome(s) at a time.

Detecting candidate regions with coefficient larger than 0.05 in magnitude. ...Chromosome NW_022882922.1: Error in (function (cond) : error in evaluating the argument 'args' in selecting a method for function 'do.call': BiocParallel errors 6 remote errors, element index: 1, 2, 3, 4, 5, 6 0 unevaluated and other errors first remote error: Error in locfitByCluster2(idx): lazy-load database '/home/juaguila/R/x86_64-pc-linux-gnu-library/4.3/locfit/R/locfit.rdb' is corrupt

I then reinstalled locfit, and the error persisted.

regions <- dmrseq(bs=bs.filtered,

          cutoff = 0.05,
          testCovariate=testCovariate)

Assuming the test covariate elev is a factor. Condition: L vs H Parallelizing using 8 workers/cores (backend: BiocParallel:MulticoreParam). Computing on 1 chromosome(s) at a time.

Detecting candidate regions with coefficient larger than 0.05 in magnitude. ...Chromosome NW_022882922.1: Error in (function (cond) : error in evaluating the argument 'args' in selecting a method for function 'do.call': BiocParallel errors 6 remote errors, element index: 1, 2, 3, 4, 5, 6 0 unevaluated and other errors first remote error: Error in locfitByCluster2(idx): lazy-load database '/home/juaguila/R/x86_64-pc-linux-gnu-library/4.3/locfit/R/locfit.rdb' is corrupt

sessionInfo() R version 4.3.2 (2023-10-31) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 22.04.3 LTS

Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: America/New_York tzcode source: system (glibc)

attached base packages: [1] stats4 stats graphics grDevices utils datasets methods base

other attached packages: [1] BiocParallel_1.34.2 dmrseq_1.20.1 [3] bsseq_1.36.0 SummarizedExperiment_1.30.2 [5] Biobase_2.60.0 MatrixGenerics_1.12.3 [7] matrixStats_1.3.0 GenomicRanges_1.52.1 [9] GenomeInfoDb_1.36.4 IRanges_2.34.1 [11] S4Vectors_0.38.2 BiocGenerics_0.46.0

loaded via a namespace (and not attached): [1] DBI_1.2.2 bitops_1.0-7 [3] biomaRt_2.56.1 permute_0.9-7 [5] rlang_1.1.3 magrittr_2.0.3 [7] compiler_4.3.2 RSQLite_2.3.6 [9] GenomicFeatures_1.52.2 DelayedMatrixStats_1.22.6 [11] reshape2_1.4.4 png_0.1-8 [13] vctrs_0.6.5 stringr_1.5.1 [15] pkgconfig_2.0.3 crayon_1.5.2 [17] fastmap_1.1.1 dbplyr_2.5.0 [19] XVector_0.40.0 ellipsis_0.3.2 [21] utf8_1.2.4 Rsamtools_2.16.0 [23] promises_1.3.0 tzdb_0.4.0 [25] bit_4.0.5 zlibbioc_1.46.0 [27] cachem_1.0.8 progress_1.2.2 [29] blob_1.2.4 later_1.3.2 [31] rhdf5filters_1.12.1 DelayedArray_0.26.7 [33] Rhdf5lib_1.22.1 interactiveDisplayBase_1.38.0 [35] prettyunits_1.2.0 parallel_4.3.2 [37] R6_2.5.1 RColorBrewer_1.1-3 [39] stringi_1.8.3 limma_3.56.2 [41] rtracklayer_1.60.1 pkgload_1.3.4 [43] iterators_1.0.14 Rcpp_1.0.12 [45] R.utils_2.12.3 readr_2.1.5 [47] splines_4.3.2 httpuv_1.6.15 [49] Matrix_1.6-4 tidyselect_1.2.0 [51] rstudioapi_0.16.0 abind_1.4-5 [53] yaml_2.3.8 codetools_0.2-19 [55] curl_5.2.1 doRNG_1.8.6 [57] plyr_1.8.9 regioneR_1.32.0 [59] lattice_0.22-6 tibble_3.2.1 [61] shiny_1.8.1.1 KEGGREST_1.40.1 [63] BiocFileCache_2.8.0 xml2_1.3.6 [65] Biostrings_2.68.1 pillar_1.9.0 [67] BiocManager_1.30.22 filelock_1.0.3 [69] rngtools_1.5.2 foreach_1.5.2 [71] generics_0.1.3 RCurl_1.98-1.13 [73] ggplot2_3.5.0 hms_1.1.3 [75] BiocVersion_3.17.1 sparseMatrixStats_1.12.2 [77] munsell_0.5.0 scales_1.3.0 [79] bumphunter_1.42.0 gtools_3.9.5 [81] xtable_1.8-4 glue_1.7.0 [83] tools_4.3.2 AnnotationHub_3.8.0 [85] BiocIO_1.10.0 data.table_1.15.4 [87] BSgenome_1.68.0 locfit_1.5-9.9 [89] GenomicAlignments_1.36.0 XML_3.99-0.16 [91] rhdf5_2.44.0 grid_4.3.2 [93] AnnotationDbi_1.62.2 colorspace_2.1-0 [95] nlme_3.1-164 GenomeInfoDbData_1.2.10 [97] HDF5Array_1.28.1 restfulr_0.0.15 [99] annotatr_1.26.0 cli_3.6.2 [101] rappdirs_0.3.3 fansi_1.0.6 [103] S4Arrays_1.0.6 dplyr_1.1.4 [105] gtable_0.3.4 outliers_0.15 [107] R.methodsS3_1.8.2 digest_0.6.35 [109] rjson_0.2.21 memoise_2.0.1 [111] htmltools_0.5.8.1 R.oo_1.26.0 [113] lifecycle_1.0.4 httr_1.4.7 [115] mime_0.12 bit64_4.0.5

BiocManager::valid() 'getOption("repos")' replaces Bioconductor standard repositories, see 'help("repositories", package = "BiocManager")' for details. Replacement repositories: CRAN: https://cloud.r-project.org/

sessionInfo() R version 4.3.2 (2023-10-31) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 22.04.3 LTS

Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: America/New_York tzcode source: system (glibc)

attached base packages: [1] stats4 stats graphics grDevices utils datasets methods base

other attached packages: [1] BiocParallel_1.34.2 dmrseq_1.20.1 [3] bsseq_1.36.0 SummarizedExperiment_1.30.2 [5] Biobase_2.60.0 MatrixGenerics_1.12.3 [7] matrixStats_1.3.0 GenomicRanges_1.52.1 [9] GenomeInfoDb_1.36.4 IRanges_2.34.1 [11] S4Vectors_0.38.2 BiocGenerics_0.46.0

loaded via a namespace (and not attached): [1] DBI_1.2.2 bitops_1.0-7 [3] biomaRt_2.56.1 permute_0.9-7 [5] rlang_1.1.3 magrittr_2.0.3 [7] compiler_4.3.2 RSQLite_2.3.6 [9] GenomicFeatures_1.52.2 DelayedMatrixStats_1.22.6 [11] reshape2_1.4.4 png_0.1-8 [13] vctrs_0.6.5 stringr_1.5.1 [15] pkgconfig_2.0.3 crayon_1.5.2 [17] fastmap_1.1.1 dbplyr_2.5.0 [19] XVector_0.40.0 ellipsis_0.3.2 [21] utf8_1.2.4 Rsamtools_2.16.0 [23] promises_1.3.0 tzdb_0.4.0 [25] bit_4.0.5 zlibbioc_1.46.0 [27] cachem_1.0.8 progress_1.2.2 [29] blob_1.2.4 later_1.3.2 [31] rhdf5filters_1.12.1 DelayedArray_0.26.7 [33] Rhdf5lib_1.22.1 interactiveDisplayBase_1.38.0 [35] prettyunits_1.2.0 parallel_4.3.2 [37] R6_2.5.1 RColorBrewer_1.1-3 [39] stringi_1.8.3 limma_3.56.2 [41] rtracklayer_1.60.1 pkgload_1.3.4 [43] iterators_1.0.14 Rcpp_1.0.12 [45] R.utils_2.12.3 readr_2.1.5 [47] splines_4.3.2 httpuv_1.6.15 [49] Matrix_1.6-4 tidyselect_1.2.0 [51] rstudioapi_0.16.0 abind_1.4-5 [53] yaml_2.3.8 codetools_0.2-19 [55] curl_5.2.1 doRNG_1.8.6 [57] plyr_1.8.9 regioneR_1.32.0 [59] lattice_0.22-6 tibble_3.2.1 [61] shiny_1.8.1.1 KEGGREST_1.40.1 [63] BiocFileCache_2.8.0 xml2_1.3.6 [65] Biostrings_2.68.1 pillar_1.9.0 [67] BiocManager_1.30.22 filelock_1.0.3 [69] rngtools_1.5.2 foreach_1.5.2 [71] generics_0.1.3 RCurl_1.98-1.13 [73] ggplot2_3.5.0 hms_1.1.3 [75] BiocVersion_3.17.1 sparseMatrixStats_1.12.2 [77] munsell_0.5.0 scales_1.3.0 [79] bumphunter_1.42.0 gtools_3.9.5 [81] xtable_1.8-4 glue_1.7.0 [83] tools_4.3.2 AnnotationHub_3.8.0 [85] BiocIO_1.10.0 data.table_1.15.4 [87] BSgenome_1.68.0 locfit_1.5-9.9 [89] GenomicAlignments_1.36.0 XML_3.99-0.16 [91] rhdf5_2.44.0 grid_4.3.2 [93] AnnotationDbi_1.62.2 colorspace_2.1-0 [95] nlme_3.1-164 GenomeInfoDbData_1.2.10 [97] HDF5Array_1.28.1 restfulr_0.0.15 [99] annotatr_1.26.0 cli_3.6.2 [101] rappdirs_0.3.3 fansi_1.0.6 [103] S4Arrays_1.0.6 dplyr_1.1.4 [105] gtable_0.3.4 outliers_0.15 [107] R.methodsS3_1.8.2 digest_0.6.35 [109] rjson_0.2.21 memoise_2.0.1 [111] htmltools_0.5.8.1 R.oo_1.26.0 [113] lifecycle_1.0.4 httr_1.4.7 [115] mime_0.12 bit64_4.0.5

Bioconductor version '3.17'

142 packages out-of-date 0 packages too new create a valid installation with

BiocManager::install(c( "bbmle", "bdsmatrix", "BH", "brew", "brio", "bslib", "callr", "cli", "coda", "commonmark", "cpp11", "curl", "data.table", "DBI", "DBItest", "dbplyr", "ddalpha", "deldir", "desc", "digest", "DT", "e1071", "earth", "fansi", "fastcluster", "filelock", "FNN", "future", "future.apply", "ggplot2", "gh", "globals", "glue", "gplots", "gtools", "hardhat", "Hmisc", "htmltools", "htmlwidgets", "httpuv", "httr2", "igraph", "interp", "knitr", "ks", "later", "lava", "lifecycle", "listenv", "littler", "lme4", "locfit", "lpSolve", "lwgeom", "maps", "markdown", "matrixStats", "mclust", "mets", "modeldata", "multicool", "munsell", "parallelly", "pkgbuild", "pkgdown", "pkgload", "plot3D", "plotmo", "processx", "progress", "promises", "ps", "quantmod", "R.oo", "ragg", "Rcpp", "RcppArmadillo", "RcppEigen", "RCurl", "readr", "recipes", "remotes", "rgl", "rlang", "rmarkdown", "robustbase", "roxygen2", "RPostgreSQL", "rsample", "RSQLite", "rstudioapi", "RUnit", "s2", "sandwich", "sass", "scales", "setRNG", "sf", "sfsmisc", "shape", "shiny", "sp", "spatstat", "spatstat.data", "spatstat.explore", "spatstat.geom", "spatstat.linnet", "spatstat.model", "spatstat.random", "stringi", "svglite", "systemfonts", "TeachingDemos", "terra", "textshaping", "tidyr", "tidyselect", "timechange", "timeDate", "timeSeries", "tinytex", "tseries", "usethis", "viridis", "vroom", "webfakes", "WGCNA", "withr", "xfun", "XML", "xts", "yaml", "zip" ), update = TRUE, ask = FALSE, force = TRUE)

more details: BiocManager::valid()$too_new, BiocManager::valid()$out_of_date

Warning message: 142 packages out-of-date; 0 packages too new

I tried updating after the suggesting from BiocManager and some failed.

Thanks for your help and time.

kdkorthauer commented 2 months ago

Looks like the error is in calling a function in the locfit package which isn't properly installed. Can you try running again after reinstalling locfit in a fresh R session?

desmodus1984 commented 2 months ago

Hi Dr. Korthauer.

I did and for some unknown reason it worked now. I am studying bumblebee epigenetics, and methylation is very low, with very low about 100%, with most of them highly unmethylated. image

Do you think that this can affect the performance of the algorithm? Perhaps, like high false-positives/false-negatives? Low-power to detect differential methylated regions?

Also, I wanted to ask you. Is it possible to filter sites and set a minimum and maximum coverage, like 10 - 500. Methylkit has an option to set minimum depth, and then filter based on percentile. My samples have sites with very low and very high depth, image

And I was wondering if it is possible to do such filtering.

Thank you very much;

Juan Pablo

kdkorthauer commented 2 months ago

Glad you were able to resolve that issue.

I am studying bumblebee epigenetics, and methylation is very low, with very low about 100%, with most of them highly unmethylated. Do you think that this can affect the performance of the algorithm? Perhaps, like high false-positives/false-negatives? Low-power to detect differential methylated regions?

The low variability in methylation levels means that you should expect overall signal (e.g. differential methylation) to be low. This is because, based on that distribution, the vast majority of sites are completely unmethylated in both groups, so there's little chance that the methylation levels would be different. The same would be true of almost all sites were 100% methylated.

Also, I wanted to ask you. Is it possible to filter sites and set a minimum and maximum coverage, like 10 - 500. Methylkit has an option to set minimum depth, and then filter based on percentile. My samples have sites with very low and very high depth

You could filter based on coverage manually if you'd like before running a differential methylation analysis. However, for dmrseq it's not recommended (beyond filtering out sites with no coverage in most samples), as the method already puts higher weight on sites with higher coverage.

Hope that helps!