hansenlab / bsseq

Devel repository for bsseq
36 stars 26 forks source link

BSmooth error #104

Open priyatamapandey opened 3 years ago

priyatamapandey commented 3 years ago

Hi, I am trying to use BSmooth function on the 16 WGBS samples and it showed the following error.

Screen Shot 2021-09-29 at 2 23 40 PM

Please suggest me what I need to correct? I used the following code for reading my 16 coverage files and then performing the BSmooth.

bismarkBSseq <- read.bismark(files = file.list,
                               colData = DataFrame(row.names = sampleName),
                                BPPARAM = bpparam,
                               rmZeroCov = FALSE,
                               strandCollapse = FALSE,
                               verbose = TRUE)

pData(bismarkBSseq) <- sample_covariate

bismarkBSseq.fit <- BSmooth(
    BSseq = bismarkBSseq, 
    BPPARAM = MulticoreParam(workers = 8), 
    verbose = TRUE)

Thank you so much for your help. Priya

PeteHaitch commented 3 years ago

Please share the output of BiocManager::valid()

kasperdanielhansen commented 3 years ago

You tend to get this kind of error when you have very short sequences in the genome, I think. I would look a bit at the length of especially smaller contigs. What organism are you using?

On Wed, Sep 29, 2021 at 6:36 PM Peter Hickey @.***> wrote:

Please share the output of BiocManager::valid()

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/hansenlab/bsseq/issues/104#issuecomment-930595397, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABF2DHZZ25DHALMC2WVRAH3UEOIFDANCNFSM5FA3E3PA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

-- Best, Kasper

priyatamapandey commented 3 years ago

Hi Pete and Hansen, Thanks so much for your reply. Output of BiocManager::valid() showed TRUE. I am working on human WGBS samples. I ran for chr22 only and it worked with warning (attached).

Screen Shot 2021-09-29 at 7 18 42 PM

Please suggest. Thank you, Priya

kasperdanielhansen commented 3 years ago
  1. Please print the BSseq22 object.
  2. If this is human data, why do you have 65M CpGs?

On Wed, Sep 29, 2021 at 10:20 PM Priyatama Pandey @.***> wrote:

Hi Pete and Hansen, Thanks so much for your reply. Output of BiocManager::valid() showed TRUE. I am working on human WGBS samples. I ran for chr22 only and it worked with warning (attached).

[image: Screen Shot 2021-09-29 at 7 18 42 PM] https://user-images.githubusercontent.com/17990280/135374935-4c0174ef-bba5-418b-983f-9f5cd679cd2a.png

Please suggest. Thank you, Priya

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/hansenlab/bsseq/issues/104#issuecomment-930695856, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABF2DH5QS6ZMZYRGNE4FRRTUEPCM7ANCNFSM5FA3E3PA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

-- Best, Kasper

PeteHaitch commented 3 years ago

~Also, please show the full output of BiocManager::valid(), which includes useful information like which version of R and Bioconductor you are using.~

Sorry, it only returns that info that if it returns FALSE. Instead, please share the output of sessionInfo() and BiocManager::version()

priyatamapandey commented 3 years ago

Hi Kasper,

  1. Screen Shot 2021-09-30 at 9 36 50 AM
  2. I am new in this project and I don't know why I have 65MCpGs? According to the literature, human WGBS data usually have 28millions CpGs so your question is why I have more CpGs or less? Thank you, Priya

priyatamapandey commented 3 years ago

Hi Pete, When I ran again it returns the session info.

Screen Shot 2021-09-30 at 9 27 48 AM Screen Shot 2021-09-30 at 9 25 33 AM

Thank you, Priya

PeteHaitch commented 3 years ago

Thanks, @priyatamapandey. Everything looks up to date, which makes debugging easier :)

Regarding the 65M CpGs, are your CpGs stranded or have they been aggregated across strands? The easiest way to determine this is to look at rowData(BSseq) and check:

  1. the strand column and
  2. if there are adjacent CpGs, e.g., chr1:4 and chr1:5 would likely refer to the same CpG but the first one on the + strand and the second on the - strand.

If you share the output of running rowData() on your object we might be able to help debug.

We typically aggregate across strands before smoothing CpG methylation, and bsseq::combineList() should be able to help with this (if needed).

Unfortunately, we've also experienced sporadic failures when smoothing large BSseq objects that can lead to these obscure error message, particularly when using parallelisation on a shared computing environment (e.g., a university cluster). One option is to smooth each chromosome separately and then rbind() the results. Smoothing is independent across chromosomes (and samples), so you might manually split up the object into smaller subsets and smooth and then rbind() and/or cbind() the results back together (I'd probably advise against subsetting within a chromosome because that might give slightly different results to smoothing the entire chromosome). It's a bit clunky but should work and allows you to re-smooth particular subsets if you experience one of these sporadic failures

priyatamapandey commented 3 years ago

Hi Pete,

Thank you for all the suggestions. So, do you think I should not use multicore option even for the chromosome wise smoothing. It have split up the object with using 3 cores/parallel and it failed with the following error.

Screen Shot 2021-10-01 at 11 37 49 AM

Then, I have tried chromosome2 without using multiple core. Here is my command.

bismarkBSseqOrdered2 <- orderBSseq(bismarkBSseq, seqOrder = "chr2" )

 bismarkBSseq.chr2 <- BSmooth(
    BSseq = bismarkBSseqOrdered2, 
    BPPARAM = SerialParam(progressbar = TRUE),
    verbose = TRUE)

It took few hours to complete and at the end it retuned error. Although for chr1 the above code worked after taking long hours. I am running on my iMAC for now which has 32gb memory and 8core. I do have 100 samples to process once I get successful, then I would probably use institute cluster to perform.

Here is the error for chr2 smoothing.

Screen Shot 2021-10-01 at 11 04 05 AM

I have also ran rowData() on the enitre BSseq object and on chr1 and chr2 BSseq object. Since it did not return any column information. I am attaching the BSseq object jist here. It might help to understand the stand information.

Screen Shot 2021-10-01 at 11 12 04 AM

This is the large BSseq object

Screen Shot 2021-10-01 at 11 14 40 AM

Thank you so much for your help, Priya

kasperdanielhansen commented 3 years ago

In your code example you def. specify a strand-specific analysis (strandCollapse = FALSE) which is pretty unusual and certainly not what I would recommend unless you have a very specific question in mind.

Even account for that, I think you have too many Cpgs.

On Fri, Oct 1, 2021 at 2:49 PM Priyatama Pandey @.***> wrote:

Hi Pete,

Thank you for all the suggestions. So, do you think I should not use multicore option even for the chromosome wise smoothing. It have split up the object with using 3 cores/parallel and it failed with the following error.

[image: Screen Shot 2021-10-01 at 11 37 49 AM] https://user-images.githubusercontent.com/17990280/135670539-876bf620-686c-4b33-bd9d-5b398a4adeee.png

Then, I have tried chromosome2 without using multiple core. Here is my command.

bismarkBSseqOrdered2 <- orderBSseq(bismarkBSseq, seqOrder = "chr2" )

bismarkBSseq.chr2 <- BSmooth( BSseq = bismarkBSseqOrdered2, BPPARAM = SerialParam(progressbar = TRUE), verbose = TRUE)

It took few hours to complete and at the end it retuned error. Although for chr1 the above code worked after taking long hours. I am running on my iMAC for now which has 32gb memory and 8core. I do have 100 samples to process once I get successful, then I would probably use institute cluster to perform.

Here is the error for chr2 smoothing.

[image: Screen Shot 2021-10-01 at 11 04 05 AM] https://user-images.githubusercontent.com/17990280/135669212-26e37f3a-3e91-45f2-8337-9fadeef5d55b.png

I have also ran rowData() on the enitre BSseq object and on chr1 and chr2 BSseq object. Since it did not return any column information. I am attaching the BSseq object jist here. It might help to understand the stand information.

[image: Screen Shot 2021-10-01 at 11 12 04 AM] https://user-images.githubusercontent.com/17990280/135669686-857e19f5-c3fa-43ee-a00b-5f2320c6f6b1.png

This is the large BSseq object

[image: Screen Shot 2021-10-01 at 11 14 40 AM] https://user-images.githubusercontent.com/17990280/135669600-f7aa89f8-9e45-491a-b8a3-8952050f3538.png

Thank you so much for your help, Priya

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/hansenlab/bsseq/issues/104#issuecomment-932472441, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABF2DH62ZLCNPHUXP4J37MLUEX7E7ANCNFSM5FA3E3PA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

-- Best, Kasper

priyatamapandey commented 3 years ago

Hi Kasper, Ok. I will change that. I found that example in read.bismark R documentation so thought to start with. Also, I checked my coverage file and found that I do have many more chromosome than Chr1 to chr22 and X and Y such as chr19_KI270883v1_alt, chr19_KI270890v1_alt, chr19_KI270919v1_alt, chrUn_KI270334v1, chrUn_KI270373v1, chrUn_KI270376v1.
I think that is causing many addition CpGs. I want to consider only chr1 to chr22 and chrX and chrY while reading files in the read.bismark but do not understand how to set that condition? Is there Loci parameter contain that option in read.bismark function?

Thank you, Priya

kasperdanielhansen commented 3 years ago

All of those weird looking contigs (chromosomes) is likely the cause of the error. What happens is that some of these contigs contains too few CpGs for the smoothing to work.

I strongly recommend the following. 1) recreate your objects with strandCollapse=TRUE to ensure you don't do a strand specific analysis 2) get rid of all the contigs (at least to start with); the easiest one is to only keep the standard autosomes.

That should resolve your error with 98% probability. I don't think you will have problems with parallelization although - as Pete has said - sometimes we see problems.

Best, Kasper

On Fri, Oct 1, 2021 at 6:20 PM Priyatama Pandey @.***> wrote:

Hi Kasper, Ok. I will change that. I found that example in read.bismark R documentation so thought to start with. Also, I checked my coverage file and found that I do have many more chromosome than Chr1 to chr22 and X and Y such as chr19_KI270883v1_alt, chr19_KI270890v1_alt, chr19_KI270919v1_alt, chrUn_KI270334v1, chrUn_KI270373v1, chrUn_KI270376v1. I think that is causing many addition CpGs. I want to consider only chr1 to chr22 and chrX and chrY while reading files in the read.bismark but do not understand how to set that condition? Is there Loci parameter contain that option in read.bismark function?

Thank you, Priya

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/hansenlab/bsseq/issues/104#issuecomment-932609950, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABF2DH5JKVFJSL7XUKGEUYLUEYX2DANCNFSM5FA3E3PA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

-- Best, Kasper

PeteHaitch commented 3 years ago

I want to consider only chr1 to chr22 and chrX and chrY while reading files in the read.bismark but do not understand how to set that condition? Is there Loci parameter contain that option in read.bismark function?

There's a section 'Specifying loci' in the documentation for read.bismark() (see help("read.bismark", package = "bsseq")), so you can do it that way or by subsetting your BSseq object to only the chromosomes you're interested in.

priyatamapandey commented 2 years ago

Happy New Year Dr. Hansen and Pete,

I have run bsmooth on the default setting and also want to customize the setting to see, whether different settings are improving our results but most of the setting getting failed at some point. I have divided my data chromosome wise and some of the setting is getting fail for chromosome 1, some for chromosome 2 etc.

Below is the setting list which I have already tried

Screen Shot 2022-01-05 at 3 43 01 PM

Here is the error returned

Screen Shot 2022-01-06 at 12 51 14 PM

Please suggest how to fix this issue. Any idea why it is throwing error?

Thank you so much for your help, Priya

PeteHaitch commented 2 years ago

It's very hard to know and I can only refer you to our previous comments. If the data are (strand-aggregated) CpGs from human samples then I think the default smoothing parameters are a sensible choice.

priyatamapandey commented 2 years ago

Thank you for your suggestion,

I am working on the DMRs now and want to understand the cutoff options. I don't understand how to choose cutoff vector? I calculated t-statistics and generated the following figure. I don't see much difference in corrected and uncorrected in figure.

Screen Shot 2022-01-24 at 8 26 47 PM Screen Shot 2022-01-24 at 8 27 02 PM Screen Shot 2022-01-24 at 8 32 09 PM

Please suggest me how to select cutoff while checking the t-statistics? Thank you, Priya

kir1to455 commented 2 years ago

Hi,Priya I also meet same problems about the parameters (ns=70,h=1000,maxGap=10^8) when I using Bsmooth . I found some chromosomes succeed but some chromosomes failed. image If you have solved this problems, can you give some advice? Thank you in advance, Kirito

priyatamapandey commented 2 years ago

HI Kirito, At default setting, BSmooth worked. I wanted to try BSmooth with different settings as my data is not cancer data but it did not work for the another setting mostly failed at chromosome 1. I would suggest if you are running on the cluster/HPC, try to provide more memory and keep minimum workers.

Best, Priya

PeteHaitch commented 2 years ago

We've used the default settings for many different human (and possibly mouse) sample types, not just cancer data. I think the recommendation would be not to change the default settings unless you have good reason.

kir1to455 commented 2 years ago

We've used the default settings for many different human (and possibly mouse) sample types, not just cancer data. I think the recommendation would be not to change the default settings unless you have good reason.

Hi Pete, I also meet some problems about the parameters (ns=70,h=1000,maxGap=10^8) when I using Bsmooth . I found some chromosomes succeed but some chromosomes failed. For example , chr 1 failed, chrX succeed ,chr21 succeed. chr1 image image chrX image chr21 image Also,the results with warning messages. Would you give some advice? Best wishes, Kirito

PeteHaitch commented 2 years ago

It's very hard without a reproducible example and further details. What species is your sample? What methylation context (e.g., CpGs)? What computing resources are you running on?

kir1to455 commented 2 years ago

It's very hard without a reproducible example and further details. What species is your sample? What methylation context (e.g., CpGs)? What computing resources are you running on?

Species is human, and context is GpC.I running it in my PC, and with 3.6GHz 6core processor.

PeteHaitch commented 2 years ago

Is it really GpC (e.g., a NOMe-Seq-like protocol) and not CpG methylation (e.g., a standard bisulfite-seq-like protocol)?

bsseq was initially designed for analysing CpG methylation data from bisulfite-sequencing protocols. This is what the default parameters to BSmooth() are aimed at (in particular, for human+mouse-like genomes). More recently, we have used bsseq to analyse CpH (e.g., CpA) methylation data from bisulfite-sequencing protocols in human samples, but with different smoothing parameters; importantly, getting satisfactory smoothing results was a lot of trial and error.

As far as I'm aware, neither @kasperdanielhansen or I have used it to analyse GpC methylation from NOMe-Seq-like data. And I very much doubt the default parameters would work out of the box since the distribution of GpCs along the genome is very different to the distribution of CpGs, as illustrated below:

> suppressPackageStartupMessages(library(bsseq))
> suppressPackageStartupMessages(library(BSgenome.Hsapiens.UCSC.hg38))
> # Identify all GpC and CpG dinucleotides on forward strand of chr1 in GRCh38
> gpc_chr1 <- findLoci("GC", BSgenome.Hsapiens.UCSC.hg38, "chr1", strand = "+")
> cpg_chr1 <- findLoci("CG", BSgenome.Hsapiens.UCSC.hg38, "chr1", strand = "+")
> # How many are there?
> length(gpc_chr1)
[1] 10145272
> length(cpg_chr1)
[1] 2375159
> # What is the distance between adjacent loci?
> par(mfrow = c(1, 2))
> plot(density(log10(start(gpc_chr1)[-1] - start(gpc_chr1)[-length(gpc_chr1)])), "GpC")
> plot(density(log10(start(cpg_chr1)[-1] - start(cpg_chr1)[-length(cpg_chr1)])), "CpG")

image

As bsseq is not designed for analysing GpC data and we have not tried this ourselves, we cannot really offer more substantial support. bsseq may be suitable, but my guess it's going to take some substantial research effort to identify appropriate parameters.

kasperdanielhansen commented 2 years ago

The smoothing code has problems when you have a lot of sites (CpGs / GpCs) and a small bandwidth relative to the number of sites. The potential way out of this is to set maxGap which will divide a chromosome into more than 1 "chunk" if there is a stretch of DNA without sites. The default values of maxGap is 10^8 which essentially mean "don't do this" and this works for the default setting and human chromosomes.

Because there are more GpCs you will probably need to set maxGap to something.

Also, before you run it genome wide I suggest to look at a couple of loci in great detail to get a feel for smoothing parameters.

Best, Kasper

On Wed, Feb 9, 2022 at 7:47 PM Peter Hickey @.***> wrote:

Is it really GpC (e.g., a NOMe-Seq https://www.illumina.com/science/sequencing-method-explorer/kits-and-arrays/nome-seq.html-like protocol) and not CpG methylation (e.g., a standard bisulfite-seq-like protocol)?

bsseq was initially designed for analysing CpG methylation data from bisulfite-sequencing protocols. This is what the default parameters to BSmooth() are aimed at (in particular, for human+mouse-like genomes). More recently, we have used bsseq to analyse CpH (e.g., CpA) methylation data from bisulfite-sequencing protocols in human samples, but with different smoothing parameters; importantly, getting satisfactory smoothing results was a lot of trial and error.

As far as I'm aware, neither @kasperdanielhansen https://github.com/kasperdanielhansen or I have used it to analyse GpC methylation from NOMe-Seq-like data. And I very much doubt the default parameters would work out of the box since the distribution of GpCs along the genome is very different to the distribution of CpGs, as illustrated below:

suppressPackageStartupMessages(library(bsseq))> suppressPackageStartupMessages(library(BSgenome.Hsapiens.UCSC.hg38))> # Identify all GpC and CpG dinucleotides on forward strand of chr1 in GRCh38> gpc_chr1 <- findLoci("GC", BSgenome.Hsapiens.UCSC.hg38, "chr1", strand = "+")> cpg_chr1 <- findLoci("CG", BSgenome.Hsapiens.UCSC.hg38, "chr1", strand = "+")> # How many are there?> length(gpc_chr1) [1] 10145272> length(cpg_chr1) [1] 2375159> # What is the distance between adjacent loci?> par(mfrow = c(1, 2))> plot(density(log10(start(gpc_chr1)[-1] - start(gpc_chr1)[-length(gpc_chr1)])), "GpC")> plot(density(log10(start(cpg_chr1)[-1] - start(cpg_chr1)[-length(cpg_chr1)])), "CpG")

[image: image] https://user-images.githubusercontent.com/1049741/153315106-b651ad04-5704-4c38-80e5-527c8edd862f.png

As bsseq is not designed for analysing GpC data and we have not tried this ourselves, we cannot really offer more substantial support. bsseq may be suitable, but my guess it's going to take some substantial research effort to identify appropriate parameters.

— Reply to this email directly, view it on GitHub https://github.com/hansenlab/bsseq/issues/104#issuecomment-1034365408, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABF2DHYFMSNQI7BN4RJKJUTU2MDJPANCNFSM5FA3E3PA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you were mentioned.Message ID: @.***>

-- Best, Kasper