hansenlab / bsseq

Devel repository for bsseq
36 stars 26 forks source link

BSmooth.tstat issue #120

Closed sara-sajjadi closed 12 months ago

sara-sajjadi commented 1 year ago

Hi, I'm getting the following error while computing t-statistics with BSmooth.tstat : Error in h(simpleError(msg, call)) : error in evaluating the argument 'args' in selecting a method for function 'do.call': need at least two non-NA values to interpolate. There are no NAs in my dataset.

Can you please help with troubleshooting this?

> bs.fit
An object of type 'BSseq' with
  24298794 methylation loci
  11 samples
has been smoothed with
  BSmooth (ns = 70, h = 1000, maxGap = 100000000) 
All assays are in-memory

> summary(is.na(getCoverage(bs.fit, type = "M")))
     1001             1002             1003             1004             1005             1006             1007             1008         
 Mode :logical    Mode :logical    Mode :logical    Mode :logical    Mode :logical    Mode :logical    Mode :logical    Mode :logical   
 FALSE:24298794   FALSE:24298794   FALSE:24298794   FALSE:24298794   FALSE:24298794   FALSE:24298794   FALSE:24298794   FALSE:24298794  
    1009             1010             1011         
 Mode :logical    Mode :logical    Mode :logical   
 FALSE:24298794   FALSE:24298794   FALSE:24298794  

> BS.tstat <- BSmooth.tstat(bs.fit, 
+                                     group1 = c("1002", "1003", "1005", "1006", "1007", "1008"),
+                                     group2 = c("1001", "1004", "1009",    "1010", "1011"), 
+                                     estimate.var = "same",
+                                     local.correct = TRUE,
+                                     mc.cores = 1,
+                                     verbose = TRUE)
[BSmooth.tstat] preprocessing ... done in 23.9 sec
[BSmooth.tstat] computing stats within groups ... done in 24.6 sec
[BSmooth.tstat] computing stats across groups ... Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'args' in selecting a method for function 'do.call': need at least two non-NA values to interpolate

> sessionInfo()
R version 4.3.0 (2023-04-21 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_India.utf8  LC_CTYPE=English_India.utf8    LC_MONETARY=English_India.utf8 LC_NUMERIC=C                  
[5] LC_TIME=English_India.utf8    

time zone: Asia/Calcutta
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] readxl_1.4.2                BiocParallel_1.34.2         bsseq_1.36.0                SummarizedExperiment_1.30.1
 [5] Biobase_2.60.0              MatrixGenerics_1.12.0       matrixStats_0.63.0          GenomicRanges_1.52.0       
 [9] GenomeInfoDb_1.36.0         IRanges_2.34.0              S4Vectors_0.38.1            BiocGenerics_0.46.0        

loaded via a namespace (and not attached):
 [1] rjson_0.2.21              rhdf5_2.44.0              lattice_0.21-8            rhdf5filters_1.12.1       vctrs_0.6.2              
 [6] tools_4.3.0               bitops_1.0-7              parallel_4.3.0            tibble_3.2.1              fansi_1.0.4              
[11] pkgconfig_2.0.3           R.oo_1.25.0               Matrix_1.5-4              data.table_1.14.8         BSgenome_1.68.0          
[16] sparseMatrixStats_1.12.0  lifecycle_1.0.3           GenomeInfoDbData_1.2.10   compiler_4.3.0            Rsamtools_2.16.0         
[21] Biostrings_2.68.1         munsell_0.5.0             codetools_0.2-19          permute_0.9-7             snow_0.4-4               
[26] RCurl_1.98-1.12           yaml_2.3.7                pillar_1.9.0              crayon_1.5.2              R.utils_2.12.2           
[31] DelayedArray_0.26.3       limma_3.56.1              gtools_3.9.4              locfit_1.5-9.7            restfulr_0.0.15          
[36] grid_4.3.0                colorspace_2.1-0          cli_3.6.1                 magrittr_2.0.3            S4Arrays_1.0.4           
[41] XML_3.99-0.14             utf8_1.2.3                DelayedMatrixStats_1.22.0 scales_1.2.1              XVector_0.40.0           
[46] cellranger_1.1.0          R.methodsS3_1.8.2         HDF5Array_1.28.1          BiocIO_1.10.0             rtracklayer_1.60.0       
[51] rlang_1.1.1               Rcpp_1.0.10               glue_1.6.2                BiocManager_1.30.20       rstudioapi_0.14          
[56] R6_2.5.1                  Rhdf5lib_1.22.0           GenomicAlignments_1.36.0  zlibbioc_1.46.0          
> BiocManager::valid()
'getOption("repos")' replaces Bioconductor standard repositories, see 'help("repositories", package = "BiocManager")' for details.
Replacement repositories:
    CRAN: http://cran.rstudio.com/
[1] TRUE
PeteHaitch commented 1 year ago

On the face of it, I'm not sure what's happening, but I wonder if there is a region where there are too few CpGs and thus the 'interpolation' error. Could you try re-running on just a subset of the CpGs (e.g., just those on a single chromosome, perhaps a small chromosome like chr22 if your data are human) and reporting back.

sara-sajjadi commented 1 year ago

Thanks @PeteHaitch ! It works alright with only chr22 data. I tried running the BSmooth.tstat with local.correct = FALSE on the whole dataset, it worked and here is what I found:

> summary(is.na(BS.tstat@stats))
 rawSds          tstat.sd        group2.means     group1.means       tstat         
Mode :logical    Mode :logical    Mode :logical    Mode :logical    Mode :logical    
FALSE:24297269   FALSE:24297269   FALSE:24297763   FALSE:24297783   FALSE:24297269  
TRUE :1525       TRUE :1525       TRUE :1031       TRUE :1011       TRUE :1525      

Is there a way to fix this without having to subset the original file?

PeteHaitch commented 1 year ago

How many CpGs per chromosome do you have? Something like table(seqnames(bs.fit)) will tell you. I'm guessing there's a chromosome with very few CpGs and this is causing issues when the smoothing is being run during local.correct (although I'm not super familiar with that code)

PeteHaitch commented 1 year ago

Looking at table(seqnames(BS.tstat@gr), is.na(BS.tstat@stats[, "rawSds"])) etc. may also be informative to figure out where the NA values are occurring.

sara-sajjadi commented 1 year ago

Yes, that's right, chromosomes with very few CpGs have NAs! Thanks a lot for the help.

kasperdanielhansen commented 1 year ago

This issue keeps popping up. We should implement a pre-check on the number of CpGs per cluster (in this case: sequence contig or chromosome).

Best, Kasper

On Tue, May 30, 2023 at 5:05 AM Sara @.***> wrote:

Yes, that's right, chromosomes with very few CpGs have NAs! Thanks a lot for the help.

— Reply to this email directly, view it on GitHub https://github.com/hansenlab/bsseq/issues/120#issuecomment-1568059865, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABF2DHZZEDGSI32QPUOS4R3XIWZ3ZANCNFSM6AAAAAAYQ74MWY . You are receiving this because you are subscribed to this thread.Message ID: @.***>

-- Best, Kasper

sahuno commented 1 year ago

sorry to bother you all @kasperdanielhansen @sara-sajjadi @PeteHaitch Que - but how did you remove non-NA values or how did you solve it? i tried filtering out regions with less than 2x coverage in each group but that didn't work could you please provide some insights? thanks so much in advance!!!

############
#filter out regions where 2 or more samples have less than 2x coverage
############
BS.cov <- getCoverage(bismark_bsseq)
keepLoci.ex <- which(rowSums(BS.cov[, bismark_bsseq.fit$Type == "Tumor"] >= 2) >= 2 &
                     rowSums(BS.cov[, bismark_bsseq.fit$Type == "Normal"] >= 2) >= 2)
length(keepLoci.ex)
bismark_bsseq.fit <- bismark_bsseq.fit[keepLoci.ex,]
> bismark_bsseq.fit
An object of type 'BSseq' with
  51670328 methylation loci
  4 samples
has been smoothed with
  BSmooth (ns = 70, h = 1000, maxGap = 100000000) 
Some assays are HDF5Array-backed

##############
#do tstats
###############
> bismark_bsseq.tstat <- BSmooth.tstat(bismark_bsseq.fit, 
+                                     group1 = c("SA123T_1","SA123T_2"),
+                                     group2 = c("SA123N_1","SA123N_2"), 
+                                     estimate.var = "group2",
+                                     local.correct = TRUE,
+                                     verbose = TRUE)
[BSmooth.tstat] preprocessing ... done in 27.2 sec
[BSmooth.tstat] computing stats within groups ... done in 23.4 sec
[BSmooth.tstat] computing stats across groups ... Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'args' in selecting a method for function 'do.call': need at least two non-NA values to interpolate

############
####check counts of chromosome level methylation, looks ok
############
> table(seqnames(bismark_bsseq.fit))

      1       2       3       4       5       6       7       8       9      10 
4249685 4046504 3068293 2720217 2734683 2726634 2868261 2442494 2175831 2519427 
     11      12      13      14      15      16      17      18      19      20 
2395826 2422185 1451962 1592770 1539470 2032957 2150986 1258984 1901428 1375109 
     21      22       X       Y      MT 
 703589 1050461 2235845    5854     873 
kasperdanielhansen commented 1 year ago

So this is caused by the local.correct=TRUE setting in your call to BSmooth.tstat, which fits a very large bandwidth smoother. Whether or not you need this, depends on your application area, but I would not be surprised to this fail on the MT chromosome. Just to show that it works on the regular chromosomes, I would remove Y and MT and try again. (In general, you want to think about DNAm on the sex chromosomes due to X inactivation).

Best, Kasper

On Thu, Jun 8, 2023 at 2:05 PM Samuel Terkper Ahuno < @.***> wrote:

sorry to bother you all @kasperdanielhansen https://github.com/kasperdanielhansen @sara-sajjadi https://github.com/sara-sajjadi @PeteHaitch https://github.com/PeteHaitch Que - but how did you remove non-NA values or how did you solve it? i tried filtering out regions with less than 2x coverage in each group but that didn't work could you please provide some insights? thanks so much in advance!!!

############

filter out regions where 2 or more samples have less than 2x coverage

############ BS.cov <- getCoverage(bismark_bsseq) keepLoci.ex <- which(rowSums(BS.cov[, bismark_bsseq.fit$Type == "Tumor"] >= 2) >= 2 & rowSums(BS.cov[, bismark_bsseq.fit$Type == "Normal"] >= 2) >= 2) length(keepLoci.ex) bismark_bsseq.fit <- bismark_bsseq.fit[keepLoci.ex,]

bismark_bsseq.fit An object of type 'BSseq' with 51670328 methylation loci 4 samples has been smoothed with BSmooth (ns = 70, h = 1000, maxGap = 100000000) Some assays are HDF5Array-backed

##############

do tstats

###############

bismark_bsseq.tstat <- BSmooth.tstat(bismark_bsseq.fit,

  • group1 = c("SA123T_1","SA123T_2"),
  • group2 = c("SA123N_1","SA123N_2"),
  • estimate.var = "group2",
  • local.correct = TRUE,
  • verbose = TRUE) [BSmooth.tstat] preprocessing ... done in 27.2 sec [BSmooth.tstat] computing stats within groups ... done in 23.4 sec [BSmooth.tstat] computing stats across groups ... Error in h(simpleError(msg, call)) : error in evaluating the argument 'args' in selecting a method for function 'do.call': need at least two non-NA values to interpolate

############

check counts of chromosome level methylation, looks ok

############

table(seqnames(bismark_bsseq.fit))

  1       2       3       4       5       6       7       8       9      10

4249685 4046504 3068293 2720217 2734683 2726634 2868261 2442494 2175831 2519427 11 12 13 14 15 16 17 18 19 20 2395826 2422185 1451962 1592770 1539470 2032957 2150986 1258984 1901428 1375109 21 22 X Y MT 703589 1050461 2235845 5854 873

— Reply to this email directly, view it on GitHub https://github.com/hansenlab/bsseq/issues/120#issuecomment-1583110931, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABF2DH3H36KFK4RPB242SZ3XKIH7PANCNFSM6AAAAAAYQ74MWY . You are receiving this because you were mentioned.Message ID: @.***>

-- Best, Kasper

sahuno commented 1 year ago

thanks! I'll give it a try! -S

DengEr-1993 commented 12 months ago

Hi, I'm getting the following error while computing t-statistics with BSmooth.tstat : Error in h(simpleError(msg, call)) : error in evaluating the argument 'args' in selecting a method for function 'do.call': need at least two non-NA values to interpolate. There are no NAs in my dataset.

Can you please help with troubleshooting this?

> bs.fit
An object of type 'BSseq' with
  24298794 methylation loci
  11 samples
has been smoothed with
  BSmooth (ns = 70, h = 1000, maxGap = 100000000) 
All assays are in-memory

> summary(is.na(getCoverage(bs.fit, type = "M")))
     1001             1002             1003             1004             1005             1006             1007             1008         
 Mode :logical    Mode :logical    Mode :logical    Mode :logical    Mode :logical    Mode :logical    Mode :logical    Mode :logical   
 FALSE:24298794   FALSE:24298794   FALSE:24298794   FALSE:24298794   FALSE:24298794   FALSE:24298794   FALSE:24298794   FALSE:24298794  
    1009             1010             1011         
 Mode :logical    Mode :logical    Mode :logical   
 FALSE:24298794   FALSE:24298794   FALSE:24298794  

> BS.tstat <- BSmooth.tstat(bs.fit, 
+                                     group1 = c("1002", "1003", "1005", "1006", "1007", "1008"),
+                                     group2 = c("1001", "1004", "1009",  "1010", "1011"), 
+                                     estimate.var = "same",
+                                     local.correct = TRUE,
+                                     mc.cores = 1,
+                                     verbose = TRUE)
[BSmooth.tstat] preprocessing ... done in 23.9 sec
[BSmooth.tstat] computing stats within groups ... done in 24.6 sec
[BSmooth.tstat] computing stats across groups ... Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'args' in selecting a method for function 'do.call': need at least two non-NA values to interpolate

> sessionInfo()
R version 4.3.0 (2023-04-21 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_India.utf8  LC_CTYPE=English_India.utf8    LC_MONETARY=English_India.utf8 LC_NUMERIC=C                  
[5] LC_TIME=English_India.utf8    

time zone: Asia/Calcutta
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] readxl_1.4.2                BiocParallel_1.34.2         bsseq_1.36.0                SummarizedExperiment_1.30.1
 [5] Biobase_2.60.0              MatrixGenerics_1.12.0       matrixStats_0.63.0          GenomicRanges_1.52.0       
 [9] GenomeInfoDb_1.36.0         IRanges_2.34.0              S4Vectors_0.38.1            BiocGenerics_0.46.0        

loaded via a namespace (and not attached):
 [1] rjson_0.2.21              rhdf5_2.44.0              lattice_0.21-8            rhdf5filters_1.12.1       vctrs_0.6.2              
 [6] tools_4.3.0               bitops_1.0-7              parallel_4.3.0            tibble_3.2.1              fansi_1.0.4              
[11] pkgconfig_2.0.3           R.oo_1.25.0               Matrix_1.5-4              data.table_1.14.8         BSgenome_1.68.0          
[16] sparseMatrixStats_1.12.0  lifecycle_1.0.3           GenomeInfoDbData_1.2.10   compiler_4.3.0            Rsamtools_2.16.0         
[21] Biostrings_2.68.1         munsell_0.5.0             codetools_0.2-19          permute_0.9-7             snow_0.4-4               
[26] RCurl_1.98-1.12           yaml_2.3.7                pillar_1.9.0              crayon_1.5.2              R.utils_2.12.2           
[31] DelayedArray_0.26.3       limma_3.56.1              gtools_3.9.4              locfit_1.5-9.7            restfulr_0.0.15          
[36] grid_4.3.0                colorspace_2.1-0          cli_3.6.1                 magrittr_2.0.3            S4Arrays_1.0.4           
[41] XML_3.99-0.14             utf8_1.2.3                DelayedMatrixStats_1.22.0 scales_1.2.1              XVector_0.40.0           
[46] cellranger_1.1.0          R.methodsS3_1.8.2         HDF5Array_1.28.1          BiocIO_1.10.0             rtracklayer_1.60.0       
[51] rlang_1.1.1               Rcpp_1.0.10               glue_1.6.2                BiocManager_1.30.20       rstudioapi_0.14          
[56] R6_2.5.1                  Rhdf5lib_1.22.0           GenomicAlignments_1.36.0  zlibbioc_1.46.0          
> BiocManager::valid()
'getOption("repos")' replaces Bioconductor standard repositories, see 'help("repositories", package = "BiocManager")' for details.
Replacement repositories:
    CRAN: http://cran.rstudio.com/
[1] TRUE

Hello, could you tell me how to combine your 11 samples together and their original format ?

PeteHaitch commented 12 months ago

@DengEr-1993 please post as a new issue with a reproducible example.