gstricker / GenoGAM

http://bioconductor.org/packages/devel/bioc/html/GenoGAM.html
7 stars 4 forks source link

callPeaks peakType = 'broad' : Error in pnorm(-zscore, log.p = TRUE) #2

Closed Edert closed 4 years ago

Edert commented 5 years ago

Hi, I am currently testing genogam on my simulated ChIP-seq data. Is just a few peaks from chr19 mm10. It works fine until I use callPeaks with the peakType set to broad.

The log looks like this:

> library(GenoGAM)
Loading required package: SummarizedExperiment
Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, basename, cbind, colMeans,
    colnames, colSums, dirname, do.call, duplicated, eval, evalq,
    Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply,
    lengths, Map, mapply, match, mget, order, paste, pmax, pmax.int,
    pmin, pmin.int, Position, rank, rbind, Reduce, rowMeans, rownames,
    rowSums, sapply, setdiff, sort, table, tapply, union, unique,
    unsplit, which, which.max, which.min

Loading required package: S4Vectors

Attaching package: ‘S4Vectors’

The following object is masked from ‘package:base’:

    expand.grid

Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: DelayedArray
Loading required package: matrixStats

Attaching package: ‘matrixStats’

The following objects are masked from ‘package:Biobase’:

    anyMissing, rowMedians

Loading required package: BiocParallel

Attaching package: ‘DelayedArray’

The following objects are masked from ‘package:matrixStats’:

    colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges

The following objects are masked from ‘package:base’:

    aperm, apply

Loading required package: HDF5Array
Loading required package: rhdf5
Loading required package: Matrix

Attaching package: ‘Matrix’

The following object is masked from ‘package:S4Vectors’:

    expand

Loading required package: data.table

Attaching package: ‘data.table’

The following object is masked from ‘package:SummarizedExperiment’:

    shift

The following object is masked from ‘package:GenomicRanges’:

    shift

The following object is masked from ‘package:IRanges’:

    shift

The following objects are masked from ‘package:S4Vectors’:

    first, second

> BiocParallel::register(BiocParallel::MulticoreParam(worker = 15))
> ID<-c('S11','S12','S21','S22')
> file<-c('sample1-rep1_mm.bam','sample1-rep2_mm.bam','sample2-rep1_mm.bam','sample2-rep2_mm.bam')
> paired <- c(FALSE,FALSE,FALSE,FALSE)
> genotype<-c(0,0,1,1)
> expDesign <- data.frame(ID,file,paired,genotype)
> chunkSize <- 75000
> overhangSize <- 1000
> design = ~ s(x) + s(x, by = genotype)
> settings <- GenoGAMSettings(chromosomeList = 'chr19')
> ggd <- GenoGAMDataSet(expDesign, directory = '/', chunkSize = chunkSize, overhangSize = overhangSize, design=design, settings=settings, hdf5 = TRUE)
INFO [2019-03-13 07:20:34] Creating GenoGAMDataSet
INFO [2019-03-13 07:20:35] Reading in data
INFO [2019-03-13 07:20:35] Reading in S11
INFO [2019-03-13 07:20:44] Reading in S12
INFO [2019-03-13 07:20:51] Reading in S21
INFO [2019-03-13 07:20:56] Reading in S22
INFO [2019-03-13 07:21:03] Finished reading in data
INFO [2019-03-13 07:21:04] Writing chr19 to HDF5
INFO [2019-03-13 07:22:55] chr19 written
INFO [2019-03-13 07:22:57] GenoGAMDataSet created
> ggd <- computeSizeFactors(ggd)
INFO [2019-03-13 07:22:57] Computing size factors
INFO [2019-03-13 07:22:57] DONE
> result <- genogam(ggd)
INFO [2019-03-13 07:22:57] Initializing the model
INFO [2019-03-13 07:22:57] Done
INFO [2019-03-13 07:22:58] Estimating hyperparameters
  Nelder-Mead direct search function minimizer
function value for initial parameters = 1242.880953
  Scaled convergence tolerance is 1.85204e-05
Stepsize computed as 0.639693
BUILD              3 1249.677236 1233.402512
EXTENSION          5 1242.880953 1226.901914
LO-REDUCTION       7 1233.402512 1226.901914
LO-REDUCTION       9 1227.757696 1226.901914
HI-REDUCTION      11 1227.391931 1226.901914
REFLECTION        13 1227.360037 1226.817293
LO-REDUCTION      15 1226.978902 1226.817293
HI-REDUCTION      17 1226.901914 1226.817293
HI-REDUCTION      19 1226.824128 1226.816182
SHRINK            23 1226.870207 1226.816182
HI-REDUCTION      25 1226.816605 1226.816182
REFLECTION        27 1226.816260 1226.766515
HI-REDUCTION      29 1226.816182 1226.694118
HI-REDUCTION      31 1226.816029 1226.694118
LO-REDUCTION      33 1226.766520 1226.694118
REFLECTION        35 1226.766515 1226.694117
SHRINK            39 1226.816387 1226.694117
LO-REDUCTION      41 1226.816030 1226.694117
SHRINK            45 1226.816072 1226.694117
SHRINK            49 1226.816105 1226.694117
Exiting from Nelder Mead minimizer
    51 function evaluations used
INFO [2019-03-13 08:34:12] Done
INFO [2019-03-13 08:34:12] Fitting model
INFO [2019-03-13 08:34:12] Fitting chr19
INFO [2019-03-13 08:54:31] chr19 Done
INFO [2019-03-13 08:54:31] Processing Fits
INFO [2019-03-13 08:54:32] Processing done
INFO [2019-03-13 08:54:32] Finished
> computeSignificance(result)
INFO [2019-03-13 08:54:32] Computing positionwise p-values
INFO [2019-03-13 08:59:05] DONE
> peaks <- callPeaks(result, smooth = 's(x):genotype', threshold = 1)
INFO [2019-03-13 08:59:05] Calling narrow peaks
INFO [2019-03-13 10:27:30] DONE
> options(scipen = 999)
> write.table(file="mysPeaks.narrowPeak", as.data.frame(peaks$'s(x):genotype'),quote = FALSE, sep ='    ')
> peaksbr <- callPeaks(result, smooth = 's(x):genotype', threshold = 1,peakType = 'broad', cutoff = 0.75)
INFO [2019-03-13 10:27:30] Calling broad peaks
Error in pnorm(-zscore, log.p = TRUE) : 
  Non-numeric argument to mathematical function
Calls: callPeaks ... tryCatch -> tryCatchList -> tryCatchOne -> <Anonymous>
Execution halted

sessionInfo:

R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3
LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils    
[7] datasets  methods   base     

other attached packages:
 [1] GenoGAM_2.0.2               data.table_1.12.0          
 [3] Matrix_1.2-15               HDF5Array_1.10.1           
 [5] rhdf5_2.26.2                SummarizedExperiment_1.12.0
 [7] DelayedArray_0.8.0          BiocParallel_1.16.5        
 [9] matrixStats_0.54.0          Biobase_2.42.0             
[11] GenomicRanges_1.34.0        GenomeInfoDb_1.18.1        
[13] IRanges_2.16.0              S4Vectors_0.20.1           
[15] BiocGenerics_0.28.0        

loaded via a namespace (and not attached):
 [1] bit64_0.9-7              splines_3.5.2           
 [3] Formula_1.2-3            assertthat_0.2.0        
 [5] latticeExtra_0.6-28      blob_1.1.1              
 [7] Rsamtools_1.34.0         GenomeInfoDbData_1.2.0  
 [9] pillar_1.3.1             RSQLite_2.1.1           
[11] backports_1.1.3          lattice_0.20-38         
[13] glue_1.3.0               digest_0.6.18           
[15] RColorBrewer_1.1-2       XVector_0.22.0          
[17] checkmate_1.9.1          colorspace_1.4-0        
[19] htmltools_0.3.6          plyr_1.8.4              
[21] DESeq2_1.22.2            XML_3.98-1.16           
[23] pkgconfig_2.0.2          genefilter_1.64.0       
[25] zlibbioc_1.28.0          purrr_0.2.5             
[27] xtable_1.8-3             scales_1.0.0            
[29] htmlTable_1.13.1         tibble_2.0.1            
[31] annotate_1.60.0          ggplot2_3.1.0           
[33] nnet_7.3-12              lazyeval_0.2.1          
[35] survival_2.43-3          magrittr_1.5            
[37] crayon_1.3.4             memoise_1.1.0           
[39] foreign_0.8-71           tools_3.5.2             
[41] formatR_1.5              stringr_1.3.1           
[43] Rhdf5lib_1.4.2           locfit_1.5-9.1          
[45] munsell_0.5.0            cluster_2.0.7-1         
[47] lambda.r_1.2.3           Biostrings_2.50.2       
[49] AnnotationDbi_1.44.0     bindrcpp_0.2.2          
[51] compiler_3.5.2           rlang_0.3.1             
[53] futile.logger_1.4.3      grid_3.5.2              
[55] RCurl_1.95-4.11          rstudioapi_0.9.0        
[57] htmlwidgets_1.3          bitops_1.0-6            
[59] base64enc_0.1-3          gtable_0.2.0            
[61] DBI_1.0.0                R6_2.3.0                
[63] GenomicAlignments_1.18.1 gridExtra_2.3           
[65] knitr_1.21               dplyr_0.7.8             
[67] bit_1.1-14               bindr_0.1.1             
[69] Hmisc_4.1-1              futile.options_1.0.1    
[71] stringi_1.2.4            Rcpp_1.0.0              
[73] geneplotter_1.60.0       rpart_4.1-13            
[75] acepack_1.4.1            tidyselect_0.2.5        
[77] xfun_0.4
gstricker commented 5 years ago

Hi,

thanks for using GenoGAM. The peak calling functionality was completely rewritten and it still lacks some code to ensure robustness. Usually the pnorm function can handle a lot of non-standard inputs (like missing values or NaNs), so for some reason it gets input that cannot be converted to numerics.

Unfortunately I didn't implement enough debug messages in the peak calling functions yet to make debugging easy, so it will need some more effort to figure this out. Is there a feasible way to share your data with me (if it's not too large)? This would be the fastest way. Otherwise I would kindly ask you to execute the following debugging steps and post me the result:

  1. Just before you run callPeaks in broad mode activate debug mode: debug(GenoGAM:::.callBroadPeaks_hdf5) and debug(GenoGAM:::.callBroadPeaks_split_hdf5) I'm not entirely sure which of those functions will be used, but from the log I'm assuming the latter.

  2. Run callPeaks in broad mode: peaksbr <- callPeaks(result, smooth = 's(x):genotype', threshold = 1,peakType = 'broad', cutoff = 0.75)

  3. As soon as one of the above functions is called you will enter debug mode. You can proceed line by line by just hitting "Enter" or 'n' (for next).

  4. Usually the next line to be executed will be shown to you before you can execute it. Wait till the line that throws the error is shown: pvals <- -pnorm(-zscore, log.p = TRUE)

  5. Before executing the above line type str(zscore) to take a look at the structure of the zscore variable. Copy and post results here.

Thx, Juri

Edert commented 5 years ago

Hi,

thanks for the quick reply!

This is what I get:

debug: pvals <- -pnorm(-zscore, log.p = TRUE)
Browse[2]> str(zscore)
Formal class 'DelayedMatrix' [package "DelayedArray"] with 1 slot
  ..@ seed:Formal class 'DelayedNaryIsoOp' [package "DelayedArray"] with 3 slots
  .. .. ..@ OP   :function (e1, e2)  
  .. .. ..@ Rargs: list()
  .. .. ..@ seeds:List of 2
  .. .. .. ..$ :Formal class 'DelayedDimnames' [package "DelayedArray"] with 2 slots
  .. .. .. .. .. ..@ dimnames:List of 2
  .. .. .. .. .. .. ..$ : int -1
  .. .. .. .. .. .. ..$ : chr "s(x):genotype"
  .. .. .. .. .. ..@ seed    :Formal class 'DelayedUnaryIsoOpStack' [package "DelayedArray"] with 2 slots
  .. .. .. .. .. .. .. ..@ OPS :List of 1
  .. .. .. .. .. .. .. .. ..$ :function (a)  
  .. .. .. .. .. .. .. ..@ seed:Formal class 'DelayedSubset' [package "DelayedArray"] with 2 slots
  .. .. .. .. .. .. .. .. .. ..@ index:List of 2
  .. .. .. .. .. .. .. .. .. .. ..$ : NULL
  .. .. .. .. .. .. .. .. .. .. ..$ : int 2
  .. .. .. .. .. .. .. .. .. ..@ seed :Formal class 'HDF5ArraySeed' [package "HDF5Array"] with 5 slots
  .. .. .. .. .. .. .. .. .. .. .. ..@ filepath : chr "/tmp/RtmpJ6LhJ4/HDF5Array_dump/fits_chr19_20190314.h5"
  .. .. .. .. .. .. .. .. .. .. .. ..@ name     : chr "fits"
  .. .. .. .. .. .. .. .. .. .. .. ..@ dim      : int [1:2] 61431566 2
  .. .. .. .. .. .. .. .. .. .. .. ..@ first_val: num 0
  .. .. .. .. .. .. .. .. .. .. .. ..@ chunkdim : int [1:2] 75000 2
  .. .. .. ..$ :Formal class 'DelayedDimnames' [package "DelayedArray"] with 2 slots
  .. .. .. .. .. ..@ dimnames:List of 2
  .. .. .. .. .. .. ..$ : int -1
  .. .. .. .. .. .. ..$ : chr "s(x):genotype"
  .. .. .. .. .. ..@ seed    :Formal class 'DelayedUnaryIsoOpStack' [package "DelayedArray"] with 2 slots
  .. .. .. .. .. .. .. ..@ OPS :List of 3
  .. .. .. .. .. .. .. .. ..$ :function (a)  
  .. .. .. .. .. .. .. .. ..$ :function (a)  
  .. .. .. .. .. .. .. .. ..$ :function (a)  
  .. .. .. .. .. .. .. ..@ seed:Formal class 'DelayedSubset' [package "DelayedArray"] with 2 slots
  .. .. .. .. .. .. .. .. .. ..@ index:List of 2
  .. .. .. .. .. .. .. .. .. .. ..$ : NULL
  .. .. .. .. .. .. .. .. .. .. ..$ : int 2
  .. .. .. .. .. .. .. .. .. ..@ seed :Formal class 'HDF5ArraySeed' [package "HDF5Array"] with 5 slots
  .. .. .. .. .. .. .. .. .. .. .. ..@ filepath : chr "/tmp/RtmpJ6LhJ4/HDF5Array_dump/fits_chr19_20190314.h5"
  .. .. .. .. .. .. .. .. .. .. .. ..@ name     : chr "ses"
  .. .. .. .. .. .. .. .. .. .. .. ..@ dim      : int [1:2] 61431566 2
  .. .. .. .. .. .. .. .. .. .. .. ..@ first_val: num 0.103
  .. .. .. .. .. .. .. .. .. .. .. ..@ chunkdim : int [1:2] 75000 2

If this does not help I will try to find a set which is rather small to share it with you, any preferences where I should upload it?

Best, Thomas

gstricker commented 5 years ago

Hi Thomas,

sorry for the late reply, I was gone over the weekend. It seems zscore is a DelayedMatrix instead of a numeric vector.

Could you repeat the same steps, but instead of str(zscore), do pvals <- -pnorm(-zscore[,1], log.p = TRUE). This should convert the HDF5Matrix into a numeric vector and compute the pvalues.

If this works, just do zscore <- zscore[,1] and then continue the debug mode till the end (it should be just a few more lines). The final object res should then be a data.table of broad peaks.

Let me know if this works and I will push the change as a bug fix to Bioconductor.

Best, Juri

Edert commented 5 years ago

Hi Juri,

yes, looks like its working now. At least I get p-values and therefor an peaksbr object.

str(pvals)
 num [1:61431566] 0.678 0.678 0.678 0.678 0.678 ...

I also checked some of the regions with the highest score in the alignment and could see a signal there.

Best, Thomas

gstricker commented 5 years ago

Perfect. I have now pushed the changes. I hope that works now

Best, Juri

PS: I noticed for broad peak calling you are using the same parameters as in the vignette. Those are just to demonstrate the functionality. A cutoff of 0.75 and threshold of 1, will pretty much return you the entire chromosome.