vplagnol / ExomeDepth

ExomeDepth R package for the detection of copy number variants in exomes and gene panels using high throughput DNA sequencing data.

Error at the end of getBam Counts and output is not S4. #22

Closed nelsonchanhk closed 4 years ago

nelsonchanhk commented 4 years ago

Hello, I'm new to ExomeDepth and am currently trying it on my custom targeted capture panel. At the end of getBamCounts, the following warnings show up.

Now parsing ~/OPERATION/RedCellNGS_hg19/BAM_testing/Normal9.bam
Parsing chromosome chr1
Parsing chromosome chr10
Parsing chromosome chr11
Parsing chromosome chr12
Parsing chromosome chr13
Parsing chromosome chr14
Parsing chromosome chr15
Parsing chromosome chr16
Parsing chromosome chr17
Parsing chromosome chr18
Parsing chromosome chr19
Parsing chromosome chr2
Parsing chromosome chr20
Parsing chromosome chr21
Parsing chromosome chr22
Parsing chromosome chr3
Parsing chromosome chr4
Parsing chromosome chr5
Parsing chromosome chr6
Parsing chromosome chr7
Parsing chromosome chr8
Parsing chromosome chr9
Number of counted fragments : 836974
There were 50 or more warnings (use warnings() to see the first 50)

warnings()
Warning messages:
1: In .Seqinfo.mergexy(x, y) : Each of the 2 combined objects has sequence levels not in the other:

  • in 'x': 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22
  • in 'y': chrX, chrY, chr6_ssto_hap7, chr6_mcf_hap5, chr6_cox_hap2, chr6_mann_hap4, chr6_apd_hap1, chr6_qbl_hap6, chr6_dbb_hap3, chr17_ctg5_hap1, chr4_ctg9_hap1, chr1_gl000192_random, chrUn_gl000225, chr4_gl000194_random, chr4_gl000193_random, chr9_gl000200_random, chrUn_gl000222, chrUn_gl000212, chr7_gl000195_random, chrUn_gl000223, chrUn_gl000224, chrUn_gl000219, chr17_gl000205_random, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chr9_gl000199_random, chrUn_gl000211, chrUn_gl000213, chrUn_gl000220, chrUn_gl000218, chr19_gl000209_random, chrUn_gl000221, chrUn_gl000214, chrUn_gl000228, chrUn_gl000227, chr1_gl000191_random, chr19_gl000208_random, chr9_gl000198_random, chr17_gl000204_random, chrUn_gl000233, chrUn_gl000237, chrUn_gl000230, chrUn_gl000242, chrUn_gl000243, chrUn_gl000241, chrUn_gl000236, chrUn_gl000240, chr17_gl0 [... truncated]
2: In .Seqinfo.mergexy(x, y) : Each of the 2 combined objects has sequence levels not in the other:
  • in 'x': 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22
  • in 'y': chrX, chrY, chr6_ssto_hap7, chr6_mcf_hap5, chr6_cox_hap2, chr6_mann_hap4, chr6_apd_hap1, chr6_qbl_hap6, chr6_dbb_hap3, chr17_ctg5_hap1, chr4_ctg9_hap1, chr1_gl000192_random, chrUn_gl000225, chr4_gl000194_random, chr4_gl000193_random, chr9_gl000200_random, chrUn_gl000222, chrUn_gl000212, chr7_gl000195_random, chrUn_gl000223, chrUn_gl000224, chrUn_gl000219, chr17_gl000205_random, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chr9_gl000199_random, chrUn_gl000211, chrUn_gl000213, chrUn_gl000220, chrUn_gl000218, chr19_gl000209_random, chrUn_gl000221, chrUn_gl000214, chrUn_gl000228, chrUn_gl000227, chr1_gl000191_random, chr19_gl000208_random, chr9_gl000198_random, chr17_gl000204_random, chrUn_gl000233, chrUn_gl000237, chrUn_gl000230, chrUn_gl000242, chrUn_gl000243, chrUn_gl000241, chrUn_gl000236, chrUn_gl000240, chr17_gl0 [... truncated]
3: In .Seqinfo.mergexy(x, y) : Each of the 2 combined objects has sequence levels not in the other:
  • in 'x': 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22
  • in 'y': chrX, chrY, chr6_ssto_hap7, chr6_mcf_hap5, chr6_cox_hap2, chr6_mann_hap4, chr6_apd_hap1, chr6_qbl_hap6, chr6_dbb_hap3, chr17_ctg5_hap1, chr4_ctg9_hap1, chr1_gl000192_random, chrUn_gl000225, chr4_gl000194_random, chr4_gl000193_random, chr9_gl000200_random, chrUn_gl000222, chrUn_gl000212, chr7_gl000195_random, chrUn_gl000223, chrUn_gl000224, chrUn_gl000219, chr17_gl000205_random, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chr9_gl000199_random, chrUn_gl000211, chrUn_gl000213, chrUn_gl000220, chrUn_gl000218, chr19_gl000209_random, chrUn_gl000221, chrUn_gl000214, chrUn_gl000228, chrUn_gl000227, chr1_gl000191_random, chr19_gl000208_random, chr9_gl000198_random, chr17_gl000204_random, chrUn_gl000233, chrUn_gl000237, chrUn_gl000230, chrUn_gl000242, chrUn_gl000243, chrUn_gl000241, chrUn_gl000236, chrUn_gl000240, chr17_gl0 [... truncated]
4: In .Seqinfo.mergexy(x, y) :

Furthermore, the output of getBamCounts in my case is a list and when I tried to convert it to a data frame, it failed with the following error.

ExomeCount.dafr <- as(ExomeCount, 'data.frame')
Error in as(ExomeCount, "data.frame") :
  internal problem in as(): “tbl_df” is(object, "data.frame") is TRUE, but the metadata asserts that the 'is' relation is FALSE
typeof(ExomeCount)
[1] "list"
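For anyone hitting the same coercion failure, a minimal workaround sketch (not from the thread) is to drop the tibble classes with base R instead of as(), assuming the returned object is a tibble-flavoured data frame:

## Sketch of a workaround: as.data.frame() strips the tbl_df classes
## and leaves a plain base-R data frame.
ExomeCount.dafr <- as.data.frame(ExomeCount, stringsAsFactors = FALSE)
class(ExomeCount.dafr)  # expected: "data.frame"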

My script is as follows.

data(exons.hg19)
hg19 <- file.path("~/OPERATION/RedCellNGS_hg19/ref", "hg19.fa")
bamName <- list.files("~/OPERATION/RedCellNGS_hg19/BAM", pattern = '*.bam$')
bamFile <- file.path("~/OPERATION/RedCellNGS_hg19/BAM", bamName)
ExomeCount <- getBamCounts(bed.frame = exons.hg19, bam.files = bamFile, include.chr = TRUE, referenceFasta = hg19)
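The seqlevel warnings above usually point at a chromosome-naming mismatch between the BAM header and the target data frame. A quick diagnostic sketch (not part of the original report; it reuses the bamFile and exons.hg19 objects from the script above):

## Compare the chromosome names stored in the BAM header with those in the targets.
library(Rsamtools)
bam.header <- scanBamHeader(bamFile[1])[[1]]$targets  # named vector: seqname -> length
head(names(bam.header))               # e.g. "chr1" "chr2" ... if the BAM uses the "chr" prefix
head(unique(exons.hg19$chromosome))   # exons.hg19 ships without the "chr" prefix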

My sessionInfo() is as follows.

R version 3.6.2 (2019-12-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] ExomeDepth_1.1.13

loaded via a namespace (and not attached):
[1] matrixStats_0.55.0 lattice_0.20-38
[3] IRanges_2.20.1 Rsamtools_2.2.1
[5] Biostrings_2.54.0 GenomicAlignments_1.22.1
[7] bitops_1.0-6 grid_3.6.2
[9] GenomeInfoDb_1.22.0 stats4_3.6.2
[11] magrittr_1.5 zlibbioc_1.32.0
[13] XVector_0.26.0 S4Vectors_0.24.1
[15] Matrix_1.2-18 aod_1.3.1
[17] BiocParallel_1.20.1 tools_3.6.2
[19] Biobase_2.46.0 RCurl_1.95-4.12
[21] DelayedArray_0.12.1 parallel_3.6.2
[23] compiler_3.6.2 BiocGenerics_0.32.0
[25] GenomicRanges_1.38.0 SummarizedExperiment_1.16.1
[27] GenomeInfoDbData_1.2.2

Thank you very much for your help!

vplagnol commented 4 years ago

mmm, I recently updated ExomeDepth to 1.1.13 to deal with a new R compatibility issue, and something odd may be happening with tibbles and data frames. It could be that you need to update your dplyr package? Some incompatibility there?

For now, I have created a version 1.1.14 that does return(data.frame(exon_count_frame)) in getBamCounts, so ideally the tibble issue should not bother you anymore. Can you please try devtools::install_github("vplagnol/ExomeDepth") and tell me what happens?
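After reinstalling, a quick way to confirm the fix (a sketch based on the script earlier in the thread, not part of the original reply) is to re-run the counting step and check the class of the result:

## After updating to 1.1.14, re-run the counting step and check the returned class.
library(ExomeDepth)
ExomeCount <- getBamCounts(bed.frame = exons.hg19,
                           bam.files = bamFile,
                           include.chr = TRUE,
                           referenceFasta = hg19)
class(ExomeCount)  # expected to be a plain "data.frame" from 1.1.14 onwards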

nelsonchanhk commented 4 years ago

Dear vplagnol,

Thank you for your swift reply! I just tried 1.1.14 and I can now proceed with the data frame step without the previous error, though the result is not an S4 object. You can see the output below.

ExomeCount.dafr <- as(ExomeCount, 'data.frame')
typeof(ExomeCount)
[1] "list"
typeof(ExomeCount.dafr)
[1] "list"

I shall try proceeding with the rest of the processing to see if it can work on my dataset.

Nevertheless, the >50 warnings at the end of getBamCounts still appear. Is this normal, or is it related to the chr notation in my BAM? I used hg19 for my BAM and FASTA, and both use the chr notation (chr1, chr2, ..., chrX, chrY).

Thank you very much in advance and wish you a happy New Year!

Now parsing ~/OPERATION/RedCellNGS_hg19/BAM_testing/Normal9.bam
Parsing chromosome chr1
Parsing chromosome chr10
Parsing chromosome chr11
Parsing chromosome chr12
Parsing chromosome chr13
Parsing chromosome chr14
Parsing chromosome chr15
Parsing chromosome chr16
Parsing chromosome chr17
Parsing chromosome chr18
Parsing chromosome chr19
Parsing chromosome chr2
Parsing chromosome chr20
Parsing chromosome chr21
Parsing chromosome chr22
Parsing chromosome chr3
Parsing chromosome chr4
Parsing chromosome chr5
Parsing chromosome chr6
Parsing chromosome chr7
Parsing chromosome chr8
Parsing chromosome chr9
Number of counted fragments : 836974
There were 50 or more warnings (use warnings() to see the first 50)

warnings()
Warning messages:
1: In .Seqinfo.mergexy(x, y) : Each of the 2 combined objects has sequence levels not in the other:

  • in 'x': 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22
  • in 'y': chrX, chrY, chr6_ssto_hap7, chr6_mcf_hap5, chr6_cox_hap2, chr6_mann_hap4, chr6_apd_hap1, chr6_qbl_hap6, chr6_dbb_hap3, chr17_ctg5_hap1, chr4_ctg9_hap1, chr1_gl000192_random, chrUn_gl000225, chr4_gl000194_random, chr4_gl000193_random, chr9_gl000200_random, chrUn_gl000222, chrUn_gl000212, chr7_gl000195_random, chrUn_gl000223, chrUn_gl000224, chrUn_gl000219, chr17_gl000205_random, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chr9_gl000199_random, chrUn_gl000211, chrUn_gl000213, chrUn_gl000220, chrUn_gl000218, chr19_gl000209_random, chrUn_gl000221, chrUn_gl000214, chrUn_gl000228, chrUn_gl000227, chr1_gl000191_random, chr19_gl000208_random, chr9_gl000198_random, chr17_gl000204_random, chrUn_gl000233, chrUn_gl000237, chrUn_gl000230, chrUn_gl000242, chrUn_gl000243, chrUn_gl000241, chrUn_gl000236, chrUn_gl000240, chr17_gl0 [... truncated]
2: In .Seqinfo.mergexy(x, y) : Each of the 2 combined objects has sequence levels not in the other:
  • in 'x': 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22
  • in 'y': chrX, chrY, chr6_ssto_hap7, chr6_mcf_hap5, chr6_cox_hap2, chr6_mann_hap4, chr6_apd_hap1, chr6_qbl_hap6, chr6_dbb_hap3, chr17_ctg5_hap1, chr4_ctg9_hap1, chr1_gl000192_random, chrUn_gl000225, chr4_gl000194_random, chr4_gl000193_random, chr9_gl000200_random, chrUn_gl000222, chrUn_gl000212, chr7_gl000195_random, chrUn_gl000223, chrUn_gl000224, chrUn_gl000219, chr17_gl000205_random, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chr9_gl000199_random, chrUn_gl000211, chrUn_gl000213, chrUn_gl000220, chrUn_gl000218, chr19_gl000209_random, chrUn_gl000221, chrUn_gl000214, chrUn_gl000228, chrUn_gl000227, chr1_gl000191_random, chr19_gl000208_random, chr9_gl000198_random, chr17_gl000204_random, chrUn_gl000233, chrUn_gl000237, chrUn_gl000230, chrUn_gl000242, chrUn_gl000243, chrUn_gl000241, chrUn_gl000236, chrUn_gl000240, chr17_gl0 [... truncated]
3: In .Seqinfo.mergexy(x, y) : Each of the 2 combined objects has sequence levels not in the other:
  • in 'x': 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22
  • in 'y': chrX, chrY, chr6_ssto_hap7, chr6_mcf_hap5, chr6_cox_hap2, chr6_mann_hap4, chr6_apd_hap1, chr6_qbl_hap6, chr6_dbb_hap3, chr17_ctg5_hap1, chr4_ctg9_hap1, chr1_gl000192_random, chrUn_gl000225, chr4_gl000194_random, chr4_gl000193_random, chr9_gl000200_random, chrUn_gl000222, chrUn_gl000212, chr7_gl000195_random, chrUn_gl000223, chrUn_gl000224, chrUn_gl000219, chr17_gl000205_random, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chr9_gl000199_random, chrUn_gl000211, chrUn_gl000213, chrUn_gl000220, chrUn_gl000218, chr19_gl000209_random, chrUn_gl000221, chrUn_gl000214, chrUn_gl000228, chrUn_gl000227, chr1_gl000191_random, chr19_gl000208_random, chr9_gl000198_random, chr17_gl000204_random, chrUn_gl000233, chrUn_gl000237, chrUn_gl000230, chrUn_gl000242, chrUn_gl000243, chrUn_gl000241, chrUn_gl000236, chrUn_gl000240, chr17_gl0 [... truncated]
4: In .Seqinfo.mergexy(x, y) : Each of the 2 combined objects has sequence levels not in the other:
  • in 'x': 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22
  • in 'y': chrX, chrY, chr6_ssto_hap7, chr6_mcf_hap5, chr6_cox_hap2, chr6_mann_hap4, chr6_apd_hap1, chr6_qbl_hap6, chr6_dbb_hap3, chr17_ctg5_hap1, chr4_ctg9_hap1, chr1_gl000192_random, chrUn_gl000225, chr4_gl000194_random, chr4_gl000193_random, chr9_gl000200_random, chrUn_gl000222, chrUn_gl000212, chr7_gl000195_random, chrUn_gl000223, chrUn_gl000224, chrUn_gl000219, chr17_gl000205_random, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chr9_gl000199_random, chrUn_gl000211, chrUn_gl000213, chrUn_gl000220, chrUn_gl000218, chr19_gl000209_random, chrUn_gl000221, chrUn_gl000214, chrUn_gl000228, chrUn_gl000227, chr1_gl000191_random, chr19_gl000208_random, chr9_gl000198_random, chr17_gl000204_random, chrUn_gl000233, chrUn_gl000237, chrUn_gl000230, chrUn_gl000242, chrUn_gl000243, chrUn_gl000241, chrUn_gl000236, chrUn_gl000240, chr17_gl0 [... truncated]
vplagnol commented 4 years ago

OK, so the warnings I see suggest a clash between the "chr" naming convention, probably used in your BAM file, and the "1..22" convention, which I think is the default I use. Can you check what the output data frames look like? Do they have reasonable counts in them? Have you looked at the "include.chr" option in getBamCounts, which is meant to deal with that complication?
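For reference, a manual alternative to include.chr = TRUE (a sketch, not from the thread; it assumes the BAM and FASTA both use "chr"-style names) is to add the prefix to the target data frame before counting:

## Prefix the target chromosomes with "chr" so they match a "chr"-style hg19 BAM.
exons.hg19.chr <- exons.hg19
exons.hg19.chr$chromosome <- paste0("chr", exons.hg19.chr$chromosome)

ExomeCount <- getBamCounts(bed.frame = exons.hg19.chr,
                           bam.files = bamFile,
                           referenceFasta = hg19)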

nelsonchanhk commented 4 years ago

Hello vplagnol, your quick response is much appreciated. I have attached screenshots of the data frame for reference. "include.chr" was set to TRUE. I think the counts are reasonable in the regions with coverage.

Nevertheless, as I used a custom capture panel, a lot of the genes had no coverage at all. Do you think that would affect the algorithm when calling CNVs?

Furthermore, do you think the warnings at the end of processing are safe to ignore?

Thank you. Have a nice day and all the best in the new year!

(screenshots of the count data frame attached)

nelsonchanhk commented 4 years ago

Hello vplagnol, sorry for another issue. When I was trying to build my reference set from my samples, I encountered the following problem.

my.choice <- select.reference.set(test.counts = my.test, reference.counts = my.reference.set, bin.length = (ExomeCount.dafr$end - ExomeCount.dafr$start)/1000, n.bins.reduced = 1000)
Optimization of the choice of aggregate reference set, this process can take some time
Error in ExomeCount.dafr$end - ExomeCount.dafr$start :
  non-numeric argument to binary operator

It looks like another data frame problem. After changing the columns to numeric, the step worked, but there were three warnings at the end. The reference set built from the sample data also produces the same 3 warnings at the end of processing. Is this normal?

sapply(ExomeCount.dafr, mode)
   seqnames       start         end        exon          GC     HA1.bam
"character"   "numeric"   "numeric" "character"   "numeric"   "numeric"
    HA2.bam     HA3.bam     HA4.bam     HA5.bam Normal1.bam Normal2.bam
  "numeric"   "numeric"   "numeric"   "numeric"   "numeric"   "numeric"
Normal3.bam  chromosome
  "numeric" "character"

my.choice <- select.reference.set(test.counts = my.test, reference.counts = my.reference.set, bin.length = (ExomeCount.dafr$end - ExomeCount.dafr$start)/1000, n.bins.reduced = 1000)
Optimization of the choice of aggregate reference set, this process can take some time
Number of selected bins: 1000
Warning messages:
1: In model.matrix.default(mt, mfb, contrasts) : non-list contrasts argument ignored
2: In model.matrix.default(mt, mfb, contrasts) : non-list contrasts argument ignored
3: In model.matrix.default(mt, mfb, contrasts) : non-list contrasts argument ignored
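For reference, the coercion described above amounts to something like this (a sketch of the workaround, not the exact code used here):

## Coerce the coordinate columns back to numeric before computing bin.length.
ExomeCount.dafr$start <- as.numeric(ExomeCount.dafr$start)
ExomeCount.dafr$end   <- as.numeric(ExomeCount.dafr$end)
bin.length <- (ExomeCount.dafr$end - ExomeCount.dafr$start) / 1000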

Furthermore, at the end of ExomeDepth CNV calling, there are two warning messages.

all.exons <- new('ExomeDepth', test = my.test, reference = my.reference.selected, formula = 'cbind(test, reference) ~ 1')
Now fitting the beta-binomial model on a data frame with 185130 rows : this step can take a few minutes.
Now computing the likelihood for the different copy number states
Warning messages:
1: In aod::betabin(data = data.for.fit, formula = as.formula(formula), :
  The data set contains at least one line with weight = 0.
2: In model.matrix.default(mt, mfb, contrasts) : non-list contrasts argument ignored

Are these warnings safe to ignore?

Thank you very much!

vplagnol commented 4 years ago

Oops, sorry, in my rewrite of the code I stupidly converted start/end to character. Now fixed in 1.1.15. Thank you for picking this up!

To your earlier question, if you use a custom panel, I would restrict the BED file to the covered regions to avoid all the 0s. Not sure how much it really matters, but it may, so play it safe...
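One way to get a similar effect (a sketch, not from the thread; it trims the count data frame rather than the BED itself and assumes the count columns end in .bam, as in the screenshots above) is to drop target bins with zero counts across all samples before selecting the reference set:

## Keep only targets with at least one read across all samples.
count.cols <- grep("\\.bam$", colnames(ExomeCount.dafr), value = TRUE)
covered <- rowSums(ExomeCount.dafr[, count.cols, drop = FALSE]) > 0
ExomeCount.dafr <- ExomeCount.dafr[covered, ]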

The statement "non-list contrasts argument ignored" concerns me... but I am not sure where it comes from. It must be from: mod <- aod::betabin( data = data.for.fit, formula = as.formula(formula), random = ~ 1, link = 'logit', warnings = FALSE) perhaps because it does not want to correct for GC content? I am not sure but it is worrying.

nelsonchanhk commented 4 years ago

That's perfect @vplagnol , I've got it working now.

Just a couple of quick questions. I'm working on a germline targeted capture panel of about 7 Mb, with 10 normal controls and 160 samples (I presume about 20-30% of them carry CNVs in different genes). The samples were run on two occasions following the same protocol and on the same sequencer.

  1. For n.bins.reduced, should I still use 10000? How can I vary it to improve sensitivity/specificity?
  2. Shall I use the 10 normal samples as the reference set, or should I use all 170 BAMs (170-1) as controls?

Your support is truly appreciated! Wish you another prosperous year ahead!

vplagnol commented 4 years ago

So for question 1, n.bins.reduced is useful when you have more than 10K bins with non-zero count, to speed things up. If you have more, downsampling should be fine, and if you have fewer, nothing should happen. So I would leave that parameter as is.

Re question 2 and the choice of reference samples, much of the accuracy of ExomeDepth depends on tight correlations between the test and reference samples. So in general it is best to have more BAM files, in order to find the most closely technically matched samples (usually the ones run in the same sequencing batch). The only exception is when you expect many CNVs in the same locations across your cases, which would obviously make it difficult to distinguish CNVs in the test sample (as the reference samples would be similarly affected). So the general answer is to NOT restrict, unless you have a very specific and well defined genetic cause of disease and you always look for relatively common CNVs in the same few genes/exons.
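In practice, the "do not restrict" strategy could look like the following sketch (not from the thread; it assumes the count columns end in .bam), looping over each test sample and offering every other BAM as a candidate reference:

## For each test sample, offer all remaining BAM columns as candidate references.
bam.cols <- grep("\\.bam$", colnames(ExomeCount.dafr), value = TRUE)
chosen.refs <- list()
for (test.sample in bam.cols) {
  my.test <- ExomeCount.dafr[[test.sample]]
  candidate.refs <- setdiff(bam.cols, test.sample)  # everything except the test itself
  my.choice <- select.reference.set(
    test.counts = my.test,
    reference.counts = as.matrix(ExomeCount.dafr[, candidate.refs]),
    bin.length = (ExomeCount.dafr$end - ExomeCount.dafr$start) / 1000)
  chosen.refs[[test.sample]] <- my.choice$reference.choice
}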

Thank you for the kind words, and please let me know if you find what you expect to see; with the recent changes, some confirmation that nothing is broken would be good.

nelsonchanhk commented 4 years ago

@vplagnol That's very clear and helpful!

I've tried it on my samples: known deletions of about 20 kb and 3-4 kb were not detected. I did an exploratory analysis on samples for which no SNVs/small indels could account for the phenotype. Unfortunately, no particularly relevant CNVs were detected.

On the other hand, at the reference-building stage I set up my variables similarly to the way you did in the tutorial, except that I used my own primary-target BED. When I look through the variable my.choice, i.e. the output of select.reference.set, only about the top 10-odd samples of my set of 176 show meaningful values and the rest are all labelled NA. Is this normal, or is there a problem with my script?

Furthermore, can ExomeDepth detect deletions of 20 kb and 3-4 kb? How should I optimise the bin size to detect CNVs of those sizes?

Sorry for the many questions, but you have really guided me through using ExomeDepth and I'm much obliged to you.

my.ref.samples <- c('X11H28680.2.bam', 'X12H28270.5.bam', 'X12H28309.8.bam', 'X12H28470.7.bam', 'X12H28618.7.bam', 'X12H28723.6.bam', 'X12H28754.2.bam', 'X13H28294.9.bam', 'X13H28372.0.bam', 'X13H28504.5.bam', 'X13H28587.0.bam', 'X13H28619.4.bam', 'X13H28944.1.bam', 'X13H28944.1R.bam', 'X14H0289496.bam', 'X14H28307.4.bam', 'X14H28348.3.bam', 'X14H28384.7.bam', 'X14H28448.4.bam', 'X14H28534.4.bam', 'X14H28539.9.bam', 'X14H28622.8.bam', 'X14H28625.9.bam', 'X14H28626.6.bam', 'X14H28698.1.bam', 'X14H28699.8.bam', 'X14H28844.0.bam', 'X14H28947.2.bam', 'X14H28999.1.bam', 'X14H29009.2.bam', 'X15H28027.7.bam', 'X15H28179.7.bam', 'X15H28244.4.bam', 'X15H28299.4.bam', 'X15H28324.9.bam', 'X15H28349.0.bam', 'X15H28381.6.bam', 'X15H28452.5.bam', 'X15H28458.7.bam', 'X15H28547.8.bam', 'X15H28548.5.bam', 'X15H28571.5.bam', 'X15H28577.7.bam', 'X15H28691.2.bam', 'X15H28714.0.bam', 'X15H28715.7.bam', 'X15H28823.7.bam', 'X15H28908.7.bam', 'X15H28910.4.bam', 'X15H29145.7.bam', 'X16H28053.8.bam', 'X16H28186.9.bam', 'X16H28202.8.bam', 'X16H28241.3.bam', 'X16H28242.0.bam', 'X16H28304.3.bam', 'X16H28327.0.bam', 'X16H28379.9.bam', 'X16H28380.9.bam', 'X16H28411.6.bam', 'X16H28412.3.bam', 'X16H28462.8.bam', 'X16H28464.2.bam', 'X16H28488.6.bam', 'X16H28517.9.bam', 'X16H28540.9.bam', 'X16H28622.8.bam', 'X16H28647.9.bam', 'X16H28652.7.bam', 'X16H28656.5.bam', 'X16H28677.8.bam', 'X16H28706.1.bam', 'X16H28732.2.bam', 'X16H28786.5.bam', 'X16H28787.2.bam', 'X16H28818.9.bam', 'X16H28820.6.bam', 'X16H29092.0.bam', 'X16H29122.0.bam', 'X16H29190.7.bam', 'X16H29240.3.bam', 'X16H29325.3.bam', 'X16H29327.7.bam', 'X17H027436.0.bam', 'X17H027437.7.bam', 'X17H027439.1.bam', 'X17H028279.8.bam', 'X17H028305.0.bam', 'X17H028363.4.bam', 'X17H028450.1.bam', 'X17H028452.5.bam', 'X17H028465.9.bam', 'X17H028514.8.bam', 'X17H028529.6.bam', 'X17H028531.3.bam', 'X17H028589.4.bam', 'X17H028623.5.bam', 'X17H028626.6.bam', 'X17H028669.9.bam', 'X17H028707.8.bam', 'X17H028708.5.bam', 'X17H028710.2.bam', 'X17H028733.9.bam', 'X17H028736.0.bam', 'X17H028737.7.bam', 'X17H028763.8.bam', 'X17H028769.0.bam', 'X17H028776.2.bam', 'X17H028827.5.bam', 'X17H028832.3.bam', 'X17H028861.5.bam', 'X17H028883.5.bam', 'X17H029017.1.bam', 'X17H029119.6.bam', 'X17H029120.6.bam', 'X17H029144.0.bam', 'X17H029179.4.bam', 'X17H029267.8.bam', 'X17H029322.2.bam', 'X17H029354.5.bam', 'X17H029403.4.bam', 'X17H029472.8.bam', 'X17H28050.7.bam', 'X17H28052.1.bam', 'X17H28102.7.bam', 'X17H28120.9.bam', 'X17H286651.bam', 'X18H028007.1.bam', 'X18H028197.9.bam', 'X18H028461.1.bam', 'X18H028488.6.bam', 'X18H028593.5.bam', 'X19H21103.bam', 'HA1.bam', 'HA10.bam', 'HA11.bam', 'HA12.bam', 'HA13.bam', 'HA14.bam', 'HA15.bam', 'HA16.bam', 'HA17.bam', 'HA18.bam', 'HA19.bam', 'HA2.bam', 'HA20.bam', 'HA21.bam', 'HA22.bam', 'HA23.bam', 'HA3.bam', 'HA4.bam', 'HA5.bam', 'HA6.bam', 'HA7.bam', 'HA8.bam', 'HA9.bam', 'HBD1.bam', 'HBD2.bam', 'HBD3.bam', 'HBD4.bam', 'HBD5.bam', 'HBD6.bam', 'HBD7.bam', 'HBD8.bam', 'Normal1.bam', 'Normal10.bam', 'Normal2.bam', 'Normal3.bam', 'Normal4.bam', 'Normal5.bam', 'Normal6.bam', 'Normal7.bam', 'Normal8.bam', 'Normal9.bam', 'RA35.bam') my.reference.set <- as.matrix(ExomeCount.dafr[, my.ref.samples])

my.test <- ExomeCount.dafr$UW684347.bam

my.choice <- select.reference.set(test.counts = my.test, reference.counts = my.reference.set, bin.length = (ExomeCount.dafr$end - ExomeCount.dafr$start)/1000)
Optimization of the choice of aggregate reference set, this process can take some time
Number of selected bins: 2008
There were 14 warnings (use warnings() to see them)

my.matrix <- as.matrix(ExomeCount.dafr[, my.choice$reference.choice, drop = FALSE])
my.reference.selected <- apply(X = my.matrix, MAR = 1, FUN = sum)

all.exons <- new('ExomeDepth', test = my.test, reference = my.reference.selected, formula = 'cbind(test, reference) ~ 1')
Now fitting the beta-binomial model on a data frame with 2232 rows : this step can take a few minutes.
Now computing the likelihood for the different copy number states
Warning message:
In model.matrix.default(mt, mfb, contrasts) : non-list contrasts argument ignored

all.exons <- CallCNVs(x = all.exons, transition.probability = 10^-4, chromosome = ExomeCount.dafr$chromosome, start = ExomeCount.dafr$start, end = ExomeCount.dafr$end, name = ExomeCount.dafr$exon)
Correlation between reference and tests count is 0.99962
To get meaningful result, this correlation should really be above 0.97. If this is not the case, consider the output of ExomeDepth as less reliable (i.e. most likely a high false positive rate)
Number of calls for chromosome 2 : 1
Number of calls for chromosome 6 : 1
Number of calls for chromosome 17 : 1

all.exons <- AnnotateExtra(x = all.exons, reference.annotation = Conrad.hg19.common.CNVs, min.overlap = 0.5, column.name = 'Conrad.hg19')
exons.hg19.GRanges <- GenomicRanges::GRanges(seqnames = exons.hg19$chromosome, IRanges::IRanges(start = exons.hg19$start, end = exons.hg19$end), names = exons.hg19$name)
all.exons <- AnnotateExtra(x = all.exons, reference.annotation = exons.hg19.GRanges, min.overlap = 0.0001, column.name = 'exons.hg19')

head(all.exons@CNV.calls[order(all.exons@CNV.calls$BF, decreasing = TRUE), ])
  start.p end.p        type nexons     start       end chromosome
2     544   549 duplication      6  32520752  32552155          6
1     250   250    deletion      1 175547502 175547644          2
3    1925  1926    deletion      2  46115033  46115428         17
                        id    BF reads.expected reads.observed reads.ratio
2   chr6:32520752-32552155 21.20            369            589       1.600
1 chr2:175547502-175547644  5.82             58             22       0.379
3  chr17:46115033-46115428  5.06            190            125       0.658
                                      Conrad.hg19
2 CNVR2845.21,CNVR2845.25,CNVR2845.24,CNVR2845.5
1
3
                                               exons.hg19
2 HLA-DRB1_6,HLA-DRB1_5,HLA-DRB1_4,HLA-DRB1_3,HLA-DRB1_2
1
3                                         COPZ2_2,COPZ2_1

my.choice
$reference.choice
[1] "X17H028861.5.bam" "HA6.bam"          "X14H28625.9.bam"  "X19H21103.bam"
[5] "X15H28349.0.bam" "X14H28698.1.bam"

$summary.stats
                      ref.samples correlations expected.BF          phi
X17H028861.5.bam X17H028861.5.bam    0.9696150    4.306852 0.0010101030
HA6.bam                   HA6.bam    0.9684568    6.261057 0.0006421234
X14H28625.9.bam   X14H28625.9.bam    0.9683212    6.924948 0.0005043947
X19H21103.bam       X19H21103.bam    0.9681989    7.331545 0.0004154194
X15H28349.0.bam   X15H28349.0.bam    0.9678799    7.440957 0.0003745715
X14H28698.1.bam   X14H28698.1.bam    0.9678143    7.513889 0.0003396163
Normal5.bam           Normal5.bam    0.9676507    7.473663 0.0003132495
X18H028488.6.bam X18H028488.6.bam    0.9672575    7.398554 0.0002966433
X15H28547.8.bam   X15H28547.8.bam    0.9667066    7.377088 0.0002709542
X17H029322.2.bam X17H029322.2.bam    0.9661038    7.349389 0.0002555399
HA2.bam                   HA2.bam    0.9659973    7.244257 0.0002428891
Normal8.bam           Normal8.bam    0.9659702    7.208335 0.0002312084
X17H028707.8.bam X17H028707.8.bam    0.9658821    7.243164 0.0002159127
HBD1.bam                 HBD1.bam    0.9657609          NA 0.0002010216
HA17.bam                 HA17.bam    0.9653922          NA           NA
X14H28448.4.bam   X14H28448.4.bam    0.9653091          NA           NA
Normal7.bam           Normal7.bam    0.9652204          NA           NA
X16H28820.6.bam   X16H28820.6.bam    0.9651466          NA           NA
HA11.bam                 HA11.bam    0.9650629          NA           NA
X17H028529.6.bam X17H028529.6.bam    0.9650061          NA           NA
X13H28294.9.bam   X13H28294.9.bam    0.9647416          NA           NA
X13H28504.5.bam   X13H28504.5.bam    0.9647169          NA           NA
X14H28384.7.bam   X14H28384.7.bam    0.9646008          NA           NA
X14H28622.8.bam   X14H28622.8.bam    0.9639411          NA           NA
X16H28517.9.bam   X16H28517.9.bam    0.9638205          NA           NA
HA21.bam                 HA21.bam    0.9637785          NA           NA
X15H28691.2.bam   X15H28691.2.bam    0.9637432          NA           NA
X15H28714.0.bam   X15H28714.0.bam    0.9632372          NA           NA
X15H28452.5.bam   X15H28452.5.bam    0.9631393          NA           NA

vplagnol commented 4 years ago

At first sight, the correlation level that you find (0.99962) seems too high. High is good, but that degree of correlation suggests that something is wrong, as if your reference sample exactly predicted the test sample. Odd... Are you sure that the duplicate(s) of the test are NOT included in the potential reference set? I am really worried about this message: "In model.matrix.default(mt, mfb, contrasts) : non-list contrasts argument ignored",

which I have not seen before. Also, that 0.99962 does not seem to match the numbers I see in the summary stats table, which are more realistic. Such a super-high correlation would also explain why you see so few CNV calls: basically, the match is nearly perfect. I am a bit concerned...
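One simple check (a sketch, not part of the original reply) is to confirm that the test sample, and any re-run of it, is absent from the candidate reference set:

## Sanity check: the test sample should not appear among the candidate references.
test.sample <- "UW684347.bam"
test.sample %in% my.ref.samples                  # should be FALSE
grep("UW684347", my.ref.samples, value = TRUE)   # should return character(0)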

nelsonchanhk commented 4 years ago

@vplagnol That's a good suggestion. I've checked: they were not duplicates, and the test sample was excluded from the reference set. Could it be that, because I have more than 170 BAMs, the algorithm was able to select a BAM with very high correlation?

Previously I used these BAMs with GATK, CNVkit, VisCap and several other programs, and the results appeared consistent across programs. I think the problem I'm now having might be related to the worrisome message. Is there any way to troubleshoot it?

I will also try several other reference set combinations to explore. Thank you and have a nice day!

nelsonchanhk commented 4 years ago

Hello @vplagnol, the problem was partly resolved.

In my BED file, there were several big chromosomal regions that were not separated into genes/exons, and they were mistakenly treated as one big exon. I've reworked my BED file and the correlation problem has apparently resolved. I can now also detect the deletions mentioned earlier, which had been buried in the large chromosomal regions.
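For reference, splitting oversized target regions into smaller bins could look like this sketch (not the exact fix used here; it assumes a BED-like data frame with chromosome/start/end/name columns, as in exons.hg19):

## Split any target longer than max.size into consecutive fixed-size bins, so that
## kilobase-scale deletions span several bins instead of being buried in one region.
split.targets <- function(bed, max.size = 500) {
  pieces <- lapply(seq_len(nrow(bed)), function(i) {
    starts <- seq(bed$start[i], bed$end[i], by = max.size)
    data.frame(chromosome = bed$chromosome[i],
               start = starts,
               end = pmin(starts + max.size - 1, bed$end[i]),
               name = paste0(bed$name[i], "_bin", seq_along(starts)),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, pieces)
}
## Example (hypothetical): my.targets.split <- split.targets(my.targets, max.size = 500)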

However, the message still persists: "In model.matrix.default(mt, mfb, contrasts) : non-list contrasts argument ignored"

vplagnol commented 4 years ago

OK, that's good news. The fact that you can see at least some of the CNVs you are meant to see tells me that the code is not broken; something good is happening. I'm still a bit unsure about that warning message, but I think I am comfortable enough to push the updated code to CRAN. I'll try to figure out what the message is telling me, but I'm really happy to hear things are now working, so thank you for this.

vplagnol commented 4 years ago

Uploading 1.1.15 now to CRAN. I will close the issue unless something else comes up (in which case re-open please).