lima1 / PureCN

Copy number calling and variant classification using targeted short read sequencing
https://bioconductor.org/packages/devel/bioc/html/PureCN.html
Artistic License 2.0
127 stars 32 forks source link

No intervals left after coverage checks. - from NormalDB.R #151

Closed yls2g13 closed 4 years ago

yls2g13 commented 4 years ago

I've attached my code and output log file below, would highly appreciate any guidance I can get on this please!

Output log file: INFO [2020-10-19 17:22:49] Loading PureCN 1.18.1... INFO [2020-10-19 17:23:08] Creating mapping bias database. INFO [2020-10-19 17:23:09] Processing variants 1 to 5000... INFO [2020-10-19 17:29:33] Processing variants 5001 to 10000... INFO [2020-10-19 17:37:13] Processing variants 10001 to 15000... INFO [2020-10-19 17:41:08] Creating normalDB. Assuming coverage files are GC-normalized. INFO [2020-10-19 17:41:52] 26 on-target bins with low coverage in all samples. WARN [2020-10-19 17:41:52] Allosome coverage missing, cannot determine sex. . . (the same lines occur for a while) . WARN [2020-10-19 17:41:52] Allosome coverage missing, cannot determine sex. INFO [2020-10-19 17:41:52] Processing on-target regions... INFO [2020-10-19 17:41:53] Removing 904 intervals with low coverage in normalDB. INFO [2020-10-19 17:41:53] Removing 1 intervals with zero coverage in more than 3% of normalDB. INFO [2020-10-19 17:41:54] Processing off-target regions... INFO [2020-10-19 17:41:54] Removing 14802 intervals with low coverage in normalDB. **FATAL [2020-10-19 17:41:54] No intervals left after coverage checks.

FATAL [2020-10-19 17:41:54]

FATAL [2020-10-19 17:41:54] This is most likely a user error due to invalid input data or

FATAL [2020-10-19 17:41:54] parameters (PureCN 1.18.1).**

My questions here are: 1) For the on-target regions, how can I find out what the 904 intervals with low coverage are? 2) How are low coverage reads defined? 3) I've tried a quite a few variations (with and without a normal-panel), but I keep getting the same error - how do I troubleshoot from here?

DIR=/software/R/4.0.2/bin
PURECN=/home/nicolesawyl/R/x86_64-pc-linux-gnu-library/4.0/PureCN/extdata
OUT_DIR=/home/nicolesawyl/PureCN
PON=/home/nicolesawyl/PureCN/MuTect2_PON.vcf.gz

#creates interval file
$DIR/Rscript $PURECN/IntervalFile.R \
--infile /home/nicolesawyl/PureCN/targetbed.sorted.bed.gz \
--fasta /resources/fdb/GRCh38_HPV/Sequence/BWAIndex/GCA_000001405.15_GRCh38_no_alt_analysis_set_HPV.fa \
--outfile $OUT_DIR/2_baits_hg38_intervals.txt \
--offtarget --genome hg38 --mintargetwidth 10 \
--export $OUT_DIR/2_baits_optimized_hg38.bed \
--mappability $OUT_DIR/GCA_000001405.15_GRCh38_no_alt_analysis_set_100.bw

#calculate and GC-normalize coverage from a list of BAM files 
$DIR/Rscript $PURECN/Coverage.R \
--outdir $OUT_DIR/Tumors \
--bam $OUT_DIR/tumors.list \
--intervals $OUT_DIR/2_baits_hg38_intervals.txt \
--cores 4

$DIR/Rscript $PURECN/Coverage.R \
--outdir $OUT_DIR/Normals \
--bam $OUT_DIR/normals.list \
--intervals $OUT_DIR/2_baits_hg38_intervals.txt \
--cores 4

#build a normal database for coverage normalization
$DIR/Rscript $PURECN/NormalDB.R \
--outdir $OUT_DIR \
--coveragefiles /home/nicolesawyl/PureCN/normal.loess.list \
--genome hg38 --normal_panel $PON
lima1 commented 4 years ago

This probably removed all off-target intervals. Is this maybe amplicon data? You can read a coverage file with PureCN::readCoverageFile in R and check.

yls2g13 commented 4 years ago

As far as I know, the off-target reads are 14802 (corresponds to grep FALSE 2_baits_hg38_intervals.txt |wc -l). I still don't know where the 904 intervals with low coverage in normalDB comes from.

This data comes from whole-exome sequencing.

What should I expect to find after reading in the coverage file? I've got it read into R: head(readfile,20)

GRanges object with 20 ranges and 5 metadata columns:
       seqnames          ranges strand |  coverage average.coverage    counts
          <Rle>       <IRanges>  <Rle> | <numeric>        <numeric> <numeric>
   [1]     chr1    10501-207117      * |         0                0         0
   [2]     chr1   258167-297419      * |         0                0         0
   [3]     chr1   586489-783453      * |         0                0         0
   [4]     chr1   783454-980418      * |         0                0         0
   [5]     chr1  980419-1177383      * |         0                0         0
   ...      ...             ...    ... .       ...              ...       ...
  [16]     chr1 2558841-2559233      * |  341753.9          869.603  2988.964
  [17]     chr1 2559234-2559626      * |  389896.8          992.104  3581.299
  [18]     chr1 2559627-2560020      * |  339601.6          861.933  2958.675
  [19]     chr1 2560610-2560729      * |   24867.2          207.226   308.919
  [20]     chr1 2561475-2561834      * |  255360.5          709.335  2072.403
       on.target duplication.rate
       <logical>        <numeric>
   [1]     FALSE               NA
   [2]     FALSE               NA
   [3]     FALSE               NA
   [4]     FALSE               NA
   [5]     FALSE               NA
   ...       ...              ...
  [16]      TRUE         0.393996
  [17]      TRUE         0.415762
  [18]      TRUE         0.431839
  [19]      TRUE         0.272358
  [20]      TRUE         0.336052

Also thank you for your patience, I'm still just starting out in bioinfo!

lima1 commented 4 years ago

Looks like you have indeed no counts in off-target bins (readfile$ontarget == FALSE). It's possible that this is amplicon, not hybrid capture whole-exome sequencing. Hybrid capture uses baits to capture exons of interest. During that step, random DNA attached to the captured DNA is extracted as well. These random, "off-target" reads provide additional information for free we can use. In your case, there is no such information. Amplicon assays use primers to amplify regions of interest, so there is no random capturing and no off-target DNA. Less likely is a removal of off-target reads in your pipeline.

Try running IntervalFile.R without --offtarget and start again. NormalDB.R should then finish successfully and generate a BED file with the 902 regions.

lima1 commented 4 years ago

Closing, I believe the developer version also gives a better warning here. Feel free to open a new issue when you run into more problems.