Closed yls2g13 closed 4 years ago
This probably removed all off-target intervals. Is this maybe amplicon data? You can read a coverage file with PureCN::readCoverageFile in R and check.
As far as I know, the off-target reads are 14802 (corresponds to grep FALSE 2_baits_hg38_intervals.txt |wc -l
). I still don't know where the 904 intervals with low coverage in normalDB
comes from.
This data comes from whole-exome sequencing.
What should I expect to find after reading in the coverage file? I've got it read into R:
head(readfile,20)
GRanges object with 20 ranges and 5 metadata columns:
seqnames ranges strand | coverage average.coverage counts
<Rle> <IRanges> <Rle> | <numeric> <numeric> <numeric>
[1] chr1 10501-207117 * | 0 0 0
[2] chr1 258167-297419 * | 0 0 0
[3] chr1 586489-783453 * | 0 0 0
[4] chr1 783454-980418 * | 0 0 0
[5] chr1 980419-1177383 * | 0 0 0
... ... ... ... . ... ... ...
[16] chr1 2558841-2559233 * | 341753.9 869.603 2988.964
[17] chr1 2559234-2559626 * | 389896.8 992.104 3581.299
[18] chr1 2559627-2560020 * | 339601.6 861.933 2958.675
[19] chr1 2560610-2560729 * | 24867.2 207.226 308.919
[20] chr1 2561475-2561834 * | 255360.5 709.335 2072.403
on.target duplication.rate
<logical> <numeric>
[1] FALSE NA
[2] FALSE NA
[3] FALSE NA
[4] FALSE NA
[5] FALSE NA
... ... ...
[16] TRUE 0.393996
[17] TRUE 0.415762
[18] TRUE 0.431839
[19] TRUE 0.272358
[20] TRUE 0.336052
Also thank you for your patience, I'm still just starting out in bioinfo!
Looks like you have indeed no counts in off-target bins (readfile$ontarget == FALSE). It's possible that this is amplicon, not hybrid capture whole-exome sequencing. Hybrid capture uses baits to capture exons of interest. During that step, random DNA attached to the captured DNA is extracted as well. These random, "off-target" reads provide additional information for free we can use. In your case, there is no such information. Amplicon assays use primers to amplify regions of interest, so there is no random capturing and no off-target DNA. Less likely is a removal of off-target reads in your pipeline.
Try running IntervalFile.R without --offtarget and start again. NormalDB.R should then finish successfully and generate a BED file with the 902 regions.
Closing, I believe the developer version also gives a better warning here. Feel free to open a new issue when you run into more problems.
I've attached my code and output log file below, would highly appreciate any guidance I can get on this please!
Output log file: INFO [2020-10-19 17:22:49] Loading PureCN 1.18.1... INFO [2020-10-19 17:23:08] Creating mapping bias database. INFO [2020-10-19 17:23:09] Processing variants 1 to 5000... INFO [2020-10-19 17:29:33] Processing variants 5001 to 10000... INFO [2020-10-19 17:37:13] Processing variants 10001 to 15000... INFO [2020-10-19 17:41:08] Creating normalDB. Assuming coverage files are GC-normalized. INFO [2020-10-19 17:41:52] 26 on-target bins with low coverage in all samples. WARN [2020-10-19 17:41:52] Allosome coverage missing, cannot determine sex. . . (the same lines occur for a while) . WARN [2020-10-19 17:41:52] Allosome coverage missing, cannot determine sex. INFO [2020-10-19 17:41:52] Processing on-target regions... INFO [2020-10-19 17:41:53] Removing 904 intervals with low coverage in normalDB. INFO [2020-10-19 17:41:53] Removing 1 intervals with zero coverage in more than 3% of normalDB. INFO [2020-10-19 17:41:54] Processing off-target regions... INFO [2020-10-19 17:41:54] Removing 14802 intervals with low coverage in normalDB. **FATAL [2020-10-19 17:41:54] No intervals left after coverage checks.
FATAL [2020-10-19 17:41:54]
FATAL [2020-10-19 17:41:54] This is most likely a user error due to invalid input data or
FATAL [2020-10-19 17:41:54] parameters (PureCN 1.18.1).**
My questions here are: 1) For the on-target regions, how can I find out what the 904 intervals with low coverage are? 2) How are low coverage reads defined? 3) I've tried a quite a few variations (with and without a normal-panel), but I keep getting the same error - how do I troubleshoot from here?