Closed drmrgd closed 3 years ago
Thanks Dave, I'll have a look later today. If you remove the header from your callable file, it doesn't fix it, right?
Yeah, that's correct, Markus. I've tried all combinations (with and without header, with and without chr
string) and all iterations seem to throw the same error.
Edit I think I may have figured this out. I think it could be that the extra columns in the Callstate output may be causing some consternation. The head of the Callstate file looks like:
$ head callable_no-chr_no-header.bed
1 13342 13350 CALLABLE 961.25 386.25 452.88
1 13448 13527 CALLABLE 3530.61 2250.28 1120.23
1 13508 13531 CALLABLE 3569.65 2083.22 1365.00
1 69404 69438 CALLABLE 847.03 572.65 272.65
1 69418 69497 CALLABLE 782.68 653.05 127.57
1 69498 69577 CALLABLE 932.62 843.80 86.20
1 69549 69628 CALLABLE 977.66 837.92 137.18
1 324759 324816 CALLABLE 959.33 698.81 257.56
1 664702 664724 CALLABLE 1110.91 641.82 406.77
1 664795 664801 CALLABLE 570.33 269.50 256.50
I did some digging, and I think the columns in CallableLoci
don't include the last 3. If I trim those cols out, it seems to work:
INFO [2021-05-27 10:40:29] Loading TxDb.Hsapiens.UCSC.hg19.knownGene...
Loading required package: org.Hs.eg.db
WARN [2021-05-27 10:40:37] Chromosome naming style of txdb file (UCSC) was different from interval file (Ensembl).
WARN [2021-05-27 10:40:46] Attempted adding gene symbols to intervals. Heuristics have been used to pick symbols for overlapping genes.
INFO [2021-05-27 10:40:46] Excluded region of size 12307bp.
INFO [2021-05-27 10:40:48] Total size of CDS region: 21.49Mb (27.93Mb input).
So, it seems like the extra columns may be the root cause of the issue.
Great, yes, rtracklayer is used to parse the BED files and can get quite easily confused.
I'll check out callstate. mosdepth is I think another one that could be a good alternative to GATK3. I didn't spend too much time on it since it doesn't make a huge difference for high coverage data, ie. sample to sample variance is low, so you might as well use the baits file directly.
It's definitely a nominal difference in at least one sample I've been playing with. So, maybe you're right and filtering to callable regions isn't completely necessary? Trying to make sure all of this is correct because I'm not getting the correct output for TMB in these samples, which I know to be TMB-high in most cases.
Also, to be sure, the resultant file from FilterCallableLoci.R
looks like this:
chr1 13448 13531 . 0 .
chr1 69404 69497 . 0 .
chr1 69498 69628 . 0 .
chr1 324759 324816 . 0 .
chr1 664702 664724 . 0 .
chr1 664795 664801 . 0 .
chr1 762160 762239 . 0 .
chr1 762250 762329 . 0 .
chr1 762410 762546 . 0 .
chr1 861321 861393 . 0 .
I do see a comment in the script stdout that suggests that there is a mapping to gene symbols. But, I don't see them in the output; is this correct or is there still an issue?
Should be fine. All it does is removing non-coding regions from the input BED file. Let me know if you find an issue here, but haven't touched that code in a long time and seems to work well.
Feel free to post a log file of the PureCN.R run for a high TMB sample that is called as low TMB. Might point you to setup issues in case there are some.
Thanks Markus! Happy to open a new issue if it's more appropriate (since this one is now resolved).
I just ran Dx.R, and did get one warning for the mutational signatures piece, but nothing for any other component:
$ Rscript $PURECN/Dx.R --rds MATCH-Z1D-B44E-TDNA.rds --callable MATCH-Z1D-B44E_callable_filtered_cds.bed --exclude /data/Essex/reference/hg19_simpleRepeats_noChr.bed --signatures --force
INFO [2021-05-27 13:07:17] Reading /gpfs/gsfs12/users/Essex/EAY131-Z1D/orig_hg19_bams/purecn_out_20210525/MATCH-Z1D-B44E-TDNA/MATCH-Z1D-B44E_callable_filtered_cds.bed...
INFO [2021-05-27 13:07:18] Reading /gpfs/gsfs12/users/Essex/reference/hg19_simpleRepeats_noChr.bed...
INFO [2021-05-27 13:07:20] Loading PureCN...
INFO [2021-05-27 13:07:20] Reading /gpfs/gsfs12/users/Essex/EAY131-Z1D/orig_hg19_bams/purecn_out_20210525/MATCH-Z1D-B44E-TDNA/MATCH-Z1D-B44E-TDNA.rds...
INFO [2021-05-27 13:07:22] Calling mutation burden...
INFO [2021-05-27 13:07:32] Calling chromosomal instability...
Loading required package: deconstructSigs
INFO [2021-05-27 13:07:42] deconstructSigs package found.
INFO [2021-05-27 13:07:42] Using signatures.exome.cosmic.v3.may2019 signature database.
WARN [2021-05-27 13:07:42] Not enough somatic calls to deconstruct signatures.
I'm a little surprised about that result as I'm quite certain there are a fairly large number of somatic calls in this specimen. Additionally, the result of the mutation burden output shows the somatic (and germline for that matter!) rate at 0, but my own TMB calculation for this specimen was 26.64 Mutations / Mb:
$ transpose.pl -d , MATCH-Z1D-B44E-TDNA_mutation_burden.csv
Sampleid,MATCH-Z1D-B44E-TDNA
somatic.ontarget,0
somatic.all,0
private.germline.ontarget,0
private.germline.all,0
callable.bases.ontarget,20268128
callable.bases.flanking,20286296
callable.bases.all,21034938
somatic.rate.ontarget,0
somatic.rate.ontarget.95.lower,0
somatic.rate.ontarget.95.upper,0.182003938322209
private.germline.rate.ontarget,0
private.germline.rate.ontarget.95.lower,0
private.germline.rate.ontarget.95.upper,0.182003938322209
I suspect there is an issue with my GATK4 Mutect2 VCF or something like that, but I'm not sure. The TMB calculation I made was from the MAF generated by the GDC somatic workflow with a few additional filters to remove non-coding alterations and the like. But, of course, what I'm calling and what PureCN calls somatic alterations could be different; I just didn't expect to be so different!
That is odd. Can you share the output (log file) of the corresponding PureCN.R run?
Sure, no problem! I'm sure I've got something set up incorrectly or something. Here's that log: MATCH-Z1D-B44E-TDNA.log
It's also interesting that I do see a TMB value > 0 for three samples in my cohort. All three are tumor only samples that have no paired normal. I built a normal DB previously, and ran these all through (27 in total) using that normal DB with the hope that I can get similar enough data from the tumor only to the paired set for analysis. Something I'm doing wrong, though, is causing the data for the paired samples to have issues compared to the tumor only samples. Maybe that's a clue?
No real red flags. The fraction of intervals with SNPs is low, you probably did not run Mutect with --interval_padding, right? But that does not explain the 0 germline/somatic rate.
When you overlap the somatic mutations in the MAF file with the Sampleid_variants.csv file, what do you get? Are these filtered out (present in input Mutect VCF, but not showing up in the variants file), or flagged?
No, I didn't run with interval padding when I ran Mutect2.
That's a good question. I'll work on that. One issue that I have is that the original TMB data I generated was from GRCh38.d1.vd1 data from GDC. But, I think I can't use that reference with PureCN, and so I had get the original b37 aligned data. Since I have only run Mutect2 on those and didn't run vcf2maf or anything like that, I don't have a super clean comparison set to answer the question.
Let me see what I can dig up. It's the usual problem of translating across these different reference genomes. I'm betting there's an issue with my variant calling and I need to refine this a bit.
There shouldn't be an issue with GRCh38, we use it now exclusively. I remember some related complaints a while ago, it might have been an issue where people used GRCh38 as genome version instead of hg38 (to get the Bioconductor annotation packages).
And yeah, add the interval padding if possible. Most SNPs are in the flanking regions and that helps the segmentation and LOH calling significantly.
Describe the issue When running FilterCallableLoci, I get the following error:
It seems like a possible mismatch between the chromosome names in TxDb.Hsapiens.UCSC.hg19.knownGene, but I looked at the internals and it seems like the names should match:
I've also tried it with and without the header, but keep getting the same error.
To Reproduce I used the 'Callstate' package to generate the callable regions BED file (this is a replacement for GATK CallableLoci since it is no longer available in GATK4), and then ran
FilterCallableLoci.R
according to the manual:I've attached a copy of the input BED file in case it's helpful. callable_fixed.bed.gz
I don't have any other logs or anything to provide. Also, the version I'm running is the latest dev version (v1.23.1). I did try with v1.21.(?) and got the same error, so I upgraded and tried again hoping it was a bug that was fixed. However that does not seem to change things.