lima1 / PureCN

Copy number calling and variant classification using targeted short read sequencing
https://bioconductor.org/packages/devel/bioc/html/PureCN.html
Artistic License 2.0
127 stars 32 forks source link

FilterCallableLoci Error with seqlevelsStyle #182

Closed drmrgd closed 3 years ago

drmrgd commented 3 years ago

Describe the issue When running FilterCallableLoci, I get the following error:

$ Rscript $PURECN/FilterCallableLoci.R --genome hg19 --infile callable_fixed.bed  --outfile callable_fixed_cds.bed  --exclude '^HLA'
INFO [2021-05-27 09:05:05] Loading TxDb.Hsapiens.UCSC.hg19.knownGene...
Error in seqlevelsStyle(sl) :
  The style does not have a compatible entry for the species supported by
  Seqname. Please see genomeStyles() for supported species/style
Calls: <Anonymous> -> seqlevelsStyle -> seqlevelsStyle
Execution halted

It seems like a possible mismatch between the chromosome names in TxDb.Hsapiens.UCSC.hg19.knownGene, but I looked at the internals and it seems like the names should match:

# From FilterCallableLoci.R
INFO [2021-05-27 09:02:49] Loading TxDb.Hsapiens.UCSC.hg19.knownGene...
 [1] chr1                  chr2                  chr3
 [4] chr4                  chr5                  chr6
 [7] chr7                  chr8                  chr9
[10] chr10                 chr11                 chr12
[13] chr13                 chr14                 chr15
[16] chr16                 chr17                 chr18
[19] chr19                 chr20                 chr21
[22] chr22                 chrX                  chrY
[25] chr1_gl000191_random  chr1_gl000192_random  chr4_ctg9_hap1
[28] chr4_gl000194_random  chr6_apd_hap1         chr6_cox_hap2
[31] chr6_dbb_hap3         chr6_mann_hap4        chr6_mcf_hap5
[34] chr6_qbl_hap6         chr6_ssto_hap7        chr7_gl000195_random
[37] chr9_gl000201_random  chr17_ctg5_hap1       chr17_gl000204_random
[40] chr19_gl000209_random chrUn_gl000212        chrUn_gl000213
[43] chrUn_gl000218        chrUn_gl000219        chrUn_gl000222
[46] chrUn_gl000223        chrUn_gl000228
93 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrUn_gl000249
Error:
Execution halted

# From my input file
$ cut -f1 callable_fixed.bed | sort -Vu
chr1
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr20
chr21
chr22
chrX
chrY
#chrom

I've also tried it with and without the header, but keep getting the same error.

To Reproduce I used the 'Callstate' package to generate the callable regions BED file (this is a replacement for GATK CallableLoci since it is no longer available in GATK4), and then ran FilterCallableLoci.R according to the manual:

Rscript $PURECN/FilterCallableLoci.R --genome hg19 --infile callable_fixed.bed  --outfile callable_fixed_cds.bed  --exclude '^HLA'

I've attached a copy of the input BED file in case it's helpful. callable_fixed.bed.gz

I don't have any other logs or anything to provide. Also, the version I'm running is the latest dev version (v1.23.1). I did try with v1.21.(?) and got the same error, so I upgraded and tried again hoping it was a bug that was fixed. However that does not seem to change things.

lima1 commented 3 years ago

Thanks Dave, I'll have a look later today. If you remove the header from your callable file, it doesn't fix it, right?

drmrgd commented 3 years ago

Yeah, that's correct, Markus. I've tried all combinations (with and without header, with and without chr string) and all iterations seem to throw the same error.

Edit I think I may have figured this out. I think it could be that the extra columns in the Callstate output may be causing some consternation. The head of the Callstate file looks like:

$ head callable_no-chr_no-header.bed
1       13342   13350   CALLABLE        961.25  386.25  452.88
1       13448   13527   CALLABLE        3530.61 2250.28 1120.23
1       13508   13531   CALLABLE        3569.65 2083.22 1365.00
1       69404   69438   CALLABLE        847.03  572.65  272.65
1       69418   69497   CALLABLE        782.68  653.05  127.57
1       69498   69577   CALLABLE        932.62  843.80  86.20
1       69549   69628   CALLABLE        977.66  837.92  137.18
1       324759  324816  CALLABLE        959.33  698.81  257.56
1       664702  664724  CALLABLE        1110.91 641.82  406.77
1       664795  664801  CALLABLE        570.33  269.50  256.50

I did some digging, and I think the columns in CallableLoci don't include the last 3. If I trim those cols out, it seems to work:

INFO [2021-05-27 10:40:29] Loading TxDb.Hsapiens.UCSC.hg19.knownGene...
Loading required package: org.Hs.eg.db

WARN [2021-05-27 10:40:37] Chromosome naming style of txdb file (UCSC) was different from interval file (Ensembl).
WARN [2021-05-27 10:40:46] Attempted adding gene symbols to intervals. Heuristics have been used to pick symbols for overlapping genes.
INFO [2021-05-27 10:40:46] Excluded region of size 12307bp.
INFO [2021-05-27 10:40:48] Total size of CDS region: 21.49Mb (27.93Mb input).

So, it seems like the extra columns may be the root cause of the issue.

lima1 commented 3 years ago

Great, yes, rtracklayer is used to parse the BED files and can get quite easily confused.

I'll check out callstate. mosdepth is I think another one that could be a good alternative to GATK3. I didn't spend too much time on it since it doesn't make a huge difference for high coverage data, ie. sample to sample variance is low, so you might as well use the baits file directly.

drmrgd commented 3 years ago

It's definitely a nominal difference in at least one sample I've been playing with. So, maybe you're right and filtering to callable regions isn't completely necessary? Trying to make sure all of this is correct because I'm not getting the correct output for TMB in these samples, which I know to be TMB-high in most cases.

Also, to be sure, the resultant file from FilterCallableLoci.R looks like this:

chr1    13448   13531   .       0       .
chr1    69404   69497   .       0       .
chr1    69498   69628   .       0       .
chr1    324759  324816  .       0       .
chr1    664702  664724  .       0       .
chr1    664795  664801  .       0       .
chr1    762160  762239  .       0       .
chr1    762250  762329  .       0       .
chr1    762410  762546  .       0       .
chr1    861321  861393  .       0       .

I do see a comment in the script stdout that suggests that there is a mapping to gene symbols. But, I don't see them in the output; is this correct or is there still an issue?

lima1 commented 3 years ago

Should be fine. All it does is removing non-coding regions from the input BED file. Let me know if you find an issue here, but haven't touched that code in a long time and seems to work well.

lima1 commented 3 years ago

Feel free to post a log file of the PureCN.R run for a high TMB sample that is called as low TMB. Might point you to setup issues in case there are some.

drmrgd commented 3 years ago

Thanks Markus! Happy to open a new issue if it's more appropriate (since this one is now resolved).

I just ran Dx.R, and did get one warning for the mutational signatures piece, but nothing for any other component:

$ Rscript $PURECN/Dx.R --rds MATCH-Z1D-B44E-TDNA.rds --callable MATCH-Z1D-B44E_callable_filtered_cds.bed --exclude /data/Essex/reference/hg19_simpleRepeats_noChr.bed --signatures --force
INFO [2021-05-27 13:07:17] Reading /gpfs/gsfs12/users/Essex/EAY131-Z1D/orig_hg19_bams/purecn_out_20210525/MATCH-Z1D-B44E-TDNA/MATCH-Z1D-B44E_callable_filtered_cds.bed...
INFO [2021-05-27 13:07:18] Reading /gpfs/gsfs12/users/Essex/reference/hg19_simpleRepeats_noChr.bed...
INFO [2021-05-27 13:07:20] Loading PureCN...
INFO [2021-05-27 13:07:20] Reading /gpfs/gsfs12/users/Essex/EAY131-Z1D/orig_hg19_bams/purecn_out_20210525/MATCH-Z1D-B44E-TDNA/MATCH-Z1D-B44E-TDNA.rds...
INFO [2021-05-27 13:07:22] Calling mutation burden...
INFO [2021-05-27 13:07:32] Calling chromosomal instability...
Loading required package: deconstructSigs
INFO [2021-05-27 13:07:42] deconstructSigs package found.
INFO [2021-05-27 13:07:42] Using signatures.exome.cosmic.v3.may2019 signature database.
WARN [2021-05-27 13:07:42] Not enough somatic calls to deconstruct signatures.

I'm a little surprised about that result as I'm quite certain there are a fairly large number of somatic calls in this specimen. Additionally, the result of the mutation burden output shows the somatic (and germline for that matter!) rate at 0, but my own TMB calculation for this specimen was 26.64 Mutations / Mb:

$ transpose.pl -d , MATCH-Z1D-B44E-TDNA_mutation_burden.csv
Sampleid,MATCH-Z1D-B44E-TDNA
somatic.ontarget,0
somatic.all,0
private.germline.ontarget,0
private.germline.all,0
callable.bases.ontarget,20268128
callable.bases.flanking,20286296
callable.bases.all,21034938
somatic.rate.ontarget,0
somatic.rate.ontarget.95.lower,0
somatic.rate.ontarget.95.upper,0.182003938322209
private.germline.rate.ontarget,0
private.germline.rate.ontarget.95.lower,0
private.germline.rate.ontarget.95.upper,0.182003938322209

I suspect there is an issue with my GATK4 Mutect2 VCF or something like that, but I'm not sure. The TMB calculation I made was from the MAF generated by the GDC somatic workflow with a few additional filters to remove non-coding alterations and the like. But, of course, what I'm calling and what PureCN calls somatic alterations could be different; I just didn't expect to be so different!

lima1 commented 3 years ago

That is odd. Can you share the output (log file) of the corresponding PureCN.R run?

drmrgd commented 3 years ago

Sure, no problem! I'm sure I've got something set up incorrectly or something. Here's that log: MATCH-Z1D-B44E-TDNA.log

It's also interesting that I do see a TMB value > 0 for three samples in my cohort. All three are tumor only samples that have no paired normal. I built a normal DB previously, and ran these all through (27 in total) using that normal DB with the hope that I can get similar enough data from the tumor only to the paired set for analysis. Something I'm doing wrong, though, is causing the data for the paired samples to have issues compared to the tumor only samples. Maybe that's a clue?

lima1 commented 3 years ago

No real red flags. The fraction of intervals with SNPs is low, you probably did not run Mutect with --interval_padding, right? But that does not explain the 0 germline/somatic rate.

When you overlap the somatic mutations in the MAF file with the Sampleid_variants.csv file, what do you get? Are these filtered out (present in input Mutect VCF, but not showing up in the variants file), or flagged?

drmrgd commented 3 years ago

No, I didn't run with interval padding when I ran Mutect2.

That's a good question. I'll work on that. One issue that I have is that the original TMB data I generated was from GRCh38.d1.vd1 data from GDC. But, I think I can't use that reference with PureCN, and so I had get the original b37 aligned data. Since I have only run Mutect2 on those and didn't run vcf2maf or anything like that, I don't have a super clean comparison set to answer the question.

Let me see what I can dig up. It's the usual problem of translating across these different reference genomes. I'm betting there's an issue with my variant calling and I need to refine this a bit.

lima1 commented 3 years ago

There shouldn't be an issue with GRCh38, we use it now exclusively. I remember some related complaints a while ago, it might have been an issue where people used GRCh38 as genome version instead of hg38 (to get the Bioconductor annotation packages).

lima1 commented 3 years ago

And yeah, add the interval padding if possible. Most SNPs are in the flanking regions and that helps the segmentation and LOH calling significantly.