Closed DarioS closed 5 years ago
Sorry. This was my mistake.
You are right that Strelka2 (and strelka1) does not provide an AD field, although most somatic point mutations do. We do use Strelka but we have some post processing steps in our pipeline prior prior to parsing in PURPLE.
We can make you a simple java process which annotates the AD field to the vcf if you like.
These are the fields in Strelka that are relevant (note - different conventions for INDELs and SNVs):
here is some example python code which shows the logic for our AD convention:
class StrelkaParser(object): def ad(self, allele_num, allele, call): if not hasattr(call.data, "TAR"): snp_field_name = "{}U".format(allele) return getattr(call.data, snp_field_name)[0] if hasattr(call.data, snp_field_name) else None elif allele_num == 0: return call.data.TAR[0] elif len(allele) > 1 or len(call.site.REF) > 1: return call.data.TIR[0] else: return None
I forgot about the existence of the somatic indels file. Do you typically merge the SNV and indels files into a single VCF file before inputting to PURPLE? The VCF header of the somatic file has ##content=strelka somatic snv calls
and the other file has ##content=strelka somatic indel calls
. Could those be flags to trigger automatic calculation of AD in the software without needing any permanent modification of the VCF files?
Yes. We do also merge both files together.
Let me get back to you about a proper solution.
Hi Dario,
As the Strelka fields are non-standard I have elected not to support them directly in PURPLE. Instead I have created a tool (located in the PURPLE jar) that will generate the AD fields for you.
Note that it annotates SNPs and INDELS but will not override any existing AD values.
Example Usage:
java -Xmx4G -cp purple.jar com.hartwig.hmftools.purple.tools.AnnotateStrelkaWithAllelicDepth -in strelka.vcf -out strelka.annotated.vcf
You will need to use the latest version of PURPLE (2. 2) to access the tool.
I have also updated the PURPLE readme to include a section on preparing Strelka output for PURPLE.
Regards, Jonathan
It's not straightforward because of Strelka's inflexible output, but it seems easier with the new tool provided.
Do we still have this feature in the latest versions of Purple (> 3.6). In my hands, Purple_v3.6 is the latest version that produces a vcf output :
_java -Xmx4G -cp purplev3.6.jar com.hartwig.hmftools.purple.tools.AnnotateStrelkaWithAllelicDepth -in $file -out $file.ad.vcf
shall I use any other recent versions (3.7, 3.8, 3.9), the error that I am getting is : Error: Could not find or load main class com.hartwig.hmftools.purple.tools.AnnotateStrelkaWithAllelicDepth Caused by: java.lang.ClassNotFoundException: com.hartwig.hmftools.purple.tools.AnnotateStrelkaWithAllelicDepth
This functionality no longer exists in Purple. I now only works with variant calling by Sage and Pave, which we recommend instead of Strelka and SnpEff.
Thank you for your prompt reply. I would love to use Hartwig software suite, however, I would need to show my colleagues comparisons between Sage and Strelka2 or Mutect2, or between Purple and other CNV callers, such as Facets or Sclust. Which articles shall I use in order to substantiate my preference towards Sage / Purple, Amber or Cobalt ? Thank you :)
That Strelka tool simply annotated an 'AD' field which is required by PURPLE
We don't have a formal comparison article for SAGE vs Strelka &Mutect, sorry, so you will need to do that yourself. We initially used a consensus caller which incorporated Strelka and Mutect. The main reason we built SAGE were to reduce the FP rate and to allow tiered calling. Some of the other features of Sage are highligheted in the intro on the github page: https://github.com/hartwigmedical/hmftools/tree/master/sage#somatic-alterations-in-genome-sage
For PURPLE, we performed some copy number validations on an early version here: https://www.nature.com/articles/s41586-019-1689-y (see supplementary information section 10)
Thank you so much. It has been very informative. On a separate note, may I ask for a suggestion regarding variant filtering ? I have used Purple to add AD to a Strelka2 VCF file, and a bcftools plugin to add VAF. Subsequently, the VCF file has the following parameters :
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR
chr1 54720 . C CTT . PASS IC=5;IHP=6;MQ=54.6;MQ0=1;NT=ref;QSI=40;QSI_NT=40;RC=3;RU=T;SGT=ref>het; SOMATIC;SomaticEVS=17.29;TQSI=2;TQSI_NT=2
AD:BCN50:DP:DP2:DP50:FDP50:SUBDP50:TAR:TIR:TOR:VAF
34,0:0.06:29:29:36.35:2.57:0:34,40:0,0:4,0:0 [ NORMAL]
3,3:0:3:3:4.56:0:0:3,4:3,3:0,1:0.5 [TUMOR]
I would like to filter the VCF file based on AD and VAF fields in the "FORMAT" section of TUMOR and NORMAL.
In the example above, in the NORMAL sample, AD is : 34,0 : where 34 is the AD of the REF allele, and 0 is the AD of the ALT allele. In the TUMOR sample , AD is 3,3, where 3 is the AD of the REF allele, and 3 is the AD of the ALT allele.
Talking about VAF, in the NORMAL sample : VAF is 0, and in the TUMOR sample : VAF is 0.5.
Shall I need to filter this VCF file based on criteria such as :
AD of ALT allele in the TUMOR sample to be > 2, and VAF in the TUMOR sample to be > 0.3
which method would you suggest ? Thank you very much.
You can filter using bcftools or other similar vcf handling tool packages. See, for example: https://samtools.github.io/bcftools/howtos/filtering.html
Thanks so much. I do not know how to write the filtering expression in order to be compatible with bcftools.
Please could you help me on this, if possible. Thanks so much.
Bogdan
Sorry. We do not support bcftools. You will need to go to the bcftools documentation or github for that
You mention that you use Strelka. I use Strelka2, published in Nature Methods last year. It doesn't have an AD field. Rather, the number of reads for each allele looks like
The meaning of each field is described on a website. It results in an error