hartwigmedical / hmftools

Various algorithms for analysing genomics data
GNU General Public License v3.0
198 stars 59 forks source link

Strelka2 PURPLE Compatiblity #31

Closed DarioS closed 5 years ago

DarioS commented 5 years ago

If you wanted to improve confidence further you could supply a somatic SNV vcf to PURPLE. We use Strelka for this, but it should work as long as your vcf contains the AD field which most do.

You mention that you use Strelka. I use Strelka2, published in Nature Methods last year. It doesn't have an AD field. Rather, the number of reads for each allele looks like

DP:FDP:SDP:SUBDP:AU:CU:GU:TU    12:0:21:0:0,0:0,1:1,25:11,53

The meaning of each field is described on a website. It results in an error

Exception in thread "main" java.lang.IllegalArgumentException: Allelic depths is a required format field in vcf file

p-priestley commented 5 years ago

Sorry. This was my mistake.

You are right that Strelka2 (and strelka1) does not provide an AD field, although most somatic point mutations do. We do use Strelka but we have some post processing steps in our pipeline prior prior to parsing in PURPLE.

We can make you a simple java process which annotates the AD field to the vcf if you like.

These are the fields in Strelka that are relevant (note - different conventions for INDELs and SNVs):

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

here is some example python code which shows the logic for our AD convention:

class StrelkaParser(object): def ad(self, allele_num, allele, call): if not hasattr(call.data, "TAR"): snp_field_name = "{}U".format(allele) return getattr(call.data, snp_field_name)[0] if hasattr(call.data, snp_field_name) else None elif allele_num == 0: return call.data.TAR[0] elif len(allele) > 1 or len(call.site.REF) > 1: return call.data.TIR[0] else: return None

DarioS commented 5 years ago

I forgot about the existence of the somatic indels file. Do you typically merge the SNV and indels files into a single VCF file before inputting to PURPLE? The VCF header of the somatic file has ##content=strelka somatic snv calls and the other file has ##content=strelka somatic indel calls. Could those be flags to trigger automatic calculation of AD in the software without needing any permanent modification of the VCF files?

p-priestley commented 5 years ago

Yes. We do also merge both files together.

Let me get back to you about a proper solution.

jonbaber commented 5 years ago

Hi Dario,

As the Strelka fields are non-standard I have elected not to support them directly in PURPLE. Instead I have created a tool (located in the PURPLE jar) that will generate the AD fields for you.

Note that it annotates SNPs and INDELS but will not override any existing AD values.

Example Usage: java -Xmx4G -cp purple.jar com.hartwig.hmftools.purple.tools.AnnotateStrelkaWithAllelicDepth -in strelka.vcf -out strelka.annotated.vcf

You will need to use the latest version of PURPLE (2. 2) to access the tool.

I have also updated the PURPLE readme to include a section on preparing Strelka output for PURPLE.

Regards, Jonathan

DarioS commented 5 years ago

It's not straightforward because of Strelka's inflexible output, but it seems easier with the new tool provided.

tanasa commented 1 year ago

Do we still have this feature in the latest versions of Purple (> 3.6). In my hands, Purple_v3.6 is the latest version that produces a vcf output :

_java -Xmx4G -cp purplev3.6.jar com.hartwig.hmftools.purple.tools.AnnotateStrelkaWithAllelicDepth -in $file -out $file.ad.vcf

shall I use any other recent versions (3.7, 3.8, 3.9), the error that I am getting is : Error: Could not find or load main class com.hartwig.hmftools.purple.tools.AnnotateStrelkaWithAllelicDepth Caused by: java.lang.ClassNotFoundException: com.hartwig.hmftools.purple.tools.AnnotateStrelkaWithAllelicDepth

charlesshale commented 1 year ago

This functionality no longer exists in Purple. I now only works with variant calling by Sage and Pave, which we recommend instead of Strelka and SnpEff.

tanasa commented 1 year ago

Thank you for your prompt reply. I would love to use Hartwig software suite, however, I would need to show my colleagues comparisons between Sage and Strelka2 or Mutect2, or between Purple and other CNV callers, such as Facets or Sclust. Which articles shall I use in order to substantiate my preference towards Sage / Purple, Amber or Cobalt ? Thank you :)

p-priestley commented 1 year ago

That Strelka tool simply annotated an 'AD' field which is required by PURPLE

We don't have a formal comparison article for SAGE vs Strelka &Mutect, sorry, so you will need to do that yourself. We initially used a consensus caller which incorporated Strelka and Mutect. The main reason we built SAGE were to reduce the FP rate and to allow tiered calling. Some of the other features of Sage are highligheted in the intro on the github page: https://github.com/hartwigmedical/hmftools/tree/master/sage#somatic-alterations-in-genome-sage

For PURPLE, we performed some copy number validations on an early version here: https://www.nature.com/articles/s41586-019-1689-y (see supplementary information section 10)

tanasa commented 1 year ago

Thank you so much. It has been very informative. On a separate note, may I ask for a suggestion regarding variant filtering ? I have used Purple to add AD to a Strelka2 VCF file, and a bcftools plugin to add VAF. Subsequently, the VCF file has the following parameters :

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR

chr1 54720 . C CTT . PASS IC=5;IHP=6;MQ=54.6;MQ0=1;NT=ref;QSI=40;QSI_NT=40;RC=3;RU=T;SGT=ref>het; SOMATIC;SomaticEVS=17.29;TQSI=2;TQSI_NT=2

AD:BCN50:DP:DP2:DP50:FDP50:SUBDP50:TAR:TIR:TOR:VAF

34,0:0.06:29:29:36.35:2.57:0:34,40:0,0:4,0:0 [ NORMAL]

3,3:0:3:3:4.56:0:0:3,4:3,3:0,1:0.5 [TUMOR]

I would like to filter the VCF file based on AD and VAF fields in the "FORMAT" section of TUMOR and NORMAL.

In the example above, in the NORMAL sample, AD is : 34,0 : where 34 is the AD of the REF allele, and 0 is the AD of the ALT allele. In the TUMOR sample , AD is 3,3, where 3 is the AD of the REF allele, and 3 is the AD of the ALT allele.

Talking about VAF, in the NORMAL sample : VAF is 0, and in the TUMOR sample : VAF is 0.5.

Shall I need to filter this VCF file based on criteria such as :

AD of ALT allele in the TUMOR sample to be > 2, and VAF in the TUMOR sample to be > 0.3

which method would you suggest ? Thank you very much.

p-priestley commented 1 year ago

You can filter using bcftools or other similar vcf handling tool packages. See, for example: https://samtools.github.io/bcftools/howtos/filtering.html

tanasa commented 1 year ago

Thanks so much. I do not know how to write the filtering expression in order to be compatible with bcftools.

Please could you help me on this, if possible. Thanks so much.

Bogdan

p-priestley commented 1 year ago

Sorry. We do not support bcftools. You will need to go to the bcftools documentation or github for that