Closed nvnieuwk closed 6 months ago
Jasmine merging and normalization can be improved by following the steps in this pipeline: https://github.com/mkirsche/Jasmine/tree/master/pipeline
Overview of work to do:
<DUP>
, <DEL>
...) => Get this from the INFO/SVTYPE
fieldAfter a day of searching I could't really find a good resource to standardize VCF files that was easily available. I'm afraid we'll have to make our own tooling for this
I've started looking at the fields that we should include (at least for a first release) for the SV callers:
FORMAT
fields should be available in the merged vcf as FORMAT
fields:##FORMAT=<ID=DR,Number=1,Type=Integer,Description="# high-quality reference pairs">
##FORMAT=<ID=DV,Number=1,Type=Integer,Description="# high-quality variant pairs">
##FORMAT=<ID=RR,Number=1,Type=Integer,Description="# high-quality reference junction reads">
##FORMAT=<ID=RV,Number=1,Type=Integer,Description="# high-quality variant junction reads">
available as
##FORMAT=<ID=PE,Number=2,Type=Integer,Description="Paired-read support for the ref and alt alleles in the order listed">
with PE=DR,DV
##FORMAT=<ID=SR,Number=2,Type=Integer,Description="Split-read support for the ref and alt alleles in the order listed">
with SR=RR,RV
##FORMAT=<ID=PE,Number=1,Type=Integer,Description="Number of paired-end reads supporting the variant">
##FORMAT=<ID=SR,Number=1,Type=Integer,Description="Number of split reads supporting the variant">
available as
##FORMAT=<ID=PE,Number=2,Type=Integer,Description="Paired-read support for the ref and alt alleles in the order listed">
with PE=.,PE
##FORMAT=<ID=SR,Number=2,Type=Integer,Description="Split-read support for the ref and alt alleles in the order listed">
with SR=.,SR
##FORMAT=<ID=PR,Number=.,Type=Integer,Description="Spanning paired-read support for the ref and alt alleles in the order listed">
##FORMAT=<ID=SR,Number=.,Type=Integer,Description="Split reads for the ref and alt alleles in the order listed, for reads where P(allele|read)>0.999">
available as
##FORMAT=<ID=PE,Number=2,Type=Integer,Description="Paired-read support for the ref and alt alleles in the order listed">
with PE=PR
##FORMAT=<ID=SR,Number=2,Type=Integer,Description="Split-read support for the ref and alt alleles in the order listed">
with SR=SR
after merge a new standardisation:
##FORMAT=<ID=PE_meanRef,Number=1,Type=Integer,Description="mean Paired-read support for the ref allele over all callers">
##FORMAT=<ID=PE_meanAlt,Number=1,Type=Integer,Description="mean Paired-read support for the alt allele over all callers">
FORMAT
fields should be available in the merged vcf as an FORMAT
field:##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
available as
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
Awesome thank you! I'll have a better look when I'm implementing that in svync
:)
I've started looking at the fields that we should include (at least for a first release) for the SV callers:
INFO
fields should be available in the merged vcf as an INFO
field:
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
available as
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
with SVLEN=|END-POS| or max(|END-POS|,|END2-POS2|) or SVLEN=INSLEN for insertions
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
available as
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
ALT
fields should be available in the merged vcf as an ALT
field:##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=DUP,Description="Duplication">
##ALT=<ID=INV,Description="Inversion">
##ALT=<ID=BND,Description="Translocation">
##ALT=<ID=INS,Description="Insertion">
##ALT=<ID=DUP:TANDEM,Description="Tandem duplication">
##ALT=<ID=INS,Description="Insertion of novel sequence">
##ALT=<ID=CNV,Description="Copy number variable region">
I've started looking at the fields that we should include (at least for a first release) for the repeat caller:
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the structural variant">
##INFO=<ID=RU,Number=1,Type=String,Description="The repeat unit of the STR">
##INFO=<ID=REPREF,Number=1,Type=Integer,Description="How many repeats are in the reference">
##INFO=<ID=RL,Number=1,Type=Integer,Description="The repeat length in basepairs">
##INFO=<ID=REPID,Number=1,Type=String,Description="The ID of the repeat">
##FORMAT=<ID=REPCN,Number=1,Type=String,Description="The mean copy number in this repeat expansion">
##FORMAT=<ID=REPCI,Number=1,Type=String,Description="The confidence interval for the size of the expanded allele">
##FORMAT=<ID=ADSP,Number=1,Type=String,Description="The amount of spanning reads consistent with the repeat allele">
##FORMAT=<ID=ADFL,Number=1,Type=String,Description="The amount of flanking reads that overlap the repeat allele">
##FORMAT=<ID=ADIR,Number=1,Type=String,Description="The amount of in-repeat reads that are consistent with the repeat allele">
available as
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the structural variant">
##INFO=<ID=RU,Number=1,Type=String,Description="The repeat unit of the variant">
##INFO=<ID=REPREF,Number=1,Type=Integer,Description="Number of repeats in the reference">
##INFO=<ID=RL,Number=1,Type=Integer,Description="The repeat length in basepairs">
##INFO=<ID=REPID,Number=1,Type=String,Description="The ID of the repeat">
##FORMAT=<ID=REPCN,Number=1,Type=String,Description="The mean number of repeats">
##FORMAT=<ID=REPCI,Number=1,Type=String,Description="The confidence interval for the number of repeats">
##FORMAT=<ID=ADSP,Number=1,Type=String,Description="Spanning read support consistent with the repeat allele">
##FORMAT=<ID=ADFL,Number=1,Type=String,Description="Flanking read support that overlap the repeat allele">
##FORMAT=<ID=ADIR,Number=1,Type=String,Description="In-repeat read support consistent with the repeat allele">
You could consider adding SVLEN, but as a FORMAT
field, because this can vary for every sample,
##FORMAT=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
with SVLEN=len(RU) x REPCN
or as an INFO
field, if you make a comma separated list of all SVLENs for all the ALT alleles in the merged vcf
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
with SVLEN=c(len(RU) x REPCN)
I've started looking at the fields that we should include (at least for a first release) for the CNV callers:
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="The type of structural variant">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant">
##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="The total length of the structural variant">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=CN,Number=1,Type=Integer,Description="Copy number">
For the ID field I suggest the following formula:
Caller_SVTYPE_0000001
Caller_SVTYPE_0000002
Caller_SVTYPE_0000003
We could also consider a more specific ID like
Caller_SVTYPE_CHROM_POS(2)_END(2)_REF_ALT
that can then be used to list and filter recurrent false positive calls. This type of ID does not necessarily need to be stored in the vcf, we can calculate this on the fly in the filter step as well. Your thoughts?
For the ID field I suggest the following formula:
Caller_SVTYPE_0000001
Caller_SVTYPE_0000002
Caller_SVTYPE_0000003
We could also consider a more specific ID like
Caller_SVTYPE_CHROM_POS(2)_END(2)_REF_ALT
that can then be used to list and filter recurrent false positive calls. This type of ID does not necessarily need to be stored in the vcf, we can calculate this on the fly in the filter step as well. Your thoughts?
IMO it will be easier to filter on the position and info fields instead of on the ID. Also how will we handle variants that are found by multiple callers and then merged by jasmine?
For the ID field I suggest the following formula:
Caller_SVTYPE_0000001
Caller_SVTYPE_0000002
Caller_SVTYPE_0000003
We could also consider a more specific ID likeCaller_SVTYPE_CHROM_POS(2)_END(2)_REF_ALT
that can then be used to list and filter recurrent false positive calls. This type of ID does not necessarily need to be stored in the vcf, we can calculate this on the fly in the filter step as well. Your thoughts?IMO it will be easier to filter on the position and info fields instead of on the ID. Also how will we handle variants that are found by multiple callers and then merged by jasmine?
Weren't we going to include an INFO
field that lists all the callers where the variant was found?
For the recurrent false positives, how do you envision filtering with INFO
fields?
Yes but will we use another Caller ID or use a list of all callers in the ID? We could filter on positions and SVTYPE? there always is some uncertainty for SVs so a simple position filter won't filter out everything I think
Added in #72
Description of feature
Not all callers report all variants in the same way. This needs some standardization.
Some links for inspiration:
bedgovcf