Standardize variants for callers of the same type

nvnieuwk commented 9 months ago

Description of feature

Not all callers report all variants in the same way. This needs some standardization.

Some links for inspiration:

nvnieuwk commented 9 months ago

Jasmine merging and normalization can be improved by following the steps in this pipeline: https://github.com/mkirsche/Jasmine/tree/master/pipeline

nvnieuwk commented 9 months ago

Overview of work to do:

[ ] ALT fields
- [ ] Set all ALT fields to their corresponding SV (e.g. <DUP>, <DEL>...) => Get this from the INFO/SVTYPE field
- [ ] Move insertion sequences to the INFO fields
[ ] FILTER fields
- [ ] Streamline the requirements for filters to pass
[ ] QUAL fields
- [ ] Streamline the quality
[ ] INFO
- [ ] What can we keep and what is really necessary per variant type
- [ ] INS
- [ ] DEL
- [ ] DUP
- [ ] CNV
- [ ] TRA
- [ ] BND
- [ ] Complex variants
- [ ] Add the used caller (already done for CNV)
- [ ] ...
[ ] FORMAT + sample
- [ ] Standardize genotype
- [ ] What to do with the CNV genotyping?
- [ ] Gridss genotyping?
- [ ] Is all genotyping done correctly and standardized between all callers?
- [ ] What other fields are necessary per variant type
- [ ] INS
- [ ] DEL
- [ ] DUP
- [ ] CNV
- [ ] TRA
- [ ] BND
- [ ] Complex variants
[ ] How to handle BND and TRA variants. (breakpoint notation vs breakend notation) => This should be the same across all variants. So conversion needs to be done to one of the notations. Breakpoint is used the most right now, but I think breakend notation will be used more in the future. So maybe add the option for both?

nvnieuwk commented 9 months ago

After a day of searching I could't really find a good resource to standardize VCF files that was easily available. I'm afraid we'll have to make our own tooling for this

mvheetve commented 9 months ago

I've started looking at the fields that we should include (at least for a first release) for the SV callers:

Paired and split read support: The following FORMAT fields should be available in the merged vcf as FORMAT fields:

DELLY

##FORMAT=<ID=DR,Number=1,Type=Integer,Description="# high-quality reference pairs">
##FORMAT=<ID=DV,Number=1,Type=Integer,Description="# high-quality variant pairs">
##FORMAT=<ID=RR,Number=1,Type=Integer,Description="# high-quality reference junction reads">
##FORMAT=<ID=RV,Number=1,Type=Integer,Description="# high-quality variant junction reads">

available as

##FORMAT=<ID=PE,Number=2,Type=Integer,Description="Paired-read support for the ref and alt alleles in the order listed">
with PE=DR,DV
##FORMAT=<ID=SR,Number=2,Type=Integer,Description="Split-read support for the ref and alt alleles in the order listed">
with SR=RR,RV

SMOOVE

##FORMAT=<ID=PE,Number=1,Type=Integer,Description="Number of paired-end reads supporting the variant">
##FORMAT=<ID=SR,Number=1,Type=Integer,Description="Number of split reads supporting the variant">

available as

##FORMAT=<ID=PE,Number=2,Type=Integer,Description="Paired-read support for the ref and alt alleles in the order listed">
with PE=.,PE
##FORMAT=<ID=SR,Number=2,Type=Integer,Description="Split-read support for the ref and alt alleles in the order listed">
with SR=.,SR

MANTA

##FORMAT=<ID=PR,Number=.,Type=Integer,Description="Spanning paired-read support for the ref and alt alleles in the order listed">
##FORMAT=<ID=SR,Number=.,Type=Integer,Description="Split reads for the ref and alt alleles in the order listed, for reads where P(allele|read)>0.999">

available as

##FORMAT=<ID=PE,Number=2,Type=Integer,Description="Paired-read support for the ref and alt alleles in the order listed">
with PE=PR
##FORMAT=<ID=SR,Number=2,Type=Integer,Description="Split-read support for the ref and alt alleles in the order listed">
with SR=SR

UPDATE

after merge a new standardisation:

##FORMAT=<ID=PE_meanRef,Number=1,Type=Integer,Description="mean Paired-read support for the ref allele over all callers">
##FORMAT=<ID=PE_meanAlt,Number=1,Type=Integer,Description="mean Paired-read support for the alt allele over all callers">

genotype: The following FORMAT fields should be available in the merged vcf as an FORMAT field:

DELLY

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

SMOOVE

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

MANTA

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

available as

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

nvnieuwk commented 9 months ago

Awesome thank you! I'll have a better look when I'm implementing that in svync :)

mvheetve commented 9 months ago

I've started looking at the fields that we should include (at least for a first release) for the SV callers:

The following INFO fields should be available in the merged vcf as an INFO field:
1. DELLY

##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">

available as

##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
with SVLEN=|END-POS| or max(|END-POS|,|END2-POS2|) or SVLEN=INSLEN for insertions

SMOOVE

##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">

available as

##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">

MANTA

##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">

The following ALT fields should be available in the merged vcf as an ALT field:

##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=DUP,Description="Duplication">
##ALT=<ID=INV,Description="Inversion">
##ALT=<ID=BND,Description="Translocation">
##ALT=<ID=INS,Description="Insertion">
##ALT=<ID=DUP:TANDEM,Description="Tandem duplication">
##ALT=<ID=INS,Description="Insertion of novel sequence">
##ALT=<ID=CNV,Description="Copy number variable region">

mvheetve commented 9 months ago

I've started looking at the fields that we should include (at least for a first release) for the repeat caller:

The following fields should be available in the merged vcf:

##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the structural variant">
##INFO=<ID=RU,Number=1,Type=String,Description="The repeat unit of the STR">
##INFO=<ID=REPREF,Number=1,Type=Integer,Description="How many repeats are in the reference">
##INFO=<ID=RL,Number=1,Type=Integer,Description="The repeat length in basepairs">
##INFO=<ID=REPID,Number=1,Type=String,Description="The ID of the repeat">
##FORMAT=<ID=REPCN,Number=1,Type=String,Description="The mean copy number in this repeat expansion">
##FORMAT=<ID=REPCI,Number=1,Type=String,Description="The confidence interval for the size of the expanded allele">
##FORMAT=<ID=ADSP,Number=1,Type=String,Description="The amount of spanning reads consistent with the repeat allele">
##FORMAT=<ID=ADFL,Number=1,Type=String,Description="The amount of flanking reads that overlap the repeat allele">
##FORMAT=<ID=ADIR,Number=1,Type=String,Description="The amount of in-repeat reads that are consistent with the repeat allele">

available as

##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the structural variant">
##INFO=<ID=RU,Number=1,Type=String,Description="The repeat unit of the variant">
##INFO=<ID=REPREF,Number=1,Type=Integer,Description="Number of repeats in the reference">
##INFO=<ID=RL,Number=1,Type=Integer,Description="The repeat length in basepairs">
##INFO=<ID=REPID,Number=1,Type=String,Description="The ID of the repeat">
##FORMAT=<ID=REPCN,Number=1,Type=String,Description="The mean number of repeats">
##FORMAT=<ID=REPCI,Number=1,Type=String,Description="The confidence interval for the number of repeats">
##FORMAT=<ID=ADSP,Number=1,Type=String,Description="Spanning read support consistent with the repeat allele">
##FORMAT=<ID=ADFL,Number=1,Type=String,Description="Flanking read support that overlap the repeat allele">
##FORMAT=<ID=ADIR,Number=1,Type=String,Description="In-repeat read support consistent with the repeat allele">

You could consider adding SVLEN, but as a FORMAT field, because this can vary for every sample,

##FORMAT=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
with SVLEN=len(RU) x REPCN

or as an INFO field, if you make a comma separated list of all SVLENs for all the ALT alleles in the merged vcf

##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
with SVLEN=c(len(RU) x REPCN)

mvheetve commented 9 months ago

I've started looking at the fields that we should include (at least for a first release) for the CNV callers:

The following fields should be available in the merged vcf:

##INFO=<ID=SVTYPE,Number=1,Type=String,Description="The type of structural variant">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant">
##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="The total length of the structural variant">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=CN,Number=1,Type=Integer,Description="Copy number">

mvheetve commented 9 months ago

For the ID field I suggest the following formula:

Caller_SVTYPE_0000001 Caller_SVTYPE_0000002 Caller_SVTYPE_0000003

We could also consider a more specific ID like Caller_SVTYPE_CHROM_POS(2)_END(2)_REF_ALT that can then be used to list and filter recurrent false positive calls. This type of ID does not necessarily need to be stored in the vcf, we can calculate this on the fly in the filter step as well. Your thoughts?

nvnieuwk commented 9 months ago

For the ID field I suggest the following formula:

Caller_SVTYPE_0000001 Caller_SVTYPE_0000002 Caller_SVTYPE_0000003

We could also consider a more specific ID like Caller_SVTYPE_CHROM_POS(2)_END(2)_REF_ALT that can then be used to list and filter recurrent false positive calls. This type of ID does not necessarily need to be stored in the vcf, we can calculate this on the fly in the filter step as well. Your thoughts?

IMO it will be easier to filter on the position and info fields instead of on the ID. Also how will we handle variants that are found by multiple callers and then merged by jasmine?

mvheetve commented 9 months ago

For the ID field I suggest the following formula: Caller_SVTYPE_0000001 Caller_SVTYPE_0000002 Caller_SVTYPE_0000003 We could also consider a more specific ID like Caller_SVTYPE_CHROM_POS(2)_END(2)_REF_ALT that can then be used to list and filter recurrent false positive calls. This type of ID does not necessarily need to be stored in the vcf, we can calculate this on the fly in the filter step as well. Your thoughts?

IMO it will be easier to filter on the position and info fields instead of on the ID. Also how will we handle variants that are found by multiple callers and then merged by jasmine?

Weren't we going to include an INFO field that lists all the callers where the variant was found?

For the recurrent false positives, how do you envision filtering with INFO fields?

nvnieuwk commented 9 months ago

Yes but will we use another Caller ID or use a list of all callers in the ID? We could filter on positions and SVTYPE? there always is some uncertainty for SVs so a simple position filter won't filter out everything I think

nvnieuwk commented 6 months ago

Added in #72

nf-cmgg / structural

Standardize variants for callers of the same type #59

Description of feature

UPDATE