nf-cmgg / structural

A bioinformatics best-practice analysis pipeline for calling structural variants (SVs), copy number variants (CNVs) and repeat region expansions (RREs) from short DNA reads
https://nf-cmgg.github.io/structural/
MIT License
18 stars 3 forks source link

Standardize variants for callers of the same type #59

Closed nvnieuwk closed 6 months ago

nvnieuwk commented 9 months ago

Description of feature

Not all callers report all variants in the same way. This needs some standardization.

Some links for inspiration:

nvnieuwk commented 9 months ago

Jasmine merging and normalization can be improved by following the steps in this pipeline: https://github.com/mkirsche/Jasmine/tree/master/pipeline

nvnieuwk commented 9 months ago

Overview of work to do:

nvnieuwk commented 9 months ago

After a day of searching I could't really find a good resource to standardize VCF files that was easily available. I'm afraid we'll have to make our own tooling for this

mvheetve commented 9 months ago

I've started looking at the fields that we should include (at least for a first release) for the SV callers:

  1. DELLY
##FORMAT=<ID=DR,Number=1,Type=Integer,Description="# high-quality reference pairs">
##FORMAT=<ID=DV,Number=1,Type=Integer,Description="# high-quality variant pairs">
##FORMAT=<ID=RR,Number=1,Type=Integer,Description="# high-quality reference junction reads">
##FORMAT=<ID=RV,Number=1,Type=Integer,Description="# high-quality variant junction reads">

available as

##FORMAT=<ID=PE,Number=2,Type=Integer,Description="Paired-read support for the ref and alt alleles in the order listed">
with PE=DR,DV
##FORMAT=<ID=SR,Number=2,Type=Integer,Description="Split-read support for the ref and alt alleles in the order listed">
with SR=RR,RV
  1. SMOOVE
##FORMAT=<ID=PE,Number=1,Type=Integer,Description="Number of paired-end reads supporting the variant">
##FORMAT=<ID=SR,Number=1,Type=Integer,Description="Number of split reads supporting the variant">

available as

##FORMAT=<ID=PE,Number=2,Type=Integer,Description="Paired-read support for the ref and alt alleles in the order listed">
with PE=.,PE
##FORMAT=<ID=SR,Number=2,Type=Integer,Description="Split-read support for the ref and alt alleles in the order listed">
with SR=.,SR
  1. MANTA
    ##FORMAT=<ID=PR,Number=.,Type=Integer,Description="Spanning paired-read support for the ref and alt alleles in the order listed">
    ##FORMAT=<ID=SR,Number=.,Type=Integer,Description="Split reads for the ref and alt alleles in the order listed, for reads where P(allele|read)>0.999">

    available as

    ##FORMAT=<ID=PE,Number=2,Type=Integer,Description="Paired-read support for the ref and alt alleles in the order listed">
    with PE=PR
    ##FORMAT=<ID=SR,Number=2,Type=Integer,Description="Split-read support for the ref and alt alleles in the order listed">
    with SR=SR

UPDATE

after merge a new standardisation:

##FORMAT=<ID=PE_meanRef,Number=1,Type=Integer,Description="mean Paired-read support for the ref allele over all callers">
##FORMAT=<ID=PE_meanAlt,Number=1,Type=Integer,Description="mean Paired-read support for the alt allele over all callers">
  1. DELLY

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

  1. SMOOVE

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

  1. MANTA

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

available as

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

nvnieuwk commented 9 months ago

Awesome thank you! I'll have a better look when I'm implementing that in svync :)

mvheetve commented 9 months ago

I've started looking at the fields that we should include (at least for a first release) for the SV callers:

##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">

available as

##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
with SVLEN=|END-POS| or max(|END-POS|,|END2-POS2|) or SVLEN=INSLEN for insertions
  1. SMOOVE
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">

available as

##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
  1. MANTA
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=DUP,Description="Duplication">
##ALT=<ID=INV,Description="Inversion">
##ALT=<ID=BND,Description="Translocation">
##ALT=<ID=INS,Description="Insertion">
##ALT=<ID=DUP:TANDEM,Description="Tandem duplication">
##ALT=<ID=INS,Description="Insertion of novel sequence">
##ALT=<ID=CNV,Description="Copy number variable region">
mvheetve commented 9 months ago

I've started looking at the fields that we should include (at least for a first release) for the repeat caller:

##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the structural variant">
##INFO=<ID=RU,Number=1,Type=String,Description="The repeat unit of the STR">
##INFO=<ID=REPREF,Number=1,Type=Integer,Description="How many repeats are in the reference">
##INFO=<ID=RL,Number=1,Type=Integer,Description="The repeat length in basepairs">
##INFO=<ID=REPID,Number=1,Type=String,Description="The ID of the repeat">
##FORMAT=<ID=REPCN,Number=1,Type=String,Description="The mean copy number in this repeat expansion">
##FORMAT=<ID=REPCI,Number=1,Type=String,Description="The confidence interval for the size of the expanded allele">
##FORMAT=<ID=ADSP,Number=1,Type=String,Description="The amount of spanning reads consistent with the repeat allele">
##FORMAT=<ID=ADFL,Number=1,Type=String,Description="The amount of flanking reads that overlap the repeat allele">
##FORMAT=<ID=ADIR,Number=1,Type=String,Description="The amount of in-repeat reads that are consistent with the repeat allele">

available as

##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the structural variant">
##INFO=<ID=RU,Number=1,Type=String,Description="The repeat unit of the variant">
##INFO=<ID=REPREF,Number=1,Type=Integer,Description="Number of repeats in the reference">
##INFO=<ID=RL,Number=1,Type=Integer,Description="The repeat length in basepairs">
##INFO=<ID=REPID,Number=1,Type=String,Description="The ID of the repeat">
##FORMAT=<ID=REPCN,Number=1,Type=String,Description="The mean number of repeats">
##FORMAT=<ID=REPCI,Number=1,Type=String,Description="The confidence interval for the number of repeats">
##FORMAT=<ID=ADSP,Number=1,Type=String,Description="Spanning read support consistent with the repeat allele">
##FORMAT=<ID=ADFL,Number=1,Type=String,Description="Flanking read support that overlap the repeat allele">
##FORMAT=<ID=ADIR,Number=1,Type=String,Description="In-repeat read support consistent with the repeat allele">

You could consider adding SVLEN, but as a FORMAT field, because this can vary for every sample,

##FORMAT=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
with SVLEN=len(RU) x REPCN

or as an INFO field, if you make a comma separated list of all SVLENs for all the ALT alleles in the merged vcf

##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
with SVLEN=c(len(RU) x REPCN)
mvheetve commented 9 months ago

I've started looking at the fields that we should include (at least for a first release) for the CNV callers:

##INFO=<ID=SVTYPE,Number=1,Type=String,Description="The type of structural variant">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant">
##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="The total length of the structural variant">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=CN,Number=1,Type=Integer,Description="Copy number">
mvheetve commented 9 months ago

For the ID field I suggest the following formula:

Caller_SVTYPE_0000001 Caller_SVTYPE_0000002 Caller_SVTYPE_0000003

We could also consider a more specific ID like Caller_SVTYPE_CHROM_POS(2)_END(2)_REF_ALT that can then be used to list and filter recurrent false positive calls. This type of ID does not necessarily need to be stored in the vcf, we can calculate this on the fly in the filter step as well. Your thoughts?

nvnieuwk commented 9 months ago

For the ID field I suggest the following formula:

Caller_SVTYPE_0000001 Caller_SVTYPE_0000002 Caller_SVTYPE_0000003

We could also consider a more specific ID like Caller_SVTYPE_CHROM_POS(2)_END(2)_REF_ALT that can then be used to list and filter recurrent false positive calls. This type of ID does not necessarily need to be stored in the vcf, we can calculate this on the fly in the filter step as well. Your thoughts?

IMO it will be easier to filter on the position and info fields instead of on the ID. Also how will we handle variants that are found by multiple callers and then merged by jasmine?

mvheetve commented 9 months ago

For the ID field I suggest the following formula: Caller_SVTYPE_0000001 Caller_SVTYPE_0000002 Caller_SVTYPE_0000003 We could also consider a more specific ID like Caller_SVTYPE_CHROM_POS(2)_END(2)_REF_ALT that can then be used to list and filter recurrent false positive calls. This type of ID does not necessarily need to be stored in the vcf, we can calculate this on the fly in the filter step as well. Your thoughts?

IMO it will be easier to filter on the position and info fields instead of on the ID. Also how will we handle variants that are found by multiple callers and then merged by jasmine?

Weren't we going to include an INFO field that lists all the callers where the variant was found?

For the recurrent false positives, how do you envision filtering with INFO fields?

nvnieuwk commented 9 months ago

Yes but will we use another Caller ID or use a list of all callers in the ID? We could filter on positions and SVTYPE? there always is some uncertainty for SVs so a simple position filter won't filter out everything I think

nvnieuwk commented 6 months ago

Added in #72