tjiangHIT / cuteSV

Long read based human genomic structural variation detection with cuteSV
MIT License
237 stars 33 forks source link

Malformed VCF due to `R` reference base #124

Open dennishendriksen opened 1 year ago

dennishendriksen commented 1 year ago

Hello @tjiangHIT, @Meltpinkg and cuteSV developers,

cuteSV v2.0.3 can produce malformed VCF output containing R nucleotides in the REF column. These are not allowed according to the VCF v4.2 specification: REF - reference base(s): Each base must be one of A,C,G,T,N (case insensitive). The VCF v4.3 specification additionaly mentions: IUPAC ambiguity codes should be converted to a concrete base. Downstream tools such as HTSJDK throw an error correctly stating that the VCF is malformed.

For our use case this result in analysis that cannot complete.

Example:

chr10   131592457       cuteSV.DEL.1321 TCCCAGGTTCAAGTGATTCTCTTGCTTAACCCTCCCGAGTAGCTGGGATTACAGGCACCCACAAGAACACCCAGCTGATTTTTGTATTTTTAGCAGAGACAGGGTTTCACTGTGTTGGCCAGGCTGGTCTCGAACTCCTGACCTTGTGATCTGCCTGCCTTGGCCTCCCAAAGTACTGGGATTAATTATTTTTCCTTTTTAAGGTTAAATAATATTCCATTTTGTGGATATGCCACATTTTGTTTATCCATTCATCTGTCAACAGACACTTGGGTTGCTTCCATCTTTTGACTATTGTGAATAATGCTGTTGTGGACATGGGTGTAGAAACATCTCTTTGAGGCTCTGCTTTTAATTCTTTGAGGTATATACCCAGAGGTGTAATTGCTGGATCATGTGAAATCTGAGAAACCACCATATTGTTTCTATAGTTGTGTAGTATCTCACTGTGGTTTTGATTTGCATTTTCCTAATTATTCATGTTGTTGAGCATCTTTTCATGTACTTATTGGTCATTTGTATATCATTGGAGAAATATATATTCAAGTCCTTTGTCTATTTTTTAATTGTGTTGTTTTTTGGTTGTTGAATTGCAAGAGTTCTTTATATATGGATAGTAATCCGTTATCAGATATATAATTTACAAATATTTCCTGCCATTCAGTGTGTTGCCTTTTACTCTGTTGACAGTGTCATTTGATTCACAAAAATTTTTAATATTTACATGTTCCAATTATCTGATTTTTTTGTTGCCTATGCTTTCGGTGTCGTAGCCAAGAAATCCTTGCCAAATGCAATGCCATGAAGCTGTGCCCCTACATTTTCTTGTGAGTATTCTAACTCTCATATCTAAGTCTTTGACTATTTTTAATTTCTGCATATGGTGTAAGGTAAGGGTACAACTTCATTCTTTTGCATGTGGCTATCCAGTTTTCCCAGTAACATTTGTTGAAAAGACTGTCCTTTTCCCTATTGGATAGTCCTAGCAACTTTTTAAAAAATCACAAGGCCATATATACAAGAGTTTATTTCTGGGCTCTCTATTCTATCTCACTGATCTATGTGTCTGTCTATACGTCAATACCACTCTGTTTTTAATACTGTAGATTTTTAGAAATTTTGAAACTAAGAAGTGTGAGACCTCCAACTGTGTTCTTTTTCAAGATTGTTTTTGCTATTTAGGGTCCCTTGAGATTCTATATGAATGTTAGGATAGATTTTTCTAGTTTTGTAAAAAAAAATTGATGTTGGAATTTTAAGATAAATTGCATTTAATCTAGAGACCACATCTTTCAATTTTAGGTCTTCTCATCTATGAACAAAGGATGTCTATTTTTGTAGTGTCTTTAATTTCTTTGAGCAATATTTCATAGTTTTCAGTGTACACATCTTTCACCTCCTTGGTTCAGTTTGTTTCTATTTTTTATTTTGTTTGGTCCCACTTTAAATGAAATTGCTTTCTTAATTTCTTTTTCAGGTTGTTCATTGTTATTGTATAGAAACACAGCTAATTTCTGTATGCTGAGTATTCTGTAAGTTTGCTAATTTTGTTATTAGTTCTATCATGTTTCTTATGGAATCTTTGGGGTTTTCTACATATGAAATTACATCATCTATGAAAGGGATCGTTTTACTTTTTATTTCCCAATTTTAATGCTTTTTATTTCCTAATTTATCTGGTCAAGATTTCCATTACTATGCTGAATTTAAAAGTAGGCATTCTTCCCTTGTGTCTTAGCTTAGAAGAAAAGTTTTCAATCTTTCATCATTAAGTATGATGTTAGCAATGGGCTTTCCATATATGGCCTTAATTATGTTGAGGTAGTTTCCTTCTGTTCCTAGTTTGGTGRATGTTTTTTATCATGGAAAGGTGTTGGATTTTGTCAAATATTTTTCTCCATCAATTGAGATGATCACATGGGAACTGTTTCTTCATTCTGTTAATGTAGTTATTACATTAATTCATTTTCATATGTTGAACTATCCTTGAATTTCAGAAATAAATCCCACGAGGTCATGTGTATAATTTTTTTGATGTGTCACTTAATTCTGTTCACTAATATTTGGTTGAGGATTTTTACATCAGTATTTATCAGAGATATTGATCTGTAGCTTAATTTTATTGTAGTACCTTTGTCTTGCTTTGGTGAAAGAGTAATCTTGGCCTTGAAGAATAAGTTTGAAAGTGTCCCCTTACCTTAAACTTTTTTGGAAACTTTTGAGAAGGATTAGTGTTAACTCTTCTTTAAATGTTTGGTAGAATTCACGAATGAAGCCATCAGCTCCTGGGATTTTCTTTGTTGGCAGATTTTGGATCATTGATTCAATCTCTTTGCTAGTTATATGTCTGTTCGTATTTTCTATTTCTTTGTGGKTTAGTCTTGGTAGGTGGTATATGTCTAGGAATTTATCCATTTTGTCTAGGTTGTCCAATTTTTTGGCATACAAATATTCATACTATTGTCTTATTAATATAATCATTTTATTTCTGTTAAATCAGTGGTAATGTCTGCACTTACATTTCTGATTTTAGTTATTGAGACTTCCCTCTTTTATCTTACTCAGTCGAACTAATTGTTCATTAATTTTGGTGATTTTTTCAAAGAACTGAACTTGGTTTTGCTAACTTACTCTACCATGTTCCTATTCTTTATTTCAGTTGTCTGTACTCTAGTCTTTATTATTTCTTTCCTTCTACTGGATTTGGGTTTAGTGTGTTCTCCCTTTTTCTACTTCTTTAAGGTATAATGTTAGATTGTTAATTTAAGATCTTTCTTCTTGTTTATCATAAGCATTTACACTATAAACTACCCTCCTAGCACAGATTTTGATGCATCTGGTAAGTTTTGGTATGTTTACTGTAGCCCTGCAATATAGTTTGAAGTCAGGTAATGTGATGCCTCCAGCTGTGTTCTTTTTGCTTAGGGTTGCCTTGGCCATTCGGGCTCTTTTTTGGTTCCATATGAATTTTAAAATAGTTTTTTCTAGTTCTGTGAAGAATGTCATTGGTAGCTTAATAAAAATAGCATTGAATCTGTACACTGCTTTGGGCAGTATGGTCATTTTAATAAGATTGATTCTTCCTATCTGTGAGCATGAGATTTTTAAAAATTTGTTTTTGTCTTACCTGATTTCTTTCAGCAGTGCTTTGTAATTCTCACTGCAGAGATCTTTCACCTCCCTGGTTAGCTGTATTCCTAGATATTTTWTCATTTTTGCAGCAATTGTGAATGAGATTGCCTTCCTGATTTGTTTCTCGGCTTGGTTTCTTCTTGTTGTTTGTGTACAGGAATGCTGGTGATTTTTCTACATTGATTTTGTATCCTGAAACTTTGCTGAAGTTGTTTATCAGCTGAAGGAGCTTTTGGGTCRAGACTATGGGTTTTTCTAGATATAGAATCATGTCATCTGCAAATAGGGATAGTCTGATATCCTCTCTTCCTATTTGGATATGCTTTATTTCTTTATTTTGCCTGATTGCTCTGGCTAAGACTTCCAATAATACTTGAATAGGATTGGTGAAAGAAGGCATTCTTGTCACGTGTTGGTTTTCAAAAGGAATTCTTCCAGCTTTTGCCCATTTAGTATGATGTTGCCTGTTAGTTTGTCACATATGGCTCTTATTATTTTGAGTTGTGTTCCAAAACATCATGGTGCTGGTACAAAAACAGGCACATAGACCAATGSAACAGATAGAGAGCCTAGAAATAAGACTGCACACTTACAACCATCTGATCTTCAACAAAGCTGACAAAAACAAGCAATGGGGAAAAGACTCCCTATTCAATAAATGGTACTTGGATAAGTGGCTAGCCATATGCAGAAGATTGAAGGTAGACCCCTTCCTTGCACCATATACCAAAATCAACTCAAGATGGATTAAAGACTTACATATAAAACCCAAAACTATAAAAAACCCTGGGAGACAACCTAGGCAATATTATCCTGTACATAGGAATGGGCAAAGATTTCATGACAAAGCAATCACAAAAGCAATCACAACAAAAACAAAAATTGACAAATAAGATCTAATTAAACTTAAGAGCTTCTGCACAGCAAAAGAAACTATCAGCAGAGTAAACAGACAACCTACAGGATGGCAGAAAATATTTGCATATTATGCATCTGACAAAGGTCTAATATCCAGCATCTATAAGAAACTTAAACAAGTTTATAAGCAAAAAACAAACAACCCCATTAAAAAGGGGGCAAAGGACATGAACACTTCTCAAAAGAAGACATACGTGCAACCAACAAGCATATGAAGAAAAGCTCAATATCACTGATCATTAGAGAAATGCAAATAAAAACCACAACGAGATACTGTCTCACAACAATCAGAATAGCATTATTAAAAATTCAAAAAAATAACAGATACTGGTGAGGTTGTGGAGAAAAGGGACCACTTATACACTGTTGATGAAAGTGTAAGTTAGTTCAACCATTGTGGAAAGCAGTATGGCGATTCTTCAAAGAAAGAGCTAAAAACAGAATTACCATTCAACTCAGGAATCCCATTACTGGGTATATGCCCAGAGGAATATAAATCATTCCACCATAAAGACACATGCACANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAATCTCGGCTCACTGCAACCTCCGT     T       0       q5      IMPRECISE;SVTYPE=DEL;SVLEN=-4698;END=131597155;CIPOS=-13,13;CILEN=-2,2;RE=10;RNAMES=NULL;AF=0.1471;STRAND=+-;AC=0;AN=2;CSQ=deletion|intergenic_variant|MODIFIER||||||||||||||||1||||1|||||||||||||||||||||||||||||||||||||||||||||||||| GT:DR:DV:PL:GQ  ./.:.:.:.:.     0/0:58:10:0,78,458:78   ./.:.:.:.:.

Command:

cuteSV --sample HG002 --genotype --max_cluster_bias_INS 100 --diff_ratio_merging_INS 0.3 --max_cluster_bias_DEL 100 --diff_ratio_merging_DEL 0.3 --threads 4 vip_AshkenazimTrio_HG002.cram.bam GCA_000001405.15_GRCh38_no_alt_analysis_set.fna cutesv_output.vcf .

We've replaced Sniffles2 with cuteSV which suffers from what seems to be the same issue including the REF_MISMATCH warnings mentioned in that issue. Actually this issue was the primary reason for the switch.

Greetings, @dennishendriksen

Meltpinkg commented 1 year ago

Hello, @dennishendriksen

In cuteSV, the content of REF column is extracted from the input reference genome, so that the R nucleotides comes from the reference genome. For the three questions mentioned above:

Hope it will help.

Best, Shuqi

dennishendriksen commented 1 year ago

Thank you for your quick reply.

The 'R' reference bases can be modified via using sed commands on the output VCF.

I can confirm that this is indeed what is stored in the reference genome. The sed workaround will probably work, thank you for the suggestion.

In order to produce valid VCF you might consider the following for a proper fix:

From https://samtools.github.io/hts-specs/VCFv4.3.pdf page 8: If the reference sequence contains IUPAC ambiguity codes not allowed by this specification (such as R = A/G), the ambiguous reference base must be reduced to a concrete base by using the one that is first alphabetically (thus R as a reference base is converted to A in VCF.)

The content of the REF column is extracted from the reference genome so that it is already matched. I keep seeing differences between the REF column content of cuteSV and the reference genome, for example:

chr10   132172946       cuteSV.BND.9    N       ]chr4:182834343]N       0       q5      IMPRECISE;SVTYPE=BND;RE=5;RNAMES=NULL;AF=0.1282;AC=0;AN=2

Check:

$ samtools faidx GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz chr10:132172946-132172946
>chr10:132172946-132172946
C

Alternative check:

$ bcftools norm --check-ref w --fasta-ref GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz cutesv.vcf.gz
REF_MISMATCH    chr10   132172946       N       C

Additional examples:

REF_MISMATCH    chr1    143203947       N       A
REF_MISMATCH    chr1    143203948       N       A
REF_MISMATCH    chr1    143211397       N       G
REF_MISMATCH    chr1    143211405       N       T
REF_MISMATCH    chr1    143242873       N       G
REF_MISMATCH    chr10   26893473        N       G
REF_MISMATCH    chr10   26894430        N       C
REF_MISMATCH    chr10   26894431        N       C
REF_MISMATCH    chr10   26894434        N       T
REF_MISMATCH    chr10   38918271        N       A
REF_MISMATCH    chr10   38918278        N       T
REF_MISMATCH    chr10   38918279        N       G
REF_MISMATCH    chr10   38918285        N       G
REF_MISMATCH    chr10   38969688        N       C
REF_MISMATCH    chr10   38969692        N       C
REF_MISMATCH    chr10   38969722        N       G
REF_MISMATCH    chr10   38969741        N       C
REF_MISMATCH    chr10   38969749        N       C
REF_MISMATCH    chr10   38969765        N       C
REF_MISMATCH    chr10   42066267        N       C
REF_MISMATCH    chr10   42066268        N       G
REF_MISMATCH    chr10   42119794        N       A
REF_MISMATCH    chr10   132172946       N       C
REF_MISMATCH    chr10   132172954       N       T
REF_MISMATCH    chr10   133004611       N       A
REF_MISMATCH    chr10   133740610       N       C
REF_MISMATCH    chr10   133740611       N       G
REF_MISMATCH    chr10   133740612       N       C
REF_MISMATCH    chr11   175283  N       C
REF_MISMATCH    chr11   176543  N       T
REF_MISMATCH    chr17   21290420        N       T

This seems like an cuteSV issue? Working around this issue is tricky, because the N REF is part of the ALT as well (e.g. N ]chr4:182834343]N). I'm curious to hear about your thoughts.

cuteSV sorts the VCF file in the lexicographical order now. To change the way of sorting records, you can use commands to modify the output VCF to sort it in other ways.

That makes perfectly sense. I'll do some postprocessing afterwards.

dennishendriksen commented 7 months ago

This seems like an cuteSV issue? Working around this issue is tricky, because the N REF is part of the ALT as well (e.g. N ]chr4:182834343]N ). I'm curious to hear about your thoughts.

We are still struggling with working around this issue. Any thoughts?