statgen / Minimac4

GNU General Public License v3.0
Imputation of Structural Variants #40

Closed jjfarrell closed 3 years ago

jjfarrell commented 3 years ago

Does Minimac4 handle structural variant imputation? The REF/ALT for SVs in a VCF are often represented with a placeholder rather than the sequence. For example, the ALT will simply be represented as <DEL>, <DUP> or <INS> rather than the full sequence. The REF may be represented with an N. Will Minimac4 handle these formats? If the REF/ALT are converted into a sequence, is there any limit on the REF/ALT size that would then be an issue?
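
To see how many records in a panel use the placeholder form described above, a minimal sketch along these lines may help. It assumes plain tab-separated VCF text with ALT in the fifth column; the helper names and the two example records are made up for illustration and are not taken from the panel in this thread.

```python
import re

# Symbolic alleles are written in angle brackets, e.g. <DEL>, <DUP>, <INS>.
SYMBOLIC = re.compile(r"^<[^<>]+>$")


def is_symbolic(allele: str) -> bool:
    """Return True if the allele is a bracketed placeholder rather than sequence."""
    return bool(SYMBOLIC.match(allele))


def summarize_alleles(vcf_lines):
    """Count records whose ALT is symbolic vs. written out as sequence."""
    counts = {"symbolic": 0, "sequence": 0}
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        alts = fields[4].split(",")
        kind = "symbolic" if any(is_symbolic(a) for a in alts) else "sequence"
        counts[kind] += 1
    return counts


# Hypothetical records, for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr19\t100\tsv1\tN\t<DEL>\t.\t.\tSVTYPE=DEL",
    "chr19\t200\tsnp1\tC\tT\t.\tPASS\t.",
]
print(summarize_alleles(example))
```

For a real file, the same function can be fed the lines of an opened (or gzip-opened) VCF.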

jjfarrell commented 3 years ago

I went ahead and created an imputation panel with 170K SVs. The resulting imputed output did not include most of the SVs. The variants imputed did not have a size greater than 225. Is there a workaround for this? The REF and ALT of the SVs were represented as sequence, similar to the SNPs/indels, rather than using <DEL> or <INS> for the ALT.

jonathonl commented 3 years ago

Minimac4 will not do anything special to interpret the REF and ALT for SVs. There "shouldn't" be a limit on the REF/ALT sequence lengths if you were to convert them to long sequences, but this has likely never been tested.

Either way, there would need to be some overlap between your reference panel and the array genotypes you are trying to impute. So your reference panel would benefit from including SNP and short Indel variants as well as the SVs. Even then, I have no idea what level of accuracy to expect from such an analysis.

If you provide the log file from your test, I might be able to provide more insight.
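
The overlap mentioned above can be estimated before running an imputation. The following is only a rough sketch: it assumes both inputs are tab-separated VCF-style lines, and it matches sites on (CHROM, POS, REF, ALT), which is an assumption about how sites should be compared rather than a description of what Minimac4 does internally. The function names and example records are hypothetical.

```python
def site_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys, one per ALT allele."""
    keys = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        for allele in alt.split(","):
            keys.add((chrom, pos, ref, allele))
    return keys


def overlap_fraction(target_lines, panel_lines):
    """Fraction of target (array) sites that are also present in the panel."""
    target = site_keys(target_lines)
    panel = site_keys(panel_lines)
    if not target:
        return 0.0
    return len(target & panel) / len(target)


# Hypothetical records, for illustration only.
target = ["chr19\t100\t.\tC\tT", "chr19\t200\t.\tG\tA"]
panel = ["chr19\t100\t.\tC\tT\t.\t.\t.", "chr19\t300\t.\tCAAA\tC\t.\t.\t."]
print(overlap_fraction(target, panel))
```

A low fraction would suggest that the panel shares few sites with the array, whatever happens to the SVs themselves.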

jjfarrell commented 3 years ago

Here is some more background. The reference panel includes GATK SNVs and indels along with a set of SVs called from about 5000 samples. While the SNPs and indels are imputed from the panel, most SVs are not output. None with an ALT length greater than 255 are output, suggesting that there is some sort of limit on the string.
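
One way to check the length observation is to compare the longest allele in the panel with the longest allele in the imputed output. This is a sketch only, assuming tab-separated VCF-style records with REF and ALT in the fourth and fifth columns; the function name and the example records are hypothetical. A gap between the two numbers is consistent with a length cap but does not prove one, since other settings could also drop long variants.

```python
def max_allele_length(vcf_lines):
    """Return the length of the longest REF or ALT allele seen."""
    longest = 0
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4].split(",")
        longest = max(longest, len(ref), *(len(a) for a in alts))
    return longest


# Hypothetical records: one SNV and one 300 bp insertion.
panel_records = [
    "chr19\t1\t.\tC\tT\t.\tPASS\t.",
    "chr19\t2\t.\tC\t" + "A" * 300 + "\t.\t.\t.",
]
output_records = ["chr19\t1\t.\tC\tT\t.\tPASS\t."]

print(max_allele_length(panel_records), max_allele_length(output_records))
```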

Here is one region with a large deletion....

The SV is in the imputation panel and has an AC=43.

zcat m3vcf_v3/adsp-5k_v3.chr19.m3vcf.gz |grep -v ^#|cut -f1-8|grep -v '*'|grep 49943755

chr19   49943755        chr19:49943755  C       CGGTCTTGGGAGCTGCCTGCCGAGCATTCATGCTGGTGACCCTGAATGCAGGGGAGAAAGGGGGTCAAATTAGGGTCATGGGGGCTA .       .       B27831.M35;Err=0.0099943;Recom=0.0012395
chr19   49943755        chr19:49943755  C       CGGTCTTGGGAGCTGCCTGCCGAGCATTCATGCTGGTGACCCTGAATGCAGGGGAGAAAGGGGGTCAAATTAGGGTCATGGGGGCTAAGCG     .       .       B27831.M36;Err=0.0099943;Recom=0.0012395
chr19   49943755        chr19:49943755  C       T       .       .       B27831.M37;Err=0.0099943;Recom=0.0012395
chr19   49943755        chr19:49943755  CAGACAGTATGTTGAATGGGAGACAATACTTGCAAATTATTCATCTGACCAAGGACTAATATCCAGAATATACAAGGAATGCAAACAACTCAACAGCAAAAGAACAACCACATTAAAAAGCAGGCAAAGGACATGAATAGATATTTCTAAAAATAAAACATAATGGTCGGGCATGGTGGTTCATGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCAGGCGGATCACCTGAGGTCAGGAGTTCGAGACCAGCCTGGCCAACATGGTGAAACCCTGTCTCTACTAAAAATATATATTAAAAAATTAGCTGGGCATGGTGGTGCATGCCTGTAGTCCTAGCTACCCAGGGGGTCAAGGCAGGAGACTCGCTTGAACCTGGGAGGTGGAGGTTGCAGTGTGCCGAGATCACACCATTGCACTCCAGCCTGGGTGACAGAGTGAGACTCCATCTTAAACAAACAAAAAAAGAAGACATACAAATGGCCAACAGGAATATGAAAAAATATTCTAATCCCTAATCATCGGAGAAATGCAAATAAAAACCACAGTGAGATATCATCCTACCCTGGTTAGAATGGCTATTATAAAAAAGACAAAAAATAACATGCCTGGAAAGATTCCATCCCAGAAAAGTGAACTCTTCTTTTTTTTTTTTTTTTTTGAGACAGAGTTCACTCTTGTTGCCCAGGCTGGAGTGCAATGGCACCATCTTGGCTCACCGCAACCTTCGCCTCCCGGGTTCAAGCGCTTCTCCTGCCTCAGCCTCCTGAGTAGCTGGGATTACAGGCATGCGCCACCACGCCCGGCTAATTTTGTATTTTCAGTAGAGACAGGGTTTCTCCATATTGGTCAGGCTGGTCTCAAACTCCTGACCTCAGGTGATCTGCCTGCCTCAGCCTCCCAAAGGGCTGGGATTACAGGCGTGAGCCACCGCGACCAGCCAGAAAAGGGAACTCTTCTACACTGTTGGTGGAAATGCAAATTAGTGTAGTCATTATGAGAAGAGTATGGCGATTTCTTCAAAAACTAAACCTAGAACTACCATATGAGTCAGCAATCCCACTACTGGGTATGTATCCAAAAGAAAGGAAGTCAATATATCAAAAGGATACCTGCACCTCCATGTAGATTACACCACTATTCATAACAGTAAAGATACGGAATCAACCTAAATGTCTATCAATTTATCTATTTCTTTTATAAGAAAATCCTCTCAGGCTGGGTGCGGTGGCTCATGCCTGTAATCCCAGCACTTTGGGAGGCCCAGGCGGGTGGATCACGAGGTCAGGAGATTGAGACCATCCTGGCTAACATGGTGAAACCCCGTCTCTACTAAAGATATAAAAAATTAGCCGGGCGTGGTGGCGGGCGCCTGTAGTCCCAGCTACTAGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCGAGATCGCACCACTGCACTCCAGCCTGGGCGACAGAGCGAGACTCTGTCTCAAAAAAAAAAAAAAAAAATCTTTATCCTCATTGTTTCCACATTGAGTAGGCTGAGGAGGAAGAGGAAGAGGAGAGACTCATTTTGCTGTCTCAGGGGTGGCAGAAGAGGAAGAAAATCTGTGTATAACCCAGTGTAAACTTGTGCTGTTCAAAGGTCAGCTGTACACACACATAGACACACAAAACATGCATAATACAGTTGAATACTACTCAGCCATAAGATAAAGAAATCCTGCCATTTGCAACAACAGAGTTAACTCTAAATCAGACGTAGACAAATACCGTGCTATCTCACGGATGCGTGGAATCCACAGAAGTCAAACTGTCACTAGAGAGTTGTCCAGGTTCTTGGCGTGTTGAACAAAGAATTGAACAAAATGCACAAATCAAGTAACAAAAGAAAATGCAAGGAGGGAAAAACCAGCAGAAGAATGGAGTAACAAAAGCACAGATTTTTTTTAAG
TTCTGGGGTACATGTGCAGGACGTGCAGCTTTGTTACATAGGTAAACGTGTGCCATGGTGGTTTGCTGCACCTATCAACCCATCACCTAGGTATTAAGCCCAGCATGCATTAGCTGTTTTTCCTAATCCCGTCCCTCCCCCGGCCCCTTGACAGGCCCCAGTGTGGGTTTTTCCCCTTCCTGTGTCCATGTGTTCTCATTGTTCAGCTCCTACTTATAAGTGAGAACATGTGGTGTTTGGCTTTCTGTTCCTGTGTTAGTTTGCTGAGGATAATGGCTTCCGGCTCCATCCATGTCCCTGCAAAGGACATGATCATCCTTTTTTCTGGCTGCATAGTATTCCATGGTGTATATGCACCATTTTCTTTTTCTTTTTCTTTTTTTTTTTTTTTTTTTTGAGACAGAGTCTTACTCTGTTGCCCAGGCTGGAGTGCAGTGGCGTAATCTTGGCCCACTGCAACGTCTGCCTTCCGGGTTCAAGTGATTCTCCTGCCTCAGCCTCCCGAGTAGCTGGGACTACAGGCACCCGCCACCACACCCTGCTTTTTTGTATTTTTAGTAGAGACGGGGTCTCACCATGTTGGCCAGGATAATCTCCATCTCTTGACCTCGTGATCCACCTGCCTCGGCCTCCCAAAGTTCTGGGATTACAGGCGTGAGCCACTGCACCCAGCCCACCACATTTTCTTTATCCAGTCTATCATTGATGGGCATTTGGGTTGATTCCAGGTCTTTGCTATTGTGAATAGAAAAGCACAGATTTATTGAAGACAATTCAGAGTGGGAGCGGGATCGAGCCAGCAGCTCAAGAGCCTCCCCAATTAAGGTTTGTAATAAGCTAGAAGGAACCCGGCAACACCCCTAGACGCCCTTCAGAGGCTTCCAATTGGTTAGATGCTATGAAGGATTGGCTCAGGACCAATCAGTGGCCGTAGTGGAGACCCGGCCCGCAGTCAATCAGAGGCTGATGTGGCTTGTTATCACCAGGGAAGACGTGACCTGTAAGCCACACCTGCTGCTCTCCTGCCTGTAGGAACTGGCTGCACCTGCTGAACCCCTGTTCCTCTAATCCCCTATTCTCCTGCCACAAAACTCATGGAAACAAAGTAGAATTAGAACCGTGGTTGCTGGGGGCTAGGAGGGGAGAAAAACAGACAGATGTCGGTCAAAGGGTAGAAACTTGCATTTACAAGATGGGGTCAGCCCGGCATGGTGGCTCACACCTGTAATCTTAGCACTTTTGGAGGCTGAGGCGGGAGGACTGCTTGAGGCCAGGACTTTGAGACCAGCCTGGGCAACATAGTGAGATCTCGTTTCTTTCTTTTCTTTTCTTTTTTTGAGACAGAGTTTTGCTTTTGGCGTCCAGGCTGGAGTGCAATGGCGCGATCTCAGCTCACTGCAGCCTCGGCCCCCCGGGTTTAAGTATTTCTCCTGCCTCAGCCTCCCTAGTAGCTGGGATTATAGGCACCCACTACCAGGCCCTGCTAATTTTTGTATTTTTAGTAGTGATGGCATGATTACAGGCGTGAGGCACCGCACCCTGCCGTGAGATCCCGTTTCTACAAAATAAATAAATAAATAAATAATTACCTGGGTGTGATGGCATGTGCCTGTGGTTCTAGCTACTTAGGAGGCTTTAGTGGGAGGATCACTTGAGCCCAGGAGATTGAGGCTGCAGTGAGCTGCAGTGAGCTGCAGTGCCACTGCACTCCAGCCTGGGTAGCAGAGCAAGACCCTGTCTCAACAACCAACCAACCAACAATCAGTTCTGGGAGTCTACTGTACAGCATGGTGACTACAGTTAACAATACTGTATCGCATACTTGAAATTTGTTAACAGAGTACATCTTAAGTGTTTCTGCCACACACACACCCAAAGAGAACTGTGGAGGTGGCTGGGCATGGTGGCTCATGCCTGTAATCCCAGCACTTTGAGAGGCCGAAAGGGGCAGATCACCTGAGGTCAGGAGTTCGAGATTAGCCTGGCCAACATGGTGAAAT
CCTGACTCTACTAGAAATACAAAAATTAGCCAGGCATGCTGGCACGCGACTGTAATCCCAGCTACTCGAGAGGCTGAGGCAGGAGAATCGCTTGAAACCCGGGAGGTGGAGGTTGCAGTGAGCTGAGATCGTGCCATTGCACTCCAGCCTGGGTGACGGAGCGAGACTTGGTCTTAAAAAAAAAAAAAAAAAAGGAGAGAGACCTATGAAGGTGATGGAACAGGTAGTGCTCATTTCAGAATGTATACAAATATCAAAACATCATGATGTTCATGGTAAATATACTCAATTTTTATCTTTCAGTTATGCCTCAAAAATCTGGAAAAAGAAAAAAAAAACCCTTTAGAGTGGAGGAGAAGGAAGCAAGTGGGGCACAGGGAGAAACGAGCTGAGACACAGACCAACCCCAGCGTCAGCTGAGACTTCAGAGACTTCTGAGGCTGGAACTGCCCTGCAGAGGCCACCATCAGGGTCAGGATGGTCAGAACTTTATACCTAGTGGGCTCCCCTAAGAAAAAGTGTGACTCAGGCGAGGGACTCGCTGCACACACGCCTCTGTGGGATACTTTATTTATTTATTTTTGAGATAGAGCCTCACTCTGTTGCCCAGGCTGGAGTGCAGTGGTGCAATCTCGGCTCACTGCAACCTCTGCCTCTTGGGTTCAAGCGATTCTCCTGCCTCAGCCTCCTGAGTAGCTCAGATTACAGGTGCGCACCACCATGCCCAGCTAATTTTCTTGTATTTTTAGTAGAGATGGGGTTTCACCATGTTGGCCAAGATGGCCTCAATCTCCTGATCTCGTGATCCACCTGCCTTGGCCTCCCAAAGTGCTGGGATTACAGGTGTGAGCTACTGTGCCCGGCCTTATTTATTTATTTTTGAGACAGAGCCTCACTCTGTTGCCCAGGCTGGGGTGCAGTGGTGCAATCTCAGCTCGCTGCAACCTCTGCCTCCTGAGTTCAAGTGATTCTCGTGCCTCAGCCTCCTGAGTAGCTGGGTCTACAGGTGTGCACCACTACATCTGGCTAATTTTTGTATTTTTGGTAGAGATGGGGTTTCACCATGTTGGCCAGGCTGGTCTTGAACTCCTGACCTCATGTGATCTGCCTGCCTTGGCCTCCCAAAGTGCTGGGATTACAGGTGTGAGCCACTGCACCCAGCCTGGATGCTTTATTTTAATGATTTCCTTTCTGTCCCACTACAGCCCTGTGATTTAGGCTTATTGTGTCTATTTTACTGAGGAGTAAGATGACTTACTCAAACTCAGGCTTTATTTATTCATTTATTTATTTAGAAACAGAGTCTCACTCTGTTGCCCAGGCTGGAGTGCAGTGGGACGATCTCGGCTCACTGCAACCTCCACCTCCCGGGTTCAACCAATTCTCTGCCTAGTCTCCCGAGTAGCTGGGATTTTAGGCGCCCGCCACCATGCCCGGCTAATTTTTGTATTTTTAGTAGAGATGGGGTTTCACCATCTTGCCCAGGCTGGTCTTGAACTCCTGACTTGTGATCCACCCACTTCAGCCTCCCAAAGTGTTGAGATTACAGGCGTGAGCCACCATGCCCAGCCTTTTTTTTTTTTTTAAATGTTCAAATGGGAACCACTTGGACTTGGTCCTCTCACTCTCCCTCTTCTGAAGGAAGAGCATGGTCATCAACGGGGAATGGCAGTTGCAGCAAACAACTCCAGGAGCTGGCTTCTCGTTCTGGAGAGCACCCTGTGCCTCCTCTGCCTGGTTTCCTGTGCTTTACACATCCAGAGAAGCTTCTGTAGTAATGAACCATAGACACGATGCCTCAAAGTGTCATCTTCAAACTCGCTCTGAATTGAAAGTATAATCTTCAGCCAGCTGTGGTGGCTCACGTCTGTAATCCCAGCACTTTGGGAGGCAGAGGCGGGCAGATTGCTTGAGCCCAGGAGTCTGAGACCAGCCAGGGAAACATGGCAAAACCCTGTCTCTACTAAAAATATAACAATGAGCCTGGCATGGTGGCTCAC
AATTGTAGTCCCAGCTACTTGGGAGGCTGAAATGGGAGGCTCACTTCAACCTGGCAGGTTGAGGCTACAGTGAGCTGAGATTGCACCACTGGACTCTGGCCTGGAGGACAGAGTCAGACCCTGTCTCAAAAAAAGTATAATCTTCAAACTCAAGCTCTTCATTGGGGATGGGGCTGAAATCTGAGTCCAGTTCTGGCCGTCACACCAGTGCGACTCCCACACACTTGCTGGTGTCCTGTTGCCATGGAGACCTCTTCACTTTGGAACCATCCCTGACATCTCCCTCTCCAATTGAAGCCCAAAGCCTGGGCCCCTCAGGGGCTGTCCTGTGTGGATCTTGATCTCCGAGTACTCGGTGGTGCTGGGGGCCTCCTGGTCCGCAGGCTCCCAGAGCCTCAGGCCCTGGAAGCTGAGGGAGGCATAGTGGAGCTCCTGCTCTTCCCCCTTCCCCGGGGTGTAGGTGGCTGCACCTGGGGGCGGGT  C 

The imputed output does not have any SVs in this region, just SNVs.

chr19   49943631        chr19:49943631:G:T      G       T       .       PASS    AF=0.00019;MAF=0.00019;R2=0.02821;IMPUTED
chr19   49943689        chr19:49943689:T:C      T       C       .       PASS    AF=0.00000;MAF=0.00000;R2=0.00205;IMPUTED
chr19   49943834        chr19:49943834:T:A      T       A       .       PASS    AF=0.00007;MAF=0.00007;R2=0.10445;IMPUTED
chr19   49943838        chr19:49943838:A:G      A       G       .       PASS    AF=0.00007;MAF=0.00007;R2=0.03131;IMPUTED
chr19   49943976        chr19:49943976:G:A      G       A       .       PASS    AF=0.00000;MAF=0.00000;R2=0.00080;IMPUTED
chr19   49943987        chr19:49943987:G:A      G       A       .       PASS    AF=0.00000;MAF=0.00000;R2=0.00000;IMPUTED
chr19   49943994        chr19:49943994:A:G      A       G       .       PASS    AF=0.00010;MAF=0.00010;R2=0.02764;IMPUTED
chr19   49944052        chr19:49944052:A:G      A       G       .       PASS    AF=0.00003;MAF=0.00003;R2=0.01014;IMPUTED
chr19   49944068        chr19:49944068:A:G      A       G       .       PASS    AF=0.00000;MAF=0.00000;R2=0.00660;IMPUTED
chr19   49944078        chr19:49944078:A:C      A       C       .       PASS    AF=0.00004;MAF=0.00004;R2=0.05528;IMPUTED

jonathonl commented 3 years ago

I notice that the SNPs all have the FILTER column set to PASS while the SVs have a missing filter. Do you have --passOnly enabled?
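
The FILTER pattern described above can be tallied directly from the panel. This sketch assumes tab-separated VCF-style records with FILTER in the seventh column, as in the records pasted earlier in this thread. The 50 bp cutoff for calling an allele "long" is an assumed convention, and the function name and example records are hypothetical.

```python
from collections import Counter


def filter_counts_by_size(vcf_lines, sv_min_len=50):
    """Count records by (size class, FILTER value)."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        ref, alts, flt = fields[3], fields[4].split(","), fields[6]
        longest = max(len(ref), *(len(a) for a in alts))
        size_class = "long" if longest >= sv_min_len else "short"
        counts[(size_class, flt)] += 1
    return counts


# Hypothetical records: a PASS SNV and a long deletion with a missing FILTER.
records = [
    "chr19\t100\t.\tC\tT\t.\tPASS\t.",
    "chr19\t200\t.\t" + "A" * 60 + "\tC\t.\t.\t.",
]
for key, n in sorted(filter_counts_by_size(records).items()):
    print(key, n)
```

If the long alleles are all counted under "." while the short ones are under PASS, a pass-only setting would explain why only the short variants reach the output.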

The log file "should" contain information about discarded variants. The relevant lines will start with NOTE ! if any exist.
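
Pulling those lines out of a log can be done with a few lines of code. The sketch below relies only on the statement above that relevant lines start with NOTE !; the function name is hypothetical and the example log lines are placeholders, not real Minimac4 output.

```python
def note_lines(log_lines, prefix="NOTE !"):
    """Return log lines that start with the given prefix (leading spaces ignored)."""
    return [
        line.rstrip("\n")
        for line in log_lines
        if line.lstrip().startswith(prefix)
    ]


# Placeholder log text, for illustration only (not real Minimac4 output).
example_log = [
    "placeholder status line\n",
    "NOTE ! placeholder message about discarded variants\n",
    "another placeholder status line\n",
]
for line in note_lines(example_log):
    print(line)
```

For a real run, pass the lines of the opened log file instead of `example_log`.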

Santy-8128 commented 3 years ago

Great catch, Jonathon! I think it's the PASS filter that is causing this.

On Tue, May 18, 2021, 1:19 PM Jonathon LeFaive @.***> wrote:

I notice that the SNPs all have the FILTER column set to PASS while the SVs have a missing filter. Do you have --passOnly enabled?

The log file "should" contain information about discarded variants. The relevant lines will start with NOTE ! if any exist.

jjfarrell commented 3 years ago

Thanks, it's working. R2=0.84 for the 6 kbp deletion looks great given a frequency of AF=0.00616.

SunWinner01 commented 2 months ago

@jjfarrell Hello, I would like to ask how effective this software is for imputing structural variation.

jjfarrell commented 2 months ago

Minimac works great with SVs. Just create a reference panel from joint-genotyped pVCFs of SVs (Manta and GraphTyper) and SNPs (GATK). The main issue is to make sure the quality of the SVs included in the panel is good, so you are not phasing and imputing a lot of false-positive SVs.