Closed jjfarrell closed 3 years ago
I went ahead and created an imputation panel with 170K SVs. The resulting imputed output did not include most of the SVs. None of the imputed variants had a size greater than 225. Is there a workaround for this? The REF and ALT of the SVs were represented as sequences, as for SNPs/indels, rather than using <DEL> or <INS> for the ALT.
Minimac4 will not do anything special to interpret the REF and ALT for SVs. There "shouldn't" be a limit on the REF/ALT sequence lengths if you were to convert them to long sequences, but this has likely never been tested.
Either way, there would need to be some overlap between your reference panel and the array genotypes you are trying to impute. So your reference panel would benefit from including SNP and short Indel variants as well as the SVs. Even then, I have no idea what level of accuracy to expect from such an analysis.
If you provide the log file from your test, I might be able to provide more insight.
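For anyone wanting to try the "convert them to long sequences" route mentioned above, here is a minimal sketch of expanding a symbolic <DEL> record into explicit REF/ALT strings. It assumes the usual VCF convention that POS is the base just before the deleted segment and INFO/END is the last deleted base. The function name `expand_deletion` and the `ref_seq` argument (the chromosome sequence as a plain string, loaded however you like) are hypothetical and not part of Minimac4.

```python
from typing import Tuple


def expand_deletion(ref_seq: str, pos: int, end: int) -> Tuple[str, str]:
    """Return explicit (REF, ALT) alleles for a symbolic <DEL> record.

    ref_seq: chromosome sequence as a string (assumed input, not from Minimac4).
    pos:     1-based VCF POS, the anchor base before the deleted segment.
    end:     1-based INFO/END, the last deleted base.
    """
    if not (1 <= pos < end <= len(ref_seq)):
        raise ValueError("expected 1 <= POS < END <= len(ref_seq)")
    anchor = ref_seq[pos - 1]          # base at POS, kept in both alleles
    ref_allele = ref_seq[pos - 1:end]  # anchor base plus the deleted bases
    return ref_allele, anchor


# Toy example: delete 1-based positions 3..5 ("GTA") after the anchor "C" at POS 2.
ref_allele, alt_allele = expand_deletion("ACGTACGT", 2, 5)
print(ref_allele, alt_allele)
```

Insertions and duplications would need the inserted sequence from the SV caller, so they are not covered by this sketch.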
Here is some more background. The reference panel includes GATK SNVs and indels along with a set of SVs called from about 5000 samples. While the SNPs and indels are imputed from the panel, most SVs are not output. None with an ALT length greater than 255 are output, suggesting that there is some sort of limit on the string.
Here is one region with a large deletion....
The SV is in the imputation panel and has an AC=43.
zcat m3vcf_v3/adsp-5k_v3.chr19.m3vcf.gz |grep -v ^#|cut -f1-8|grep -v '*'|grep 49943755
chr19 49943755 chr19:49943755 C CGGTCTTGGGAGCTGCCTGCCGAGCATTCATGCTGGTGACCCTGAATGCAGGGGAGAAAGGGGGTCAAATTAGGGTCATGGGGGCTA . . B27831.M35;Err=0.0099943;Recom=0.0012395
chr19 49943755 chr19:49943755 C CGGTCTTGGGAGCTGCCTGCCGAGCATTCATGCTGGTGACCCTGAATGCAGGGGAGAAAGGGGGTCAAATTAGGGTCATGGGGGCTAAGCG . . B27831.M36;Err=0.0099943;Recom=0.0012395
chr19 49943755 chr19:49943755 C T . . B27831.M37;Err=0.0099943;Recom=0.0012395
chr19 49943755 chr19:49943755 CAGACAGTATGTTGAATGGGAGACAATACTTGCAAATTATTCATCTGACCAAGGACTAATATCCAGAATATACAAGGAATGCAAACAACTCAACAGCAAAAGAACAACCACATTAAAAAGCAGGCAAAGGACATGAATAGATATTTCTAAAAATAAAACATAATGGTCGGGCATGGTGGTTCATGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCAGGCGGATCACCTGAGGTCAGGAGTTCGAGACCAGCCTGGCCAACATGGTGAAACCCTGTCTCTACTAAAAATATATATTAAAAAATTAGCTGGGCATGGTGGTGCATGCCTGTAGTCCTAGCTACCCAGGGGGTCAAGGCAGGAGACTCGCTTGAACCTGGGAGGTGGAGGTTGCAGTGTGCCGAGATCACACCATTGCACTCCAGCCTGGGTGACAGAGTGAGACTCCATCTTAAACAAACAAAAAAAGAAGACATACAAATGGCCAACAGGAATATGAAAAAATATTCTAATCCCTAATCATCGGAGAAATGCAAATAAAAACCACAGTGAGATATCATCCTACCCTGGTTAGAATGGCTATTATAAAAAAGACAAAAAATAACATGCCTGGAAAGATTCCATCCCAGAAAAGTGAACTCTTCTTTTTTTTTTTTTTTTTTGAGACAGAGTTCACTCTTGTTGCCCAGGCTGGAGTGCAATGGCACCATCTTGGCTCACCGCAACCTTCGCCTCCCGGGTTCAAGCGCTTCTCCTGCCTCAGCCTCCTGAGTAGCTGGGATTACAGGCATGCGCCACCACGCCCGGCTAATTTTGTATTTTCAGTAGAGACAGGGTTTCTCCATATTGGTCAGGCTGGTCTCAAACTCCTGACCTCAGGTGATCTGCCTGCCTCAGCCTCCCAAAGGGCTGGGATTACAGGCGTGAGCCACCGCGACCAGCCAGAAAAGGGAACTCTTCTACACTGTTGGTGGAAATGCAAATTAGTGTAGTCATTATGAGAAGAGTATGGCGATTTCTTCAAAAACTAAACCTAGAACTACCATATGAGTCAGCAATCCCACTACTGGGTATGTATCCAAAAGAAAGGAAGTCAATATATCAAAAGGATACCTGCACCTCCATGTAGATTACACCACTATTCATAACAGTAAAGATACGGAATCAACCTAAATGTCTATCAATTTATCTATTTCTTTTATAAGAAAATCCTCTCAGGCTGGGTGCGGTGGCTCATGCCTGTAATCCCAGCACTTTGGGAGGCCCAGGCGGGTGGATCACGAGGTCAGGAGATTGAGACCATCCTGGCTAACATGGTGAAACCCCGTCTCTACTAAAGATATAAAAAATTAGCCGGGCGTGGTGGCGGGCGCCTGTAGTCCCAGCTACTAGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCGAGATCGCACCACTGCACTCCAGCCTGGGCGACAGAGCGAGACTCTGTCTCAAAAAAAAAAAAAAAAAATCTTTATCCTCATTGTTTCCACATTGAGTAGGCTGAGGAGGAAGAGGAAGAGGAGAGACTCATTTTGCTGTCTCAGGGGTGGCAGAAGAGGAAGAAAATCTGTGTATAACCCAGTGTAAACTTGTGCTGTTCAAAGGTCAGCTGTACACACACATAGACACACAAAACATGCATAATACAGTTGAATACTACTCAGCCATAAGATAAAGAAATCCTGCCATTTGCAACAACAGAGTTAACTCTAAATCAGACGTAGACAAATACCGTGCTATCTCACGGATGCGTGGAATCCACAGAAGTCAAACTGTCACTAGAGAGTTGTCCAGGTTCTTGGCGTGTTGAACAAAGAATTGAACAAAATGCACAAATCAAGTAACAAAAGAAAATGCAAGGAGGGAAAAACCAGCAGAAGAATGGAGTAACAAAAGCACAGATTTTTTTTAAGTTCTGGGGTA
CATGTGCAGGACGTGCAGCTTTGTTACATAGGTAAACGTGTGCCATGGTGGTTTGCTGCACCTATCAACCCATCACCTAGGTATTAAGCCCAGCATGCATTAGCTGTTTTTCCTAATCCCGTCCCTCCCCCGGCCCCTTGACAGGCCCCAGTGTGGGTTTTTCCCCTTCCTGTGTCCATGTGTTCTCATTGTTCAGCTCCTACTTATAAGTGAGAACATGTGGTGTTTGGCTTTCTGTTCCTGTGTTAGTTTGCTGAGGATAATGGCTTCCGGCTCCATCCATGTCCCTGCAAAGGACATGATCATCCTTTTTTCTGGCTGCATAGTATTCCATGGTGTATATGCACCATTTTCTTTTTCTTTTTCTTTTTTTTTTTTTTTTTTTTGAGACAGAGTCTTACTCTGTTGCCCAGGCTGGAGTGCAGTGGCGTAATCTTGGCCCACTGCAACGTCTGCCTTCCGGGTTCAAGTGATTCTCCTGCCTCAGCCTCCCGAGTAGCTGGGACTACAGGCACCCGCCACCACACCCTGCTTTTTTGTATTTTTAGTAGAGACGGGGTCTCACCATGTTGGCCAGGATAATCTCCATCTCTTGACCTCGTGATCCACCTGCCTCGGCCTCCCAAAGTTCTGGGATTACAGGCGTGAGCCACTGCACCCAGCCCACCACATTTTCTTTATCCAGTCTATCATTGATGGGCATTTGGGTTGATTCCAGGTCTTTGCTATTGTGAATAGAAAAGCACAGATTTATTGAAGACAATTCAGAGTGGGAGCGGGATCGAGCCAGCAGCTCAAGAGCCTCCCCAATTAAGGTTTGTAATAAGCTAGAAGGAACCCGGCAACACCCCTAGACGCCCTTCAGAGGCTTCCAATTGGTTAGATGCTATGAAGGATTGGCTCAGGACCAATCAGTGGCCGTAGTGGAGACCCGGCCCGCAGTCAATCAGAGGCTGATGTGGCTTGTTATCACCAGGGAAGACGTGACCTGTAAGCCACACCTGCTGCTCTCCTGCCTGTAGGAACTGGCTGCACCTGCTGAACCCCTGTTCCTCTAATCCCCTATTCTCCTGCCACAAAACTCATGGAAACAAAGTAGAATTAGAACCGTGGTTGCTGGGGGCTAGGAGGGGAGAAAAACAGACAGATGTCGGTCAAAGGGTAGAAACTTGCATTTACAAGATGGGGTCAGCCCGGCATGGTGGCTCACACCTGTAATCTTAGCACTTTTGGAGGCTGAGGCGGGAGGACTGCTTGAGGCCAGGACTTTGAGACCAGCCTGGGCAACATAGTGAGATCTCGTTTCTTTCTTTTCTTTTCTTTTTTTGAGACAGAGTTTTGCTTTTGGCGTCCAGGCTGGAGTGCAATGGCGCGATCTCAGCTCACTGCAGCCTCGGCCCCCCGGGTTTAAGTATTTCTCCTGCCTCAGCCTCCCTAGTAGCTGGGATTATAGGCACCCACTACCAGGCCCTGCTAATTTTTGTATTTTTAGTAGTGATGGCATGATTACAGGCGTGAGGCACCGCACCCTGCCGTGAGATCCCGTTTCTACAAAATAAATAAATAAATAAATAATTACCTGGGTGTGATGGCATGTGCCTGTGGTTCTAGCTACTTAGGAGGCTTTAGTGGGAGGATCACTTGAGCCCAGGAGATTGAGGCTGCAGTGAGCTGCAGTGAGCTGCAGTGCCACTGCACTCCAGCCTGGGTAGCAGAGCAAGACCCTGTCTCAACAACCAACCAACCAACAATCAGTTCTGGGAGTCTACTGTACAGCATGGTGACTACAGTTAACAATACTGTATCGCATACTTGAAATTTGTTAACAGAGTACATCTTAAGTGTTTCTGCCACACACACACCCAAAGAGAACTGTGGAGGTGGCTGGGCATGGTGGCTCATGCCTGTAATCCCAGCACTTTGAGAGGCCGAAAGGGGCAGATCACCTGAGGTCAGGAGTTCGAGATTAGCCTGGCCAACATGGTGAAATCCTGACTCTA
CTAGAAATACAAAAATTAGCCAGGCATGCTGGCACGCGACTGTAATCCCAGCTACTCGAGAGGCTGAGGCAGGAGAATCGCTTGAAACCCGGGAGGTGGAGGTTGCAGTGAGCTGAGATCGTGCCATTGCACTCCAGCCTGGGTGACGGAGCGAGACTTGGTCTTAAAAAAAAAAAAAAAAAAGGAGAGAGACCTATGAAGGTGATGGAACAGGTAGTGCTCATTTCAGAATGTATACAAATATCAAAACATCATGATGTTCATGGTAAATATACTCAATTTTTATCTTTCAGTTATGCCTCAAAAATCTGGAAAAAGAAAAAAAAAACCCTTTAGAGTGGAGGAGAAGGAAGCAAGTGGGGCACAGGGAGAAACGAGCTGAGACACAGACCAACCCCAGCGTCAGCTGAGACTTCAGAGACTTCTGAGGCTGGAACTGCCCTGCAGAGGCCACCATCAGGGTCAGGATGGTCAGAACTTTATACCTAGTGGGCTCCCCTAAGAAAAAGTGTGACTCAGGCGAGGGACTCGCTGCACACACGCCTCTGTGGGATACTTTATTTATTTATTTTTGAGATAGAGCCTCACTCTGTTGCCCAGGCTGGAGTGCAGTGGTGCAATCTCGGCTCACTGCAACCTCTGCCTCTTGGGTTCAAGCGATTCTCCTGCCTCAGCCTCCTGAGTAGCTCAGATTACAGGTGCGCACCACCATGCCCAGCTAATTTTCTTGTATTTTTAGTAGAGATGGGGTTTCACCATGTTGGCCAAGATGGCCTCAATCTCCTGATCTCGTGATCCACCTGCCTTGGCCTCCCAAAGTGCTGGGATTACAGGTGTGAGCTACTGTGCCCGGCCTTATTTATTTATTTTTGAGACAGAGCCTCACTCTGTTGCCCAGGCTGGGGTGCAGTGGTGCAATCTCAGCTCGCTGCAACCTCTGCCTCCTGAGTTCAAGTGATTCTCGTGCCTCAGCCTCCTGAGTAGCTGGGTCTACAGGTGTGCACCACTACATCTGGCTAATTTTTGTATTTTTGGTAGAGATGGGGTTTCACCATGTTGGCCAGGCTGGTCTTGAACTCCTGACCTCATGTGATCTGCCTGCCTTGGCCTCCCAAAGTGCTGGGATTACAGGTGTGAGCCACTGCACCCAGCCTGGATGCTTTATTTTAATGATTTCCTTTCTGTCCCACTACAGCCCTGTGATTTAGGCTTATTGTGTCTATTTTACTGAGGAGTAAGATGACTTACTCAAACTCAGGCTTTATTTATTCATTTATTTATTTAGAAACAGAGTCTCACTCTGTTGCCCAGGCTGGAGTGCAGTGGGACGATCTCGGCTCACTGCAACCTCCACCTCCCGGGTTCAACCAATTCTCTGCCTAGTCTCCCGAGTAGCTGGGATTTTAGGCGCCCGCCACCATGCCCGGCTAATTTTTGTATTTTTAGTAGAGATGGGGTTTCACCATCTTGCCCAGGCTGGTCTTGAACTCCTGACTTGTGATCCACCCACTTCAGCCTCCCAAAGTGTTGAGATTACAGGCGTGAGCCACCATGCCCAGCCTTTTTTTTTTTTTTAAATGTTCAAATGGGAACCACTTGGACTTGGTCCTCTCACTCTCCCTCTTCTGAAGGAAGAGCATGGTCATCAACGGGGAATGGCAGTTGCAGCAAACAACTCCAGGAGCTGGCTTCTCGTTCTGGAGAGCACCCTGTGCCTCCTCTGCCTGGTTTCCTGTGCTTTACACATCCAGAGAAGCTTCTGTAGTAATGAACCATAGACACGATGCCTCAAAGTGTCATCTTCAAACTCGCTCTGAATTGAAAGTATAATCTTCAGCCAGCTGTGGTGGCTCACGTCTGTAATCCCAGCACTTTGGGAGGCAGAGGCGGGCAGATTGCTTGAGCCCAGGAGTCTGAGACCAGCCAGGGAAACATGGCAAAACCCTGTCTCTACTAAAAATATAACAATGAGCCTGGCATGGTGGCTCACAATTGTAGTC
CCAGCTACTTGGGAGGCTGAAATGGGAGGCTCACTTCAACCTGGCAGGTTGAGGCTACAGTGAGCTGAGATTGCACCACTGGACTCTGGCCTGGAGGACAGAGTCAGACCCTGTCTCAAAAAAAGTATAATCTTCAAACTCAAGCTCTTCATTGGGGATGGGGCTGAAATCTGAGTCCAGTTCTGGCCGTCACACCAGTGCGACTCCCACACACTTGCTGGTGTCCTGTTGCCATGGAGACCTCTTCACTTTGGAACCATCCCTGACATCTCCCTCTCCAATTGAAGCCCAAAGCCTGGGCCCCTCAGGGGCTGTCCTGTGTGGATCTTGATCTCCGAGTACTCGGTGGTGCTGGGGGCCTCCTGGTCCGCAGGCTCCCAGAGCCTCAGGCCCTGGAAGCTGAGGGAGGCATAGTGGAGCTCCTGCTCTTCCCCCTTCCCCGGGGTGTAGGTGGCTGCACCTGGGGGCGGGT C
The imputed output does not contain any SVs in this region, only SNVs.
chr19 49943631 chr19:49943631:G:T G T . PASS AF=0.00019;MAF=0.00019;R2=0.02821;IMPUTED
chr19 49943689 chr19:49943689:T:C T C . PASS AF=0.00000;MAF=0.00000;R2=0.00205;IMPUTED
chr19 49943834 chr19:49943834:T:A T A . PASS AF=0.00007;MAF=0.00007;R2=0.10445;IMPUTED
chr19 49943838 chr19:49943838:A:G A G . PASS AF=0.00007;MAF=0.00007;R2=0.03131;IMPUTED
chr19 49943976 chr19:49943976:G:A G A . PASS AF=0.00000;MAF=0.00000;R2=0.00080;IMPUTED
chr19 49943987 chr19:49943987:G:A G A . PASS AF=0.00000;MAF=0.00000;R2=0.00000;IMPUTED
chr19 49943994 chr19:49943994:A:G A G . PASS AF=0.00010;MAF=0.00010;R2=0.02764;IMPUTED
chr19 49944052 chr19:49944052:A:G A G . PASS AF=0.00003;MAF=0.00003;R2=0.01014;IMPUTED
chr19 49944068 chr19:49944068:A:G A G . PASS AF=0.00000;MAF=0.00000;R2=0.00660;IMPUTED
chr19 49944078 chr19:49944078:A:C A C . PASS AF=0.00004;MAF=0.00004;R2=0.05528;IMPUTED
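One way to test the suspected length cutoff is to compare the longest REF/ALT allele in the panel with the longest one in the imputed output. The sketch below assumes tab-delimited VCF-style lines (CHROM, POS, ID, REF, ALT in the first five columns, which is also how the m3vcf records above are laid out). The helper names `allele_lengths` and `longest_allele` are hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def allele_lengths(lines: Iterable[str]) -> Iterator[Tuple[str, int, int, int]]:
    """Yield (chrom, pos, ref_len, longest_alt_len) for each record line."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        longest_alt = max(len(a) for a in alt.split(","))
        yield chrom, pos, len(ref), longest_alt


def longest_allele(lines: Iterable[str]) -> int:
    """Return the longest REF or ALT length seen, or 0 if there are no records."""
    return max((max(r, a) for _, _, r, a in allele_lengths(lines)), default=0)


# Toy records; for real files, pass an open (gzip) text handle instead of a list.
demo = [
    "##fileformat=VCFv4.2\n",
    "chr19\t100\tv1\tC\tCGGT\t.\t.\t.\n",
    "chr19\t200\tv2\tCAGAC\tC\t.\tPASS\t.\n",
]
print(longest_allele(demo))
```

If the panel's maximum is in the thousands and the output's maximum sits exactly at some round limit, that points to truncation; if long alleles are simply absent, a filtering step is the more likely cause.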
I notice that the SNPs all have the FILTER column set to PASS while the SVs have a missing filter. Do you have --passOnly enabled?
The log file "should" contain information about discarded variants. The relevant lines will start with NOTE ! if any exist.
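A quick way to check the FILTER observation above is to tally the FILTER column of the panel VCF before building the reference. This is a rough sketch assuming tab-delimited VCF lines with FILTER in column 7. The function name `filter_counts` is hypothetical, and the script makes no claim about how Minimac4 itself treats a missing (`.`) filter beyond what is discussed in this thread.

```python
from collections import Counter
from typing import Iterable


def filter_counts(lines: Iterable[str]) -> Counter:
    """Count records per FILTER value (7th column) in VCF-style lines."""
    counts: Counter = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        counts[fields[6]] += 1
    return counts


# Toy records mirroring the thread: SNVs marked PASS, an SV with a missing filter.
demo = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr19\t49943631\tsnv1\tG\tT\t.\tPASS\t.\n",
    "chr19\t49943689\tsnv2\tT\tC\t.\tPASS\t.\n",
    "chr19\t49943755\tsv1\tCAGACAGT\tC\t.\t.\t.\n",
]
for value, n in sorted(filter_counts(demo).items()):
    print(value, n)
```

If the SVs show up almost entirely under `.` while the SNVs and indels are under `PASS`, a pass-only setting would explain why only the SVs go missing.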
Great catch Jonathon !! I think it's the PASS filter that is causing this.
Thanks, it's working. R2=0.84 for the 6 kbp deletion looks great given a frequency of AF=0.00616.
@jjfarrell Hello, I would like to ask how effective this software is for imputing structural variation.
Minimac works great with SVs. Just create a reference panel from joint-genotyped pVCFs of SVs (Manta and GraphTyper) and SNPs (GATK). The main issue is making sure the SVs included in the panel are of good quality, so you are not phasing and imputing a lot of false-positive SVs.
Does Minimac4 handle structural variant imputation? The REF/ALT in a VCF for SVs are often represented without the sequence but with a placeholder. For example, the ALT will simply be represented as <DEL>, <DUP> or <INS> rather than the full sequence. The REF may be represented with an N. Will Minimac4 handle these formats? If the REF/ALT are converted into sequences, is there any limit on the REF/ALT size that will then be an issue?