mkirsche / Jasmine

Jasmine: SV Merging Across Samples
MIT License
175 stars 16 forks source link

Jasmine "missing" a variant? #22

Closed kmegq closed 2 years ago

kmegq commented 3 years ago

Thank you for writing this great tool! We have somatic variant calls from cancer cell lines that were called as unpaired tumor samples with Manta, and would like to filter them to remove variants found in a panel of normal germline samples run in Manta's diploid calling mode.

I found a variant which Jasmine thinks is only present in the cell line and none of the normals, but merging with SURVIVOR showed that the same variant is NOT present in the cell line, but is present in three of the normals. Looking at the VCFs by hand, the variant appears to be present in both the cell line and some of the normals. I'm not sure why Jasmine isn't seeing it in the normals? Could it be because it is an incompletely assembled insertion, and Manta does not seem to output an SVLEN for it in the tumor-only VCF, whereas it does output SVLEN in the diploid VCF? I thought maybe it was because the "ALT" allele isn't listed the same way in the cell line vs. the normals, but I tested my theory by changing the normal file's "ALT" to be "" as it is in the cell line, and Jasmine still did not "see" the variant in the normal.

Original call in cell line: chr1 1033410 MantaINS:22087:7:7:0:2:0 T <INS> . PASS END=1033410;SVTYPE=INS;CIPOS=0,19;CIEND=0,19;HOMLEN=19;HOMSEQ=CAATAGTCGGGTAGTTCTT;LEFT_SVINSSEQ=CAATAGTCGGGTAGTTCTTTTATTTTTTTTTTTATTTATTTATGATAGTCACACAGAGAGAGAGAGAGAGGCAGAGACATAGGCAGAGGGAGAAGCAGGCTCCATGTACCGGGAGCCCGACGTGGGATTCGATCCTGGG;RIGHT_SVINSSEQ=GGAGAAGCAGGCGCCATGTACCGGGAGCCCGACGTGGGATTCGATCCTGGGTCTCCAGGATCACGCCCTAGGCCAAAGGCAGGCGCTAAACCGCTGCGCCACCCAGGGATCC GT:PR:SR 0/1:8,2:14,34

Original call in one of the normal VCFs: chr1 1033410 MantaINS:141331:5:5:0:0:0 T TCAATAGTCGGGTAGTTCTTTTATTTTTTTTTTTATTTATTTATGATAGTCACACAGAGAGAGAGAGAGAGGCAGAGACATAGGCAGAGGGAGAAGCAGGCTCCATGTACCGGGAGCCCGACGTGGGATTCGATCCTGGGTCTCCAGGATCACGCCCTAGGCCAAAGGCAGGCGCTAAACCGCTGCGCCACCCAGGGATCC 999 PASS END=1033410;SVTYPE=INS;SVLEN=200;CIGAR=1M200I;CIPOS=0,19;HOMLEN=19;HOMSEQ=CAATAGTCGGGTAGTTCTT GT:FT:GQ:PL:PR:SR 1/1:PASS:151:999,154,0:0,15:0,55

Jasmine merge output. Not showing genotypes here, but you can see by the SUPP_VEC that the variant is only noted in the first sample (the cell line): chr1 1033410 MantaINS:22087:7:7:0:2:0 T <INS> . PASS END=1033410;SVTYPE=INS;CIPOS=0,19;CIEND=0,19;HOMLEN=19;HOMSEQ=CAATAGTCGGGTAGTTCTT;LEFT_SVINSSEQ=CAATAGTCGGGTAGTTCTTTTATTTTTTTTTTTATTTATTTATGATAGTCACACAGAGAGAGAGAGAGAGGCAGAGACATAGGCAGAGGGAGAAGCAGGCTCCATGTACCGGGAGCCCGACGTGGGATTCGATCCTGGG;RIGHT_SVINSSEQ=GGAGAAGCAGGCGCCATGTACCGGGAGCCCGACGTGGGATTCGATCCTGGGTCTCCAGGATCACGCCCTAGGCCAAAGGCAGGCGCTAAACCGCTGCGCCACCCAGGGATCC;IS_SPECIFIC=1;SVLEN=0;STARTVARIANCE=0.000000;ENDVARIANCE=0.000000;AVG_LEN=0.000000;AVG_START=1033410.000000;AVG_END=1033410.000000;SUPP_VEC_EXT=100000000000000000000000;IDLIST_EXT=MantaINS:22087:7:7:0:2:0;SUPP_EXT=1;SUPP_VEC=100000000000000000000000;SUPP=1;SVMETHOD=JASMINE;IDLIST=MantaINS:22087:7:7:0:2:0

The Jasmine command I used was: jasmine file_list=${id}_list.txt --nonlinear_dist max-dist=1000 --mark_specific spec_len=0 spec_reads=0 --normalize_type --output_genotypes --keep_var_ids out_file=${id}_pon_merged.vcf Thank you for your help with this!

Best, Kate

mkirsche commented 3 years ago

Hi Kate,

Thanks for your interest in Jasmine! I have identified a bug in Jasmine where the --nonlinear_dist option isn't being activated correctly. I have fixed it locally, but am having issues pushing the changes to this repository at the moment because of ongoing Github server issues. I will push the bug fix as soon as I can, but in the meantime you can force the non-linear distance threshold by adding the parameter max_dist_linear=0.

Also another note: You were right that the incorrectly assembled insertion can cause issues; the variant in the tumor cell line doesn't have any indicator of its length, and so when being compared to other SVs it is considered to have a length of 0 bp. This is fine for this pair of variants, since even 0 is close enough to the length of the other variant (200 bp) when using a distance threshold of 1000, but this might cause issues (regardless of the merging software you use) when looking at longer variants or using smaller distance thresholds.

I hope that helps! Melanie

mkirsche commented 3 years ago

Version 1.1.4 has been pushed to bioconda with the fix, so you should be able to update it and get the expected results with your Jasmine command (though with max_dist instead of max-dist). Please let me know if you continue to have issues!

Thanks, Melanie