mkirsche / Jasmine

Jasmine: SV Merging Across Samples
MIT License
174 stars 16 forks source link

merge not behaving as expected for DUP #23

Closed kmurphy902 closed 2 years ago

kmurphy902 commented 2 years ago

Hi Melanie,

I have the following sample input from the GIAB trio processed using DRAGEN:

chr1    144905843       DRAGEN:GAIN:chr1:144905844-144907253    N       <DUP>   .       PASS    SVTYPE=CNV;END=144907253;REFLEN=1410    GT:SM:CN:BC:PE:QS:FT:DN ./1:1.33898:3:1:47,51:8:cnvQual ./1:1.29469:3:1:101,103:6:cnvQual       ./1:1.48387:3:1:50,58:17:PASS:Inherited
chr1    144907253       DRAGEN:GAIN:chr1:144907254-144987033    N       <DUP>   .       PASS    SVTYPE=CNV;END=144987033;REFLEN=79780   GT:SM:CN:BC:PE:QS:FT:DN ./1:2.06779:4:38:51,19:95:PASS  ./1:2.06284:4:38:103,43:105:PASS        ./1:2.06813:4:38:58,25:96:PASS:Inherited
chr1    144987033       DRAGEN:GAIN:chr1:144987034-144998774    N       <DUP>   .       PASS    SVTYPE=CNV;END=144998774;REFLEN=11741   GT:SM:CN:BC:PE:QS:FT:DN ./1:1.41905:3:11:19,7:24:PASS   ./1:1.27235:3:11:43,14:8:cnvQual        ./1:1.52716:3:11:25,11:35:PASS:Inherited

I installed jasmine today using conda, and I am using the following command:

jasmine file_list=GIAB.cnv_filtered_noLowDQ.vcf --normalize_type --allow_intrasample --output_genotypes --comma_filelist --nonlinear_dist --max_dist=5000 --ignore_strand out_file=jasmine_GIAB.cnv_filtered_noLowDQ.vcf

Merging is occurring (I am able to see events that were successfully merged in the output) but the only events that are merged are DEL. For example:

chr1    207541223       0_DRAGEN:LOSS:chr1:207541224-207542261  N       <DEL>   .       PASS    SVTYPE=;END=207542261;REFLEN=1038;SVLEN=-1038;STARTVARIANCE=2146376.000000;ENDVARIANCE=2301368.000000;AVG_LEN=-1560.333333;AVG_START
=207542733.666667;AVG_END=207544294.000000;VARCALLS=3;ALLVARS_EXT=(DRAGEN:LOSS:chr1:207541224-207542261,DRAGEN:LOSS:chr1:207542262-207544717,DRAGEN:LOSS:chr1:207544718-207545904);SUPP_VEC_EXT=1;IDLIST_EXT=DRAGEN:LOSS:chr1:207541
224-207542261;SUPP_EXT=1;STRANDS=??;SUPP_VEC=1;SUPP=1;SVMETHOD=JASMINE;IDLIST=DRAGEN:LOSS:chr1:207541224-207542261;INTRASAMPLE_IDLIST=DRAGEN:LOSS:chr1:207541224-207542261,DRAGEN:LOSS:chr1:207542262-207544717,DRAGEN:LOSS:chr1:207
544718-207545904    GT:IS:OT:DV:DR  0/1:.:CNV:0:.   ./.:.:CNV:0:.   0/1:.:CNV:0:.

All of my DUPs that should be merged are not being merged in the output, for example:

chr1    144905843       0_DRAGEN:GAIN:chr1:144905844-144907253  N       <DUP>   .       PASS    SVTYPE=;END=144907253;REFLEN=1410;SVLEN=1410;STARTVARIANCE=0.000000;ENDVARIANCE=0.000000;AVG_LEN=1410.000000;AVG_START=144905843.000
000;AVG_END=144907253.000000;VARCALLS=1;ALLVARS_EXT=(DRAGEN:GAIN:chr1:144905844-144907253);SUPP_VEC_EXT=1;IDLIST_EXT=DRAGEN:GAIN:chr1:144905844-144907253;SUPP_EXT=1;STRANDS=??;SUPP_VEC=1;SUPP=1;SVMETHOD=JASMINE;IDLIST=DRAGEN:GAI
N:chr1:144905844-144907253;INTRASAMPLE_IDLIST=DRAGEN:GAIN:chr1:144905844-144907253      GT:IS:OT:DV:DR  ./1:.:CNV:0:.   ./1:.:CNV:0:.   ./1:.:CNV:0:.
chr1    144907253       0_DRAGEN:GAIN:chr1:144907254-144987033  N       <DUP>   .       PASS    SVTYPE=;END=144987033;REFLEN=79780;SVLEN=79780;STARTVARIANCE=0.000000;ENDVARIANCE=0.000000;AVG_LEN=79780.000000;AVG_START=144907253.
000000;AVG_END=144987033.000000;VARCALLS=1;ALLVARS_EXT=(DRAGEN:GAIN:chr1:144907254-144987033);SUPP_VEC_EXT=1;IDLIST_EXT=DRAGEN:GAIN:chr1:144907254-144987033;SUPP_EXT=1;STRANDS=??;SUPP_VEC=1;SUPP=1;SVMETHOD=JASMINE;IDLIST=DRAGEN:
GAIN:chr1:144907254-144987033;INTRASAMPLE_IDLIST=DRAGEN:GAIN:chr1:144907254-144987033   GT:IS:OT:DV:DR  ./1:.:CNV:0:.   ./1:.:CNV:0:.   ./1:.:CNV:0:.
chr1    144987033       0_DRAGEN:GAIN:chr1:144987034-144998774  N       <DUP>   .       PASS    SVTYPE=;END=144998774;REFLEN=11741;SVLEN=11741;STARTVARIANCE=0.000000;ENDVARIANCE=0.000000;AVG_LEN=11741.000000;AVG_START=144987033.
000000;AVG_END=144998774.000000;VARCALLS=1;ALLVARS_EXT=(DRAGEN:GAIN:chr1:144987034-144998774);SUPP_VEC_EXT=1;IDLIST_EXT=DRAGEN:GAIN:chr1:144987034-144998774;SUPP_EXT=1;STRANDS=??;SUPP_VEC=1;SUPP=1;SVMETHOD=JASMINE;IDLIST=DRAGEN:
GAIN:chr1:144987034-144998774;INTRASAMPLE_IDLIST=DRAGEN:GAIN:chr1:144987034-144998774   GT:IS:OT:DV:DR  ./1:.:CNV:0:.   ./1:.:CNV:0:.   ./1:.:CNV:0:.

Any ideas? Thank you in advance!

mkirsche commented 2 years ago

Hi and thanks for your interest in Jasmine!

Unfortunately it seems like the kind of merging you are interested in is slightly different than what Jasmine does. If I understand your issue correctly, you have three nearby duplicated regions on chr1: 144905844-144907253,144907254-144987033, and 144987034-144998774. And you want to merge those into a single duplicate call spanning 144905844-144998774, meaning that the three original calls represent three parts of the same true SV.

On the other hand, Jasmine is more specialized for merging across samples and identifying cases where the exact same SV is represented slightly differently in different samples. When the --allow_intrasample flag is used, the same logic is applied to calls in the same sample, allowing Jasmine to capture cases where the exact same SV is output multiple times; the most common case of this is when two distinct sets of reads support the same SV call but with slightly different breakpoints. In doing this, Jasmine assumes that each SV call in the input VCFs represents a complete structural variant and "removes" duplicates in the same sample by merging them all into a single call. And because of this assumption about the input it is not able to combine multiple "parts" of the same variant into a single call.

More specifically, the reason the duplication calls are not merging is that Jasmine's distance threshold is based on both breakpoints. So when considering the first two variants, Jasmine represents them as (144905844, 1409 [the length of the SV call, or end - start]) and (144907254, 79779). Since the Euclidean distance between these points exceeds the distance threshold 5000, they are not merged. While increasing the threshold significantly would allow them to be merged, I would not recommend doing so since even if they are merged, the coordinates will not be correctly combined to represent the bigger range, and such a lenient threshold could also lead to false merges elsewhere.

I hope that helps clear things up a bit - please let me know if you have any further questions!

Melanie

mschatz commented 2 years ago

Thanks Melanie! Karyn, let us know if you have any questions. We would also be open to discussion if there are other features that you need.

Cheers

Mike

On Mon, Aug 30, 2021 at 5:45 PM Melanie Kirsche @.***> wrote:

Hi and thanks for your interest in Jasmine!

Unfortunately it seems like the kind of merging you are interested in is slightly different than what Jasmine does. If I understand your issue correctly, you have three nearby duplicated regions on chr1: 144905844-144907253,144907254-144987033, and 144987034-144998774. And you want to merge those into a single duplicate call spanning 144905844-144998774, meaning that the three original calls represent three parts of the same true SV.

On the other hand, Jasmine is more specialized for merging across samples and identifying cases where the exact same SV is represented slightly differently in different samples. When the --allow_intrasample flag is used, the same logic is applied to calls in the same sample, allowing Jasmine to capture cases where the exact same SV is output multiple times; the most common case of this is when two distinct sets of reads support the same SV call but with slightly different breakpoints. In doing this, Jasmine assumes that each SV call in the input VCFs represents a complete structural variant and "removes" duplicates in the same sample by merging them all into a single call. And because of this assumption about the input it is not able to combine multiple "parts" of the same variant into a single call.

More specifically, the reason the duplication calls are not merging is that Jasmine's distance threshold is based on both breakpoints. So when considering the first two variants, Jasmine represents them as (144905844, 1409 [the length of the SV call, or end - start]) and (144907254, 79779). Since the Euclidean distance between these points exceeds the distance threshold 5000, they are not merged. While increasing the threshold significantly would allow them to be merged, I would not recommend doing so since even if they are merged, the coordinates will not be correctly combined to represent the bigger range, and such a lenient threshold could also lead to false merges elsewhere.

I hope that helps clear things up a bit - please let me know if you have any further questions!

Melanie

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/mkirsche/Jasmine/issues/23#issuecomment-908722289, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABP346MWYAPUYTZXXIUIOTT7P3V3ANCNFSM5DCSPI7A . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

kmurphy902 commented 2 years ago

That makes perfect sense. I was hoping to be able to use jasmine on DRAGEN output--their joint genotyper splits events up like this which is less than ideal for my purposes! Thanks again for getting back to me so quickly