mkirsche / Jasmine

Jasmine: SV Merging Across Samples
MIT License

Merge SVs with high percentage of overlap #13

Closed cai1991 closed 3 years ago

cai1991 commented 3 years ago

Hi,

I'm trying your pipeline to merge my SVs, which were generated by whole genome comparisons among several de novo assemblies, into a single vcf file. I'm wondering:

  1. Is it possible to use Jasmine to merge SVs that overlap by a high percentage but fail to meet the "max_dist" requirement? Below are two examples that Jasmine did not merge (I used only the "--output_genotypes" parameter; all others were default). Both examples correspond to the SVs shown in the figure.
    • Example 1: Same end breakpoint, 93% overlap

C3 10180346 0_INV27953 N <INV> . PASS END=10361415;SVLEN=181069;SVTYPE=INV;AVG_LEN=181069.000000;AVG_START=10180346.000000;AVG_END=10361414.000000;SUPP_VEC_EXT=10;IDLIST_EXT=INV27953;SUPP_EXT=1;SUPP_VEC=10;SUPP=1;SVMETHOD=JASMINE;IDLIST=INV27953
C3 10192856 1_INV34939 N <INV> . PASS END=10361415;SVLEN=168559;SVTYPE=INV;AVG_LEN=168559.000000;AVG_START=10192856.000000;AVG_END=10361414.000000;SUPP_VEC_EXT=01;IDLIST_EXT=INV34939;SUPP_EXT=1;SUPP_VEC=01;SUPP=1;SVMETHOD=JASMINE;IDLIST=INV34939

C3 29342378 0_INV27963 N <INV> . PASS END=29948423;SVLEN=606045;SVTYPE=INV;AVG_LEN=606045.000000;AVG_START=29342378.000000;AVG_END=29948422.000000;SUPP_VEC_EXT=10;IDLIST_EXT=INV27963;SUPP_EXT=1;SUPP_VEC=10;SUPP=1;SVMETHOD=JASMINE;IDLIST=INV27963
C3 29342378 1_INV34950 N <INV> . PASS END=29973346;SVLEN=630968;SVTYPE=INV;AVG_LEN=630968.000000;AVG_START=29342378.000000;AVG_END=29973345.000000;SUPP_VEC_EXT=01;IDLIST_EXT=INV34950;SUPP_EXT=1;SUPP_VEC=01;SUPP=1;SVMETHOD=JASMINE;IDLIST=INV34950

  2. For insertions, how should the length of the variant be indicated in the SVLEN INFO field? Does SVLEN equal the length of the inserted sequence? Is the example below correct?

C1 498768 INS37 N <INS> . PASS END=498768;ChrB=C1;StartB=496550;EndB=496651;Parent=SYN44;VarType=ShV;DupType=.;SVLEN=102;SVTYPE=INS;STRANDS=+

Thank you very much in advance for your help.

Best regards, Chengcheng

mkirsche commented 3 years ago

Hi Chengcheng,

Thanks a lot for your interest in using Jasmine!

As for your first question, the best approach would be to make the distance thresholds depend on the length of the variants by using the max_dist_linear parameter. While it doesn't explicitly look at overlap, it gives these large variants large distance thresholds so that they can be correctly merged with each other. I recommend something like this (though the exact values depend on the organism being studied and the upstream pipeline you are using):

max_dist_linear=0.1 min_dist=50 --mutual_distance

The --mutual_distance parameter was added to the GitHub build only very recently, so it is not yet in the conda release if you are using that, but it will be added to conda in the next release later this week. Just to briefly explain the parameters:
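As a rough illustration of why a length-proportional threshold helps with the inversions from the question, here is a minimal Python sketch. It is not Jasmine's code. It assumes the per-variant threshold is the larger of min_dist and max_dist_linear times the variant length, that "mutual" means the distance must satisfy both variants' thresholds, and it uses a simplified distance (the larger of the two breakpoint offsets) that may differ from the metric Jasmine actually applies. The names SV, linear_threshold, and within_thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SV:
    """Minimal SV record holding only what this illustration needs."""
    sv_id: str
    start: int
    end: int

    @property
    def length(self) -> int:
        return self.end - self.start


def linear_threshold(sv: SV, max_dist_linear: float, min_dist: int) -> float:
    # Assumed rule: the threshold scales with variant length, floored at min_dist.
    return max(float(min_dist), max_dist_linear * sv.length)


def within_thresholds(a: SV, b: SV, max_dist_linear: float = 0.1,
                      min_dist: int = 50) -> bool:
    # Simplified distance: the larger of the start and end offsets.
    # Jasmine's real distance computation may differ.
    dist = max(abs(a.start - b.start), abs(a.end - b.end))
    # Assumed reading of "mutual": the pair must satisfy BOTH thresholds.
    limit = min(linear_threshold(a, max_dist_linear, min_dist),
                linear_threshold(b, max_dist_linear, min_dist))
    return dist <= limit


# Example 1 from the question: same end breakpoint, starts 12,510 bp apart.
inv_a = SV("INV27953", 10180346, 10361415)
inv_b = SV("INV34939", 10192856, 10361415)

print(linear_threshold(inv_a, 0.1, 50), linear_threshold(inv_b, 0.1, 50))
print(within_thresholds(inv_a, inv_b))
```

Under these assumptions both thresholds come out well above the 12,510 bp start offset, so the pair would qualify, whereas a small fixed threshold would reject it.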

For your second question, that format is correct. Jasmine can infer the length from the REF and ALT fields if they are filled out (so if they are e.g. A and ATGTATGCGT it will automatically use 9 as the SVLEN value). But if not, it falls back to the SVLEN field.
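The length inference described above can be sketched in Python as follows. This is an illustration of the stated behaviour (REF/ALT first, SVLEN as fallback), not Jasmine's implementation; the helper names parse_info and infer_svlen are hypothetical, and treating an allele as "filled out" only when it consists of A, C, G, T, or N is an assumption.

```python
from typing import Optional

BASES = set("ACGTN")


def parse_info(info: str) -> dict:
    """Split a VCF INFO string into a key -> value dict (flags map to '')."""
    fields = {}
    for item in info.split(";"):
        if item:
            key, _, value = item.partition("=")
            fields[key] = value
    return fields


def is_sequence(allele: str) -> bool:
    # Assumed test for a "filled out" allele: non-empty and made of bases only,
    # which excludes symbolic alleles such as <INS> and the missing value ".".
    return bool(allele) and set(allele.upper()) <= BASES


def infer_svlen(ref: str, alt: str, info: str) -> Optional[int]:
    # Prefer explicit sequences: length is the ALT/REF size difference.
    if is_sequence(ref) and is_sequence(alt):
        return len(alt) - len(ref)
    # Otherwise fall back to the SVLEN INFO field, if present.
    svlen = parse_info(info).get("SVLEN")
    return int(svlen) if svlen is not None else None


print(infer_svlen("A", "ATGTATGCGT", ""))                          # sequence-resolved
print(infer_svlen("N", "<INS>", "END=498768;SVLEN=102;SVTYPE=INS"))  # symbolic ALT
```

The first call reproduces the A / ATGTATGCGT case from the reply; the second mirrors the symbolic <INS> record from the question, where the length has to come from SVLEN.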

I hope that helps, and please don't hesitate to reach out with any other questions!

Best, Melanie

cai1991 commented 3 years ago

Hi Melanie,

Thanks a lot for your clear explanation. I will try based on your suggestions.

Best regards, Chengcheng

cai1991 commented 3 years ago

Hi Melanie,

I see you have added a new parameter (min_overlap) in Jasmine to set the minimum reciprocal overlap. I'm wondering how it works? If two variants have reciprocal overlap greater than "min_overlap", will Jasmine still take "max_dist_linear" or "max_dist" into account to decide whether to merge or not?

Best, Chengcheng

mkirsche commented 3 years ago

Hi Chengcheng,

When using this parameter, the overlap requirement is in addition to the breakpoint distance requirement. So Jasmine checks only variant pairs with breakpoints which are within the required merging distance of one another, and then among those only merges those with sufficient overlap.

I would still recommend using the max_dist_linear parameter to merge variant pairs which have high overlap but also large breakpoint distances, but this new setting is available in case you also want to avoid merging variant pairs with small breakpoint distances but little overlap.
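To make the "in addition to" logic concrete, here is a small Python sketch of a two-stage check: first the breakpoint distance, then the reciprocal overlap. It is an assumption-laden illustration rather than Jasmine's code. Reciprocal overlap is taken here as the shared length divided by the longer variant's length, the distance is simplified to the larger breakpoint offset, and the function names are hypothetical.

```python
from typing import Tuple

Interval = Tuple[int, int]  # (start, end)


def reciprocal_overlap(a: Interval, b: Interval) -> float:
    """Shared length divided by the longer interval's length (0.0 if disjoint)."""
    shared = min(a[1], b[1]) - max(a[0], b[0])
    if shared <= 0:
        return 0.0
    return shared / max(a[1] - a[0], b[1] - b[0])


def should_merge(a: Interval, b: Interval, max_dist: float,
                 min_overlap: float) -> bool:
    # Stage 1: only pairs whose breakpoints are close enough are considered.
    # (Simplified distance; Jasmine's metric may differ.)
    if max(abs(a[0] - b[0]), abs(a[1] - b[1])) > max_dist:
        return False
    # Stage 2: among those, require sufficient reciprocal overlap.
    return reciprocal_overlap(a, b) >= min_overlap


# Example 1 from the question (about 93% overlap, starts 12,510 bp apart).
inv_a = (10180346, 10361415)
inv_b = (10192856, 10361415)

print(round(reciprocal_overlap(inv_a, inv_b), 3))
print(should_merge(inv_a, inv_b, max_dist=1000, min_overlap=0.9))   # fails stage 1
print(should_merge(inv_a, inv_b, max_dist=20000, min_overlap=0.9))  # passes both
# Close breakpoints but little overlap: rejected by stage 2.
print(should_merge((100, 110), (109, 120), max_dist=50, min_overlap=0.5))
```

The point of the sketch is that high overlap alone never rescues a pair that fails the distance check, which is why a length-dependent distance threshold is still needed for large variants.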

Best, Melanie

