mkirsche / Jasmine

Jasmine: SV Merging Across Samples

How are LEN and AVG_LEN calculated ? #28

Closed clairemerot closed 2 years ago

clairemerot commented 2 years ago

Dear Melanie, I'm trying to use Jasmine to merge SVs found by different tools. It looks promising, but I'm having some trouble with the length output, as length may have been encoded a bit differently in my assembly-based SV detection (from a graph, for instance). To better understand what happens, I was wondering how the LEN and AVG_LEN fields are calculated in Jasmine's output. More specifically, are they based on the length of the sequences in the REF/ALT fields? On the SVLEN given in the input file? On the difference between END and START? And does it make a difference to use the --use_end option?

Thanks for your help and thanks for this very useful software!! Claire

mkirsche commented 2 years ago

Hi Claire,

Sorry for the late response!

The SVLEN field is computed from the REF/ALT fields if they are present and valid. Otherwise, it uses either the END field or the existing SVLEN field (or the SEQ field, if the variant sequence is encoded that way, as in older VCF formats). By default, Jasmine updates the END and SVLEN fields of output VCF entries to be consistent with each other and with the REF/ALT fields, but this can be disabled with the --leave_breakpoints option if you do not want those fields to change with respect to the input files.

The --use_end option does not affect how SVLEN is computed - it just changes the internal representation of variants (and the computation of how close together different variants are).

Finally, while SVLEN is simply copied (after the adjustment I mentioned above) from the first entry merged into each output variant, the AVG_LEN is the average of this value for all variants which were merged into that one.
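As a rough illustration of the paragraph above (a minimal Python sketch, not Jasmine's actual code), the relationship between the two fields could look like this. The function name `merged_lengths` and its input list are hypothetical; the list is assumed to hold the already-adjusted SVLEN values of the input variants in the order they were merged. Whether Jasmine rounds AVG_LEN or how it formats it is not stated here, so the sketch simply takes the arithmetic mean.

```python
from statistics import mean


def merged_lengths(svlens):
    """Return the SVLEN and AVG_LEN of one merged variant.

    svlens: adjusted SVLEN values of the merged input variants,
            in merge order (hypothetical input for illustration).
    """
    if not svlens:
        raise ValueError("at least one input variant is required")
    return {
        # SVLEN is copied from the first variant merged into the output.
        "SVLEN": svlens[0],
        # AVG_LEN is the average over all variants merged into it.
        "AVG_LEN": mean(svlens),
    }


print(merged_lengths([100, 110, 120]))
```

With three merged variants of lengths 100, 110 and 120, SVLEN would be 100 (the first entry) and AVG_LEN would be 110.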

I hope that helps clarify things!! Melanie

clairemerot commented 2 years ago

Hi Mélanie, Thanks for your answer. I suspect that when the ALT and REF fields are filled, it takes the difference between their lengths, doesn't it? That would explain why my inversions end up with length 0. I agree there is no simple solution, particularly for variants that are not simple indels (length 0 vs. some real length). I am working on a workaround that simply re-encodes the length. Thanks a lot, Claire

mkirsche commented 2 years ago

Hi Claire,

Sorry for the late reply - I hope you were able to find a workaround! To answer your question: yes, Jasmine does use the difference in REF/ALT lengths when both are filled. However, setting SVLEN to the correct length beforehand, or specifying both SVTYPE=INV and END=, should result in Jasmine using the correct variant length.
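To make the inversion case concrete, here is a small Python sketch of why the REF/ALT length difference comes out as 0, together with one possible way to re-encode SVLEN in the input before merging. This is not Jasmine's code: the helper names `ref_alt_length_difference` and `reencode_inversion_svlen` are hypothetical, and the sketch assumes that REF spans exactly the inverted interval with no padding base, so that `len(REF)` is the inversion length. If your caller adds a padding base or uses symbolic alleles, the length would need to be derived differently (for example from END).

```python
def ref_alt_length_difference(ref, alt):
    """Length difference between ALT and REF, as used for simple indels."""
    return len(alt) - len(ref)


def reencode_inversion_svlen(vcf_line):
    """Rewrite SVLEN for an inversion record (hypothetical preprocessing).

    Assumes REF covers exactly the inverted interval (no padding base).
    Non-inversion records are returned unchanged.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    ref, info = fields[3], fields[7]
    entries = [entry for entry in info.split(";") if entry]
    if "SVTYPE=INV" not in entries:
        return "\t".join(fields)
    # Drop any existing SVLEN and replace it with the REF length.
    entries = [entry for entry in entries if not entry.startswith("SVLEN=")]
    entries.append(f"SVLEN={len(ref)}")
    fields[7] = ";".join(entries)
    return "\t".join(fields)


# An inversion whose ALT is the reverse complement of REF: equal lengths.
record = "chr1\t1000\tinv1\tACGTAC\tGTACGT\t.\tPASS\tSVTYPE=INV;SVLEN=0"
print(ref_alt_length_difference("ACGTAC", "GTACGT"))
print(reencode_inversion_svlen(record))
```

Because REF and ALT have the same length for an inversion, the difference is 0, which matches the behaviour Claire observed. The re-encoded record carries `SVLEN=6` instead.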

Melanie