kcleal / dysgu

Toolkit for calling structural variants using short or long reads
MIT License

Getting SV length in dysgu output vcf #78

Closed TizianaS92 closed 9 months ago

TizianaS92 commented 10 months ago

Hi there, I'm trying to plot the length of the SVs detected by dysgu. I noticed there is a specific parameter in the INFO column of the output .vcf (SVLEN), but it only appears to be accurate for DEL, DUP and INV types. I calculated the SV length as END (from the INFO column) minus the POS column.

For INS cases the actual length is always different from SVLEN. Also, I was wondering: why is the SVLEN parameter missing completely in TRA variants?

How do I calculate the variant length in these two cases? And is it correct to look at 'SVLEN' in general?

Also, is there a lower/upper limit for the variants detected by the software? My lowest value is SVLEN=30, my highest is 47,303,837 (almost a whole chromosome? I think it's quite unlikely that this SV is real, but I do have a few variants with SVLEN > 1,000,000).

Thank you in advance.

kcleal commented 10 months ago

Hi @TizianaS92,

As you have identified, the SVLEN is generally not accurate for INS type events using paired-end reads. This is because paired-end reads do not span the whole insertion SV, so the size is estimated from the insert-size metrics. However, there are some cases where the SV length of INS events is accurate, although these are usually shorter events. SVLEN for insertions is quite accurate when the SV is spanned by an alignment (SVs usually less than 50 bp), or when supplementary mappings were used in calculating the SVLEN parameter. You can tell if this was the case if the metric WR > 0 (Within-Read), or the parameter LPREC=1 (length precise from supplementary mapping). If you want accurate SVLEN for insertions, you can try another caller like manta, which will try to assemble the full insertion, although it will have lower sensitivity.

Translocation events are not labelled with an SV length, as there is no clear way to define one for them.

The lower limit for SV length detection is 30 bp, controlled using the --min-length parameter. Hope that helps.
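To make the rules above concrete, here is a minimal Python sketch that labels how far a record's length can be trusted. It is not part of dysgu: the helper names `parse_info` and `sv_length` are hypothetical, and it assumes the record's INFO column carries `SVTYPE`, `END`, `SVLEN`, `WR` and `LPREC` keys with translocations labelled `TRA`, as discussed in this thread. Only the standard library is used.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag-style keys map to True."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def sv_length(pos, info):
    """Return (length, label) for one record (hypothetical helper).

    pos  -- the POS column as an int
    info -- dict produced by parse_info
    """
    svtype = info.get("SVTYPE")
    if svtype == "TRA":
        # Translocations carry no SV length.
        return None, "undefined"
    if svtype == "INS":
        svlen = info.get("SVLEN")
        if svlen is None:
            return None, "missing"
        # Trust SVLEN only when the insertion was seen within a read
        # (WR > 0) or sized from supplementary mappings (LPREC=1).
        precise = float(info.get("WR", 0)) > 0 or info.get("LPREC") == "1"
        label = "precise" if precise else "estimated"
        return abs(int(svlen)), label
    # DEL, DUP, INV: span on the reference (raises KeyError if END is absent).
    return int(info["END"]) - pos, "reference span"
```

For plotting, one option is to keep insertions labelled "estimated" in a separate series so they are not read as measured lengths.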

TizianaS92 commented 10 months ago

Hello, thank you for your reply. Yes, I suspected INS length wasn't accurate, because the longest I found (according to SVLEN) was just 523 bp (suspiciously short).

As I said before, I also found abnormally long DUP, INV and DEL variants. I'm working on Solanum lycopersicum (whose longest chromosome is around 90,000 kb), and about 200 variants out of roughly 85,000 exceed 100 kb, with 70 longer than 1,000 kb and 22 longer than 10,000 kb.

Here are the 5 longest ones:

Type | Length (bp)
INV | 30377893
DUP | 31411671
INV | 33061041
DEL | 47303837
DUP | 47303962

What do you think about this? We expect individuals showing this variant not to be viable at all or have severely compromised phenotypes, so we were wondering if these findings may be a bug or something.

kcleal commented 10 months ago

Hi @TizianaS92 , here are some thoughts:

It is possible, perhaps likely, that the larger variants are false positives, but it is difficult to tell without being able to see the raw data. You can generate some images of the variant breakpoints using gw and share them here if you wish https://github.com/kcleal/gw

A large SVLEN does not always mean the whole chromosome is affected. For example, the SV could be a copy-and-paste event of a mobile element, which can give the appearance of a whole chromosome being affected when only a small amount of DNA has been copied.

It might be wise to filter your SVs against another control sample (if available), or possibly to look at removing lower-quality events around repeats. The SU and PROB fields in the VCF output can help here, as well as the FCC (fold-coverage-change) field.
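As a rough illustration of filtering on those fields, the Python sketch below keeps a record only if its SU and PROB values pass a threshold. The helper names and the cut-offs (`min_su=4`, `min_prob=0.5`) are hypothetical and not dysgu recommendations. Since a given field may sit in INFO or in the per-sample FORMAT column, the sketch reads both and lets the first sample's value take priority; FCC could be checked in the same way.

```python
def record_fields(line):
    """Merge INFO key/values with the first sample's FORMAT values."""
    cols = line.rstrip("\n").split("\t")
    fields = {}
    for item in cols[7].split(";"):
        key, sep, value = item.partition("=")
        if key:
            fields[key] = value if sep else "1"
    if len(cols) > 9:
        # Per-sample values override INFO values of the same name.
        fields.update(zip(cols[8].split(":"), cols[9].split(":")))
    return fields


def as_number(value, default=0.0):
    """Convert a field to float, treating missing or '.' as the default."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return default


def keep_record(line, min_su=4, min_prob=0.5):
    """Return True if the record passes both (hypothetical) thresholds."""
    fields = record_fields(line)
    return (as_number(fields.get("SU")) >= min_su
            and as_number(fields.get("PROB")) >= min_prob)


def filter_vcf(lines, **thresholds):
    """Yield header lines unchanged and only the records that pass."""
    for line in lines:
        if line.startswith("#") or keep_record(line, **thresholds):
            yield line
```

Passing an open file handle to `filter_vcf` and writing out the yielded lines gives a filtered VCF. Plotting the lengths of the surviving records before and after filtering is one way to see whether the very long calls are mostly low-support events.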