repeat at the SV breakpoint

tjiangHIT / cuteSV

Long read based human genomic structural variation detection with cuteSV

MIT License


repeat at the SV breakpoint #127

Open charliechen912ilovbash opened 1 year ago

charliechen912ilovbash commented 1 year ago

Hi, I'm wondering: if there is a repeat sequence (e.g. a simple repeat) at an SV (e.g. deletion) breakpoint, will it affect the accuracy of the SV position? If so, how does cuteSV v1.0.12 overcome this issue?

tjiangHIT commented 1 year ago

Hello @charliechen912ilovbash,

Sorry for replying so late. It is well known that repeat sequences disturb alignment and lead to inaccurate breakpoints on individual reads. SV callers collect the breakpoints from each read to infer SV candidates, so treating inaccurate breakpoints as SV signatures would produce low-quality SV positions. To overcome this, cuteSV clusters all breakpoint signatures within a relatively small region to generate "consensus" SV breakpoint groups, then divides them into possible SV events by their length signatures. After that, it reports the final SV calls and the corresponding genotypes. For more details please read our paper here. I hope this is helpful to you.
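To make the two-stage idea concrete, here is a minimal sketch in Python. It is not cuteSV's actual code: the `Signature` class, the function names, and the thresholds (`max_dist`, `max_ratio`, `min_support`) are all hypothetical and chosen only for illustration. It groups per-read signatures that lie close together, splits each group by SV length, and reports a median breakpoint for each event with enough support.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Signature:
    """One SV signature taken from a single read (hypothetical structure)."""
    read: str
    pos: int      # breakpoint position on the reference
    length: int   # SV length implied by this read


def cluster_by_position(sigs, max_dist=200):
    """Chain signatures whose neighbouring positions differ by <= max_dist."""
    groups = []
    for s in sorted(sigs, key=lambda s: s.pos):
        if groups and s.pos - groups[-1][-1].pos <= max_dist:
            groups[-1].append(s)
        else:
            groups.append([s])
    return groups


def split_by_length(group, max_ratio=0.3):
    """Split one positional group into events with similar SV lengths."""
    events = []
    for s in sorted(group, key=lambda s: s.length):
        if events and s.length - events[-1][-1].length <= max_ratio * s.length:
            events[-1].append(s)
        else:
            events.append([s])
    return events


def call_events(sigs, min_support=2, max_dist=200, max_ratio=0.3):
    """Return (consensus_pos, consensus_length, support) for each event."""
    calls = []
    for group in cluster_by_position(sigs, max_dist):
        for event in split_by_length(group, max_ratio):
            if len(event) >= min_support:
                pos = int(median(s.pos for s in event))
                length = int(median(s.length for s in event))
                calls.append((pos, length, len(event)))
    return calls
```

Because the reported position is the median over all supporting reads, a single read whose breakpoint was shifted by a repeat has limited influence on the final call, and the length split keeps two different SVs at the same locus from being merged.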

Best regards, Tao

baozg commented 1 year ago

Hi, Tao

But for assembly-based SV calling, does cuteSV still cluster breakpoints? Since there is only one read in the SAM file, is it possible for cuteSV to report these breakpoints?

tjiangHIT commented 1 year ago

Hello @baozg,

Thanks for pointing this out. Actually, cuteSV achieves assembly-based SV calling by converting the typical SV callsets to diploid-based SV callsets. That is, cuteSV first generates the initial SV callsets using the clustering approach mentioned above (there is still more than one SV signature at a site even though there is only one contig per haplotype). Then cuteSV resolves the haplotype tags for each SV call to give a phased genotype.
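As a rough illustration of that last step, the sketch below turns the haplotype tags supporting one SV call into a phased genotype string. This is not cuteSV's implementation: the function name `phased_genotype` and the assumption that haplotypes are tagged with the integers 1 and 2 are hypothetical.

```python
def phased_genotype(hap_tags):
    """Map the haplotype tags supporting a call to a phased genotype.

    Assumes (hypothetically) that each supporting contig carries tag 1 or 2.
    """
    tags = set(hap_tags)
    h1 = "1" if 1 in tags else "0"
    h2 = "1" if 2 in tags else "0"
    return f"{h1}|{h2}"
```

Under this assumption, a call supported only by the contig of haplotype 1 would be heterozygous on that haplotype, and a call supported by both contigs would be homozygous.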

Tao

baozg commented 1 year ago

Hi, Tao

But an inbred plant or a haploid human cell line, like A. thaliana or CHM13, has only one haplotype. Does this case also need a clustering step?

Besides, as you mentioned, if I want to call variants with cuteSV from population-level assemblies, it would be better to put all the assemblies in one alignment file so that this clustering step can refine the breakpoints, right?

Zhigui