nadegeguiglielmoni / GraphUnzip

Unzip assembly graphs with Hi-C data and/or long reads.
GNU General Public License v3.0

Segmental Duplications? #5

Closed GuillaumeHolley closed 2 years ago

GuillaumeHolley commented 3 years ago

Hi,

I am trying GraphUnzip on a haplotype-resolved assembly of a human genome and could use some advice. The assembly covers a small region with many (large) segmental duplications and a very large inversion. The reads are ONT reads (~40x, N50 around 20-25 kb) corrected with Illumina reads using the upcoming version of Ratatosk, so the measured error rate is rather low (mean of 1.3%). The corrected reads were trio-binned and the haplotypes were assembled separately with Flye using the --pacbio-hifi mode and --keep-haplotypes. The assembly is a little fragmented, but that is expected because of the segmental duplications. I used svim-asm to call SVs and was rather happy to see that the large inversion was detected.

I then ran GraphUnzip with -A 0.4 -R 0.1 -mm 0.8 -e -wm to see if I could patch together some of the contigs corresponding to segmental duplications. On the positive side, a long deletion that was contained in several reads is now assembled into a single contig, which was not the case before. On the negative side, several contigs corresponding to segmental duplications that previously mapped as primary or supplementary alignments are now gone. More importantly, the inversion is gone too. Indeed, the contig made by Flye containing the inversion is now a contig of about the same length, but after mapping, the primary alignment of that contig is soft-clipped over most of the read length and has no supplementary alignment.

So I guess my questions are:

Thank you very much! Guillaume

RolandFaure commented 3 years ago

Hi,

Thanks for your feedback, it is really helpful to us. Could you rephrase what you mean exactly by "the inversion is gone"? I am not sure I understood what you meant there: is the inversion incorporated the wrong way around in a longer contig?

There is no theoretical reason why GraphUnzip should not work on segmental duplications, but we have noticed in practice that very tricky structural variants involving many repeats are still too hard for GraphUnzip (we hope to improve that in the future). It is possible that your case falls into that category.

As for the parameters, you might want to raise the -mm option: as is, two sequences may be considered overlapping even if up to 20% of their bases differ, which seems like a lot compared to the accuracy of the reads you have. I would try -mm 0.95 or even higher.
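To make the arithmetic behind this suggestion concrete, here is a minimal Python sketch. It is not GraphUnzip code: the function names are hypothetical, and it only assumes, as described above, that the -mm value is a minimum matching fraction, so that 1 - mm is the tolerated fraction of differing bases. The estimate of how much two reads of the same locus differ assumes independent errors at the same per-base rate, which is a simplification.

```python
def allowed_difference(min_match: float) -> float:
    """Fraction of differing bases tolerated for a minimum-match threshold.

    Assumption: the threshold is a matching fraction between 0 and 1.
    """
    if not 0.0 <= min_match <= 1.0:
        raise ValueError("min_match must be between 0 and 1")
    return 1.0 - min_match


def expected_pairwise_difference(error_rate: float) -> float:
    """Rough expected difference between two reads of the same locus.

    Assumption: errors are independent and occur at the same per-base rate
    in both reads, so a position matches only if both reads are correct.
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    return 1.0 - (1.0 - error_rate) ** 2


if __name__ == "__main__":
    read_error = 0.013  # mean error rate reported in the first message
    noise = expected_pairwise_difference(read_error)
    print(f"expected read-vs-read difference: {noise:.2%}")
    for mm in (0.8, 0.95):
        print(f"-mm {mm}: tolerates up to {allowed_difference(mm):.0%} differences")
```

Under these assumptions, two reads with a 1.3% error rate are expected to differ by roughly 2.6%, which is well below both the 20% tolerated at -mm 0.8 and the 5% tolerated at -mm 0.95.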

Roland

RolandFaure commented 2 years ago

Closing this issue, as it is obsolete with the new version of GraphUnzip. Do not hesitate to try improving the assembly again with the new version.