schneebergerlab / syri

Synteny and Rearrangement Identifier
https://schneebergerlab.github.io/syri/
MIT License

large, repetitive genome #123

Closed by pjm43 2 years ago

pjm43 commented 2 years ago

Hi, just a quick question regarding a large (12 Gb), repetitive plant genome. I'm following your tutorial with minimap2 -ax asm5 for the alignment, but the alignment is taking quite a long time for just a single, albeit large (642 Mb), chromosome (still running after 3 days). I gave it 8 CPUs, each with 32 GB of memory. It's only using a single CPU (even though I passed -t 8) and is currently using 85 GB of the allocated memory. Any ideas on how to speed up the alignment process?

Thanks in advance for any advice!

mnshgl0110 commented 2 years ago

Hi Jeff, I think the easiest solution might be to just mask these repeats. If analysis of the repeats is also required, then maybe try something like -H -f 100 -q 2k --rmq=no --secondary=no. But I am guessing here too, so it would probably be better to also ask this question at the minimap2 repo for better advice. I am also wondering how SyRI would handle such a highly complex genome. It would be great if you could let me know whether it actually finishes and how long it takes (assuming the alignment finishes).

pjm43 commented 2 years ago

Hi Manish, Thanks for the quick response! I was worried that if I masked it might somehow be problematic for the downstream SyRI analyses.

Could I ask for a little additional help with the flags you suggested: -H -f 100 -q 2k --rmq=no --secondary=no. Are these flags for minimap2?

mnshgl0110 commented 2 years ago

> I was worried that if I masked it might somehow be problematic for the downstream SyRI analyses.

SyRI would not identify structural rearrangements (SRs) in the masked regions and would instead report these masked regions as indels or not-aligned regions, which would need to be filtered out. Other than that, it should be OK.

> Are these flags for minimap2?

Yes. Here is the documentation: https://lh3.github.io/minimap2/minimap2.html

pjm43 commented 2 years ago

When you say regions would need to be filtered out - how would I do that?

mnshgl0110 commented 2 years ago

When you mask the genomes, you get a list of the masked regions. After running SyRI, you would need to remove the sequence variations (everything other than synteny, inversions, translocations, and duplications) that overlap these masked regions.

Unfortunately, I do not have any code for doing exactly that, so I cannot help with it.
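
For anyone landing here later, the filtering step described above could be sketched roughly as follows. This is not code from SyRI, only a minimal illustration under several assumptions: that syri.out rows are tab-separated with the reference chromosome, start, and end in columns 1-3 (1-based, inclusive) and the annotation type in column 11; that the masked regions are BED-style (0-based, half-open) intervals on the reference genome; and that the type labels in SEQ_VARIATION_TYPES are the ones to drop. The function names (build_mask_index, overlaps_mask, filter_syri_rows) are hypothetical. Please check all of this against your own SyRI output before relying on it.

```python
import bisect
from collections import defaultdict

# Assumption: annotation labels treated as "sequence variation / not aligned".
# Structural annotations (SYN, INV, TRANS, DUP, ...) are never dropped.
SEQ_VARIATION_TYPES = {"SNP", "INS", "DEL", "CPG", "CPL", "HDR", "TDM", "NOTAL"}


def build_mask_index(masked):
    """Merge BED-style (chrom, start, end) intervals per chromosome.

    Returns {chrom: (sorted_starts, sorted_ends)} for disjoint intervals.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in masked:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        merged = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= merged[-1][1]:  # overlapping or adjacent: extend
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return index


def overlaps_mask(index, chrom, start, end):
    """True if the 1-based inclusive range [start, end] touches a masked interval."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    lo, hi = min(start, end) - 1, max(start, end)  # to 0-based half-open
    i = bisect.bisect_left(starts, hi)  # intervals starting before hi
    # Intervals are disjoint and sorted, so the last candidate has the largest end.
    return i > 0 and ends[i - 1] > lo


def filter_syri_rows(lines, index):
    """Keep every row except sequence variations overlapping a masked region.

    Only reference coordinates are checked; rows without numeric reference
    coordinates (e.g. "-") are kept unchanged.
    """
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        chrom, start, end, ann = fields[0], fields[1], fields[2], fields[10]
        if ann in SEQ_VARIATION_TYPES and start.isdigit() and end.isdigit():
            if overlaps_mask(index, chrom, int(start), int(end)):
                continue
        kept.append(line)
    return kept
```

To use it, one would read the mask list into (chrom, start, end) tuples, call build_mask_index once, and pass the lines of syri.out through filter_syri_rows. Masked regions on the query genome would need the same check against the query columns, which this sketch does not do.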

pjm43 commented 2 years ago

Thanks for the help!