oushujun / EDTA

Extensive de-novo TE Annotator
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1905-y
GNU General Public License v3.0

LINE search takes much longer compared to other steps #421

Open foriin opened 5 months ago

foriin commented 5 months ago

Hi Shujun,

This is not a bug report, but a question. I've noticed that when I run EDTA on the Drosophila genome, the LINE search takes an extraordinary amount of time. The Drosophila genome is populated mostly by LTRs, yet the LINE search takes 5-10 times longer than the other steps. Is there a way to speed up this step? If it is purely RepeatMasker/RepeatModeler or BLAST, maybe it could be run in parallel? I can't understand how running RepeatModeler on a 150 Mb genome with 16 cores in parallel could take 10 hours...

Cheers, Artem

oushujun commented 5 months ago

Hi Artem,

Unfortunately, this is the case. The LINE search is carried out by RepeatModeler, which is slow even on small genomes. Because RepeatModeler's search is based on copy number and multiple alignments, splitting the genome into small subsets may lose families that are already low-copy. You can run EDTA on an SSD, which will significantly improve your RepeatModeler/RepeatMasker runs because they are I/O-intensive.

Shujun
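To make the point about low-copy families concrete, here is a toy sketch in Python. It is not RepeatModeler's actual algorithm: the copy positions, the `MIN_COPIES` threshold, and both function names are hypothetical, chosen only to show how per-chunk copy counts fall when a genome is split.

```python
from collections import Counter


def copies_per_chunk(positions, genome_length, n_chunks):
    """Count how many copies of one family fall into each equal-sized chunk."""
    chunk_size = -(-genome_length // n_chunks)  # ceiling division
    return Counter(pos // chunk_size for pos in positions)


def detectable(positions, genome_length, n_chunks, min_copies):
    """True if at least one chunk holds min_copies or more copies."""
    counts = copies_per_chunk(positions, genome_length, n_chunks)
    return any(count >= min_copies for count in counts.values())


# Hypothetical low-copy family: 4 copies spread across a 150 Mb genome.
GENOME = 150_000_000
positions = [10_000_000, 50_000_000, 90_000_000, 130_000_000]
MIN_COPIES = 3  # hypothetical detection threshold, not a RepeatModeler parameter

print(detectable(positions, GENOME, 1, MIN_COPIES))   # True: all 4 copies seen together
print(detectable(positions, GENOME, 16, MIN_COPIES))  # False: each chunk sees 1 copy
```

With the genome kept whole, all four copies are visible to a single search. Split into 16 chunks, no chunk contains more than one copy, so a copy-number-based search running independently on each chunk would have nothing to align.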

foriin commented 5 months ago

Thanks, Shujun. The cluster I ran EDTA on has only SSDs, I think :) I see the problem now: we would need to parallelize RM, but that would require communication between all the parallel jobs. Could you please tell me which specific part of RM is responsible for the LINE search?

oushujun commented 5 months ago

RM2 is described here: https://www.pnas.org/doi/10.1073/pnas.1921046117. Fig. 1 shows the workflow. Currently, the whole RM2 workflow is executed, and SINE/LINE elements are harvested from the final output of RM2. If a particular module could be separated out, or RM2 further accelerated, that would be great!

Shujun
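As a rough illustration of what harvesting SINE/LINE elements from a classified repeat library could look like, here is a minimal Python sketch. It assumes FASTA headers of the form `>name#Class/Superfamily`, the naming typically seen in RepeatModeler's classified consensus output. The function name and the example records are hypothetical, and this is not necessarily how EDTA implements the step.

```python
def harvest_families(fasta_lines, wanted=("LINE", "SINE")):
    """Yield the FASTA lines of records whose class (after '#') is in wanted."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # Assumed header layout: >name#Class/Superfamily optional description
            fields = line[1:].split()
            header = fields[0] if fields else ""
            _, _, classification = header.partition("#")
            keep = classification.split("/")[0] in wanted
        if keep:
            yield line


# Hypothetical records for illustration only.
example = [
    ">rnd-1_family-1#LINE/L1 ( example description )",
    "ACGTACGTACGT",
    ">rnd-1_family-2#LTR/Gypsy",
    "TTGACCATGGAA",
    ">rnd-1_family-3#SINE/tRNA",
    "GGCCTTAAGGCC",
    ">rnd-1_family-4#Unknown",
    "ATATATATATAT",
]

for line in harvest_families(example):
    print(line)
```

Because the filter only reads the finished library, it does nothing to shorten the run itself: the full workflow still has to complete before there is anything to harvest.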