oushujun / EDTA

Extensive de-novo TE Annotator
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1905-y
GNU General Public License v3.0

Repeat masker stalling #480

Open yanarizzieri opened 4 months ago

yanarizzieri commented 4 months ago

Hello,

I have a hybrid plant genome with two haplotypes (Hap#1 = 1.6 Gb with 22 chromosomes and Hap#2 = 1.8 Gb with 24 chromosomes). I wanted to mask the genome, but RepeatMasker seems to be stalling in a weird pattern. I was able to finish the run on Hap#1 using EDTA (2.0.1), but when trying to mask Hap#2 or the genome with both haplotypes, RepeatMasker gets stuck on the LTR identification steps (it has been a month now). I then installed EDTA version 2.2 on a new workstation, and all of the runs are stuck on the LINE identification step. I have attached the log files:

This is the log for when I tried running all 46 chromosomes:

Tue Jun 11 06:48:19 PM UTC 2024 EDTA_raw: Check dependencies, prepare working directories.

Tue Jun 11 06:48:20 PM UTC 2024 Start to find LTR candidates.

Tue Jun 11 06:48:20 PM UTC 2024 Identify LTR retrotransposon candidates from scratch.

Tue Jun 11 08:05:19 PM UTC 2024 Finish finding LTR candidates.

Tue Jun 11 08:05:19 PM UTC 2024 Start to find SINE candidates.

Tue Jun 11 10:23:20 PM UTC 2024 Finish finding SINE candidates.

Tue Jun 11 10:23:20 PM UTC 2024 Start to find LINE candidates.

Tue Jun 11 10:23:20 PM UTC 2024 Identify LINE retrotransposon candidates from scratch.

This is the log for when I tried running a single chromosome (Chr_11 -- 94Mb long):

Thu Jun 27 04:50:14 PM UTC 2024 EDTA_raw: Check dependencies, prepare working directories.

Thu Jun 27 04:50:15 PM UTC 2024 Start to find LTR candidates.

Thu Jun 27 04:50:15 PM UTC 2024 Identify LTR retrotransposon candidates from scratch.

Thu Jun 27 04:56:31 PM UTC 2024 Finish finding LTR candidates.

Thu Jun 27 04:56:31 PM UTC 2024 Start to find SINE candidates.

Thu Jun 27 05:04:11 PM UTC 2024 Finish finding SINE candidates.

Thu Jun 27 05:04:11 PM UTC 2024 Start to find LINE candidates.

Thu Jun 27 05:04:11 PM UTC 2024 Identify LINE retrotransposon candidates from scratch.
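To make a log like the ones above easier to read at a glance, here is a minimal Python sketch that finds steps that were started but never reported as finished. It assumes only the timestamp layout and the "Start to find ... / Finish finding ..." wording visible in these logs; the function names are hypothetical and not part of EDTA.

```python
from datetime import datetime

# Timestamp layout as it appears in the logs above,
# e.g. "Thu Jun 27 04:50:15 PM UTC 2024" (7 whitespace-separated words).
STAMP_FORMAT = "%a %b %d %I:%M:%S %p UTC %Y"
STAMP_WORDS = 7


def parse_log(text):
    """Return (timestamp, message) pairs for each timestamped log line."""
    events = []
    for line in text.splitlines():
        words = line.split()
        if len(words) <= STAMP_WORDS:
            continue  # blank line or no message after the timestamp
        try:
            stamp = datetime.strptime(" ".join(words[:STAMP_WORDS]), STAMP_FORMAT)
        except ValueError:
            continue  # line does not begin with a timestamp
        events.append((stamp, " ".join(words[STAMP_WORDS:])))
    return events


def unfinished_steps(events):
    """Map each step that has a 'Start' line but no 'Finish' line to its start time."""
    started = {}
    for stamp, message in events:
        if message.startswith("Start to find "):
            started[message[len("Start to find "):].split()[0]] = stamp
        elif message.startswith("Finish finding "):
            started.pop(message[len("Finish finding "):].split()[0], None)
    return started


LOG = """\
Thu Jun 27 04:50:15 PM UTC 2024 Start to find LTR candidates.
Thu Jun 27 04:56:31 PM UTC 2024 Finish finding LTR candidates.
Thu Jun 27 04:56:31 PM UTC 2024 Start to find SINE candidates.
Thu Jun 27 05:04:11 PM UTC 2024 Finish finding SINE candidates.
Thu Jun 27 05:04:11 PM UTC 2024 Start to find LINE candidates.
"""

for step, since in unfinished_steps(parse_log(LOG)).items():
    print(f"{step} started at {since} and has not finished")
```

For the single-chromosome log, the only unfinished step is LINE, which matches the observation that the run stalls there.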

I recently realized that I could also try EDTA_raw.pl with --type line. I have just started that run to see whether it also stalls.
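One way to test a single TE type without waiting indefinitely is to wrap the call in a time limit. The following is a rough sketch, not an official recipe: `--type line` comes from the post above, while `--genome` and `--threads` are assumed flag names that should be checked against `EDTA_raw.pl -h` for the installed version. The helper names and the file name `Chr_11.fa` are hypothetical.

```python
import subprocess


def build_raw_command(genome, te_type, threads=4):
    """Assemble an EDTA_raw.pl call for one TE type.

    Assumption: --genome and --threads are the correct flag names
    for the installed EDTA version; only --type appears in the thread.
    """
    return [
        "EDTA_raw.pl",
        "--genome", genome,
        "--type", te_type,
        "--threads", str(threads),
    ]


def run_with_time_limit(command, limit_hours):
    """Run a command; return its exit code, or None if the time limit was hit."""
    try:
        done = subprocess.run(command, timeout=limit_hours * 3600)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child before raising.
        return None
    return done.returncode
```

A call might look like `run_with_time_limit(build_raw_command("Chr_11.fa", "line"), limit_hours=48)`. Note that the timeout only kills the direct child process; programs it launched in turn may need to be cleaned up separately.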

Any suggestions?

Thank you so much,

Y

oushujun commented 4 months ago

Hello,

Sometimes, when the server is busy, LTR_retriever can stall. You need to restart this step if the program has not progressed for several days. Using EDTA_raw.pl is a good way to identify each type of TE separately in the raw step.
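Deciding whether a step is "not progressing" can be partly automated by checking when anything in the working directory was last written. This is only a heuristic sketch under the assumption that an active step touches files from time to time; it is not something EDTA does itself, and the function names are hypothetical.

```python
import os
import time


def newest_mtime(root):
    """Most recent modification time (epoch seconds) of any file under root, or None."""
    newest = None
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            try:
                mtime = os.path.getmtime(os.path.join(folder, name))
            except OSError:
                continue  # file vanished between listing and stat
            if newest is None or mtime > newest:
                newest = mtime
    return newest


def looks_stalled(root, max_idle_hours, now=None):
    """True if no file under root was modified within the last max_idle_hours."""
    newest = newest_mtime(root)
    if newest is None:
        return True  # nothing has been written at all
    now = time.time() if now is None else now
    return (now - newest) > max_idle_hours * 3600
```

A quiet directory does not prove a stall, since some programs compute in memory for a long time, so it is worth also checking CPU usage before restarting anything.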

Thanks! Shujun

yanarizzieri commented 3 months ago

Hello Shujun,

I have some updates! EDTA2 is still running in default mode (for around 8 weeks now), and it really seems to be a problem with identifying LINEs.

I tried a couple of things for troubleshooting:

  1. Running EDTA2 with individual chromosomes, instead of the whole assembly
  2. Running EDTA_raw.pl for each type of repeats
  3. Running the old version of EDTA
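For the first item, splitting the assembly into one file per chromosome can be done with a few lines of Python. This is a minimal sketch, assuming a plain (uncompressed) FASTA file whose header lines begin with a name that is safe to use as a file name; `split_fasta` is a hypothetical helper, not an EDTA utility.

```python
import os


def split_fasta(fasta_path, out_dir):
    """Write each FASTA record to its own file; return the list of output paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    out = None
    try:
        with open(fasta_path) as handle:
            for line in handle:
                if line.startswith(">"):
                    if out is not None:
                        out.close()
                    # Use the first word of the header as the file name,
                    # falling back to a counter if the header is empty.
                    words = line[1:].split()
                    name = words[0] if words else f"record_{len(paths) + 1}"
                    path = os.path.join(out_dir, name + ".fa")
                    out = open(path, "w")
                    paths.append(path)
                if out is not None:
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return paths
```

Each returned path can then be passed to a separate run, which also makes it easier to see whether a stall is tied to a particular chromosome.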

Results:

  1. Assembly size does not seem to be the problem: even with a single chromosome, EDTA2 still got stuck on LINE identification.

  2. I was able to run EDTA_raw.pl for each type of repeat and build the libraries (LTR, TIR, helitrons and SINE), but when I run EDTA_raw.pl for LINEs, it gets stuck at the same place as the default run! That run has been going for more than a month. To me, this reinforces that something is going on with the LINE identification (it could be my genome?).

  3. The old version of EDTA ran fine!

Thank you so much for your efforts,

Y

oushujun commented 3 months ago

Hello,

The LINE search is carried out by RepeatModeler. You may try running RepeatModeler on its own or with different parameters. Your genome may be LINE-rich, in which case the search takes a long time. Also, when tandem repeats are mistaken for TEs, the program identifying those TEs can get stuck in the tandem repeat region. You may try to identify and mask high-confidence tandem repeats before running TE annotation. Please let me know how it goes!
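As an illustration of the masking step, here is a small Python sketch that hard-masks a list of intervals in a sequence. It assumes the tandem repeat coordinates have already been produced by a tandem repeat finder of your choice and converted to 0-based, half-open (start, end) pairs; the thread does not name a specific tool, and whether N-masked input behaves well in the LINE search is not confirmed here. `mask_intervals` is a hypothetical helper.

```python
def mask_intervals(sequence, intervals, mask_char="N"):
    """Replace 0-based, half-open [start, end) intervals of a sequence with mask_char."""
    masked = bytearray(sequence, "ascii")
    fill = mask_char.encode("ascii")
    for start, end in intervals:
        # Clamp to the sequence bounds so stray coordinates cannot grow the sequence.
        start = max(start, 0)
        end = min(end, len(masked))
        if start < end:
            masked[start:end] = fill * (end - start)
    return masked.decode("ascii")


print(mask_intervals("ACGTACGTAC", [(2, 5)]))  # prints ACNNNCGTAC
```

Using a `bytearray` keeps memory use close to the sequence length, which matters for chromosomes in the tens to hundreds of megabases.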

Thanks! Shujun

oushujun commented 1 week ago

Any luck?

yanarizzieri commented 1 week ago

Hi Shujun,

You previously suggested identifying and masking tandem repeats before running the TE annotation, and I haven't done it yet! I had some other stuff going on, but I will come back to this and see how the genome I'm working with responds. I will let you know once I have results!

Y