rmhubley / RepeatMasker

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.

Can I run ProcessRepeats in parallel? #243

Open life404 opened 5 months ago

life404 commented 5 months ago

What do you want to know? Can I run ProcessRepeats in parallel?

Helpful context

Dear Robert,

I have a 1.7 GB bat genome and ran RepeatMasker on it using the mammalian repeat records (~295 MB) from Dfam:

~/TOOLS/TETools/TETools.sif famdb.py -i ~/TOOLS/TETools/Libraries families --format fasta_name --include-class-in-name --ancestors --descendants 'Mammalia' > Dfam-Mammalia.fa

~/TOOLS/TETools/TETools.sif RepeatMasker -pa 96 -a -e ncbi -dir . -nolow -lib Dfam-Mammalia.fa -xsmall -gff genome 2>&1 | tee repeatMasker.log

However, the ProcessRepeats step is slow: it reads all of the cat files into memory and runs in a single thread, taking over 12 hours. To speed this up, I am considering splitting the genome by chromosome and running ProcessRepeats in parallel. Is this feasible?

Thank you so much.

rmhubley commented 5 months ago

You could split the *.cat file by sequences and run them independently through ProcessRepeats. We typically run large genomes by splitting them into 50MB chunks and running them through full RepeatMasker runs on a cluster (See RepeatMasker_Nextflow script here: https://github.com/Dfam-consortium/RepeatMasker_Nextflow).
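For readers who want to try the chunking approach by hand, here is a minimal sketch of grouping a genome FASTA into batches of roughly 50 MB. It is not the RepeatMasker_Nextflow implementation. The function names, the output file naming, and the policy of keeping each sequence whole (so a chromosome larger than the limit ends up alone in its own batch) are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', concatenated sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def batch_records(records: Iterable[Record],
                  max_bases: int = 50_000_000) -> Iterator[List[Record]]:
    """Group whole sequences into batches of at most max_bases bases.

    Assumption: sequences are never split, so a single sequence longer
    than max_bases forms a batch by itself.
    """
    batch: List[Record] = []
    size = 0
    for header, seq in records:
        if batch and size + len(seq) > max_bases:
            yield batch
            batch, size = [], 0
        batch.append((header, seq))
        size += len(seq)
    if batch:
        yield batch


def write_batches(fasta_path: str, out_prefix: str,
                  max_bases: int = 50_000_000) -> List[str]:
    """Write each batch to '<out_prefix>.NNNN.fa' (hypothetical naming)."""
    paths: List[str] = []
    with open(fasta_path) as fh:
        batches = batch_records(read_fasta(fh), max_bases)
        for i, batch in enumerate(batches, 1):
            path = f"{out_prefix}.{i:04d}.fa"
            with open(path, "w") as out:
                for header, seq in batch:
                    out.write(f">{header}\n")
                    # Wrap sequence lines at 60 columns.
                    for j in range(0, len(seq), 60):
                        out.write(seq[j:j + 60] + "\n")
            paths.append(path)
    return paths
```

Each resulting file could then be passed to its own RepeatMasker run, for example as separate cluster jobs. Because this sketch holds one sequence at a time in memory, very large chromosomes will need a correspondingly large amount of RAM.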

life404 commented 5 months ago

Thank you so much for your advice. I will try it.