Open life404 opened 5 months ago
You could split the *.cat file by sequence and run each piece independently through ProcessRepeats. We typically handle large genomes by splitting them into 50 MB chunks and running each chunk through a full RepeatMasker run on a cluster (see the RepeatMasker_Nextflow script here: https://github.com/Dfam-consortium/RepeatMasker_Nextflow).
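As a rough illustration of the chunking idea, here is a minimal Python sketch that groups whole FASTA sequences into batches of roughly 50 MB each. It is not taken from the RepeatMasker_Nextflow script; the function names (`read_fasta`, `batch_records`, `write_batches`) and the output naming scheme are hypothetical. It also assumes sequences are kept whole, so a chromosome larger than the target ends up alone in an oversized batch, and each sequence is held in memory while it is written.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None
    parts: List[str] = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif header is not None and line:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def batch_records(records: Iterable[Record],
                  target_bases: int = 50_000_000) -> Iterator[List[Record]]:
    """Group whole sequences into batches of at most target_bases bases.

    Assumption: sequences are never split, so a single sequence longer
    than target_bases forms its own (oversized) batch.
    """
    batch: List[Record] = []
    size = 0
    for header, seq in records:
        if batch and size + len(seq) > target_bases:
            yield batch
            batch, size = [], 0
        batch.append((header, seq))
        size += len(seq)
    if batch:
        yield batch


def write_batches(fasta_path: str, out_prefix: str,
                  target_bases: int = 50_000_000, width: int = 60) -> List[str]:
    """Write each batch to '<out_prefix>.NNNN.fa' (hypothetical naming)."""
    paths: List[str] = []
    with open(fasta_path) as fh:
        batches = batch_records(read_fasta(fh), target_bases)
        for i, batch in enumerate(batches, start=1):
            path = f"{out_prefix}.{i:04d}.fa"
            with open(path, "w") as out:
                for header, seq in batch:
                    out.write(f">{header}\n")
                    # Wrap sequence lines at a fixed width.
                    for j in range(0, len(seq), width):
                        out.write(seq[j:j + width] + "\n")
            paths.append(path)
    return paths
```

Each file returned by `write_batches` could then be submitted as its own RepeatMasker job on the cluster, so that the post-processing step only ever sees the alignments for one batch at a time.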
Thank you so much for your advice. I will try it.
What do you want to know?
Can I run ProcessRepeats in parallel?

Helpful context
Dear Robert,
I have a 1.7 GB bat genome and conducted RepeatMasker analysis using mammalian repeat records (~295 MB) from Dfam:
However, the ProcessRepeats step is very slow: it reads all of the *.cat files into memory and runs in a single thread, taking over 12 hours. To speed this up, I am considering splitting the genome by chromosome and running ProcessRepeats in parallel. Is this feasible?
Thank you so much.