rmhubley / RepeatMasker

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.

How to reduce memory cost? #261

Open yhli1992 opened 2 weeks ago

yhli1992 commented 2 weeks ago

Memory usage is too high when running RepeatMasker (~80 GB of RAM per parallel job).

(screenshot of memory usage attached)

Command line:

RepeatMasker -pa 5 -qq -a -html -poly -gff -species 'mus musculus' -e rmblast -html -gff ./repeat/sample_S1_R1.fa -dir ./sample_S1/

Info:

Search Engine: NCBI/RMBLAST [ 2.14.1+ ]

Using Master RepeatMasker Database: /home/yhli/software/RepeatMasker/Libraries/famdb
  Title    : Dfam withRBRM
  Version  : 3.8
  Date     : 2023-11-14
  Families : 308,177

Species/Taxa Search:
  Mus musculus [NCBI Taxonomy ID: 10090]
  Lineage: root;cellular organisms;Eukaryota;Opisthokonta;Metazoa;
           Eumetazoa;Bilateria;Deuterostomia;Chordata;
           Craniata <chordates>;Vertebrata <vertebrates>;
           Gnathostomata <vertebrates>;Teleostomi;Euteleostomi;
           Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;
           Mammalia;Theria <mammals>;Eutheria;Boreoeutheria;
           Euarchontoglires;Glires;Rodentia;Myomorpha;Muroidea;Muridae
Including only curated families:
  1334 families in ancestor taxa; 28 lineage-specific families
yhli1992 commented 2 weeks ago

The run time is also too high: only 90/82786 batches finished after 2 hours with -pa 5 and -qq.

yhli1992 commented 2 weeks ago

Is the whole Dfam file (~66.7 GB) being loaded into RAM for each core and mapped against all species sequences?

rmhubley commented 1 week ago

That's really bizarre. When I run your exact same command line (with a 1 MB mouse sequence) I never see more than 35 KB of memory use for each RepeatMasker thread. I get the same database creation message:

RepeatMasker version 4.1.6
Search Engine: NCBI/RMBLAST [ 2.14.1+ ]

Using Master RepeatMasker Database: /home/rhubley/projects/RepeatMasker/Libraries/famdb
  Title    : Dfam withRBRM
  Version  : 3.8
  Date     : 2023-11-14
  Families : 3,618,939

Species/Taxa Search:
  Mus musculus [NCBI Taxonomy ID: 10090]
  Lineage: root;cellular organisms;Eukaryota;Opisthokonta;Metazoa;
           Eumetazoa;Bilateria;Deuterostomia;Chordata;
           Craniata <chordates>;Vertebrata <vertebrates>;
           Gnathostomata <vertebrates>;Teleostomi;Euteleostomi;
           Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;
           Mammalia;Theria <mammals>;Eutheria;Boreoeutheria;
           Euarchontoglires;Glires;Rodentia;Myomorpha;Muroidea;Muridae
Including only curated families:
  1334 families in ancestor taxa; 28 lineage-specific families

Building species libraries in: /home/rhubley/projects/RepeatMasker/Libraries/CONS-Dfam_withRBRM_3.8/mus_musculus

Granted, my installation has all the Dfam partitions (Families: 3,618,939 vs 308,177). When RepeatMasker is run with the "-species" option it extracts only the relevant families from the FamDB Dfam partitions to search with -- here that is 1,362 families -- and it indicates where it cached the files for this and future runs. One thing to check is that it really did extract only 1,362 families. E.g., for my run:

% cd /home/rhubley/projects/RepeatMasker/Libraries/CONS-Dfam_withRBRM_3.8/mus_musculus
% fgrep -c ">" specieslib
1390
% <path_to_repeatmasker>/famdb.py families -ad "mus musculus" | grep -c "DF00" 
1362

The specieslib is what is compared against 60 kb batches of your input sequence -- in this case 1,390 families. You may note that this is slightly higher than what RepeatMasker reported (1334 + 28 = 1362 families). The reported count incorrectly omits the RepBase families that were also included, since we have both Dfam and RepBase configured for this installation -- we'll have to fix that. In any case, this is a trivial amount of sequence to align and should not be using that much memory.
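For a rough sense of scale, something like this (just a sketch, run in that same cache directory) totals up the consensus bases in the specieslib; for mouse it should come out to only a couple of megabases:

% awk '!/^>/ { bp += length($0) } END { print bp " bp" }' specieslib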

Since that memory usage number is so astounding, the only other thing I can think to check (besides the size of the specieslib) is your rmblast installation. I am not sure what flavor of unix you are using or the kernel version, but I have seen strange incompatibilities with pre-compiled binaries. It might be worth pulling down the source for rmblast and compiling it on this system (http://www.repeatmasker.org/rmblast/). Also, take a quick look at your input sequences to make sure that there is nothing strange about the way sample_S1_R1.fa is formatted. Just to give you a comparison, running 1 MB of sequence with this same command line on our system took 12 seconds to complete.
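One quick way to narrow this down (just a sketch -- adjust the file names and the subset size) is to time a roughly 1 Mbp subset of your reads with GNU time and watch the search-engine processes while it runs:

% head -n 20000 sample_S1_R1.fa > subset.fa      # adjust the line count to get ~1 Mbp of sequence
% /usr/bin/time -v RepeatMasker -pa 1 -qq -species 'mus musculus' -e rmblast subset.fa -dir ./subset_test/
% # in a second terminal: resident memory (kB) of the rmblastn processes
% ps -C rmblastn -o rss,etime,args

GNU time's "Maximum resident set size" and the per-process RSS from ps should make it clear whether the memory is going to rmblastn or to the RepeatMasker perl process itself.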

And if convenient, I am happy to try running on your input file to see if I can reproduce the problem using that alone.

yhli1992 commented 1 week ago

@rmhubley Thanks for your reply.

According to my test, it seems to be caused by the 2 kb extension that is applied by default. RAM usage reached 80 GB during the "analyzing fasta file" step.

RepeatMasker transparently splits large sequences into fragments of 60 kb with 2 kb overlaps.

My input FASTA file is WES data, using only the R1 reads: 7.5 GB in size, 150 bp per read, converted from a BAM file with samtools fasta. Because this data is unreleased, I can only share the head of the FASTA file:

>LH00410:67:223CCHLT4:5:1145:34777:21087
GCTGCGGGGAGCAGCCGTACGGGACATAGCTGTCCCGTGCGCAGAGACCCAGGGCCGCGCTTCCCTCGCAGAAACCGCAGCGCGATGGCCCGCTGACTGCGGCCATGCAGGCTTGAGCGGACTCCCGGCACACGCAGGGGACAACCACGGG
>LH00410:67:223CCHLT4:5:1262:40620:3746
GCGGGGAGCAGCCGTACGGGACATAGCTGTCCCGTGCGCAGAGACCCAGGGCCGCGCTTCCCTCGCAGAAACCGCAGCGCGATGGCCCGCTGACTGCGGCCATGCAGGCTTGAGCGGACTCCCGGCACACGCAGGGGACAACCACGGGCGC
>LH00410:67:223CCHLT4:5:2478:32705:27391
GCGGGGAGCAGCCGTACGGGACATAGCTGTCCCGTGCGCAGAGACCCAGGGCCGCGCTTCCCTCGCAGAAACCGCAGCGCGATGGCCCGCTGACTGCGGCCATGCAGGCTTGAGCGGACTCCCGGCACACGCAGGGGACAACCACGGGCGC
>LH00410:67:223CCHLT4:5:2348:44981:26284
GCCGTACGGGACATAGCTGTCCCGTGCGCAGAGACCCAGGGCCGCGCTTCCCTCGCAGAAACCGCAGCGCGATGGCCCGCTGACTGCGGCCATGCAGGCTTGAGCGGACTCCCGGCACACGCAGGGGACAACCACGGGCGCAGGCTGCAGG
>LH00410:67:223CCHLT4:5:1353:28724:18230
GGACATAGCTGTCCCGTGCGCAGAGACCCAGGGCCGCGCTTCCCTCGCAGAAACCGCAGCGCGATGGCCCGCTGACTGCGGCCATGCAGGCTTGAGCGGACTCCCGGCACACGCAGTGGACAACCACGGGCGCAGGCTGCAGGACTCCTGT
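For the formatting check you mentioned, commands along these lines (just sketches) should be enough to confirm the read count, the read lengths, and that there are no unexpected characters:

% grep -c "^>" sample_S1_R1.fa                                           # number of reads
% awk '!/^>/ { print length($0) }' sample_S1_R1.fa | sort -n | uniq -c   # read-length distribution
% grep -v "^>" sample_S1_R1.fa | tr -d 'ACGTNacgtn\n' | head -c 100      # anything printed here is unexpected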

By the way, if I use the R1+R2 FASTA, the duplicated-sequence-name check alone takes over 12 hours. So I use only R1 as the input, which takes 20~30 minutes in the "analyzing fasta file" step.
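For reference, a conversion along these lines (file names are placeholders) gives separate R1 and R2 files; the -N option appends /1 and /2 to the read names, which would also sidestep the duplicated-name check when using R1+R2:

% samtools fasta -N -1 sample_S1_R1.fa -2 sample_S1_R2.fa -0 /dev/null -s /dev/null sample_S1.bam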