thegenemyers / DALIGNER

Find all significant local alignments between reads

reducing memory usage #74

Closed · mictadlo closed this issue 6 years ago

mictadlo commented 6 years ago

Hi, I allocated 80 GB of memory for each job, but it was not enough. Is there a way to reduce the memory consumption? The input FASTA is 100 GB and contains this many lines:

wc -l reads.fasta 
1338537478 reads.fasta

This is how I ran DALIGNER:

DBsplit -x500 -s400 bananaDB
HPC.daligner bananaDB -mdust -H6973 -ftest -T4 
sh HPC.daligner_pbs.sh test.01.OVL

HPC.daligner_pbs.sh:

#!/bin/bash
# Submit each command in an HPC.daligner job file as its own PBS job.
# Generated for: HPC.daligner DB -mdust -H6973 -ftest -T4
# Usage: sh HPC.daligner_pbs.sh test.01.OVL
while IFS='' read -r line || [[ -n "$line" ]]; do
  cmd="$line"

  # Swap 'qsub' for 'cat' to preview the generated job scripts without submitting.
  qsub <<EOF
#!/bin/bash -l

#PBS -N HPCdaligner
#PBS -l walltime=48:00:00
#PBS -j oe
#PBS -l mem=80G
#PBS -l ncpus=4
#PBS -M m.lorenc@qut.edu.au
###PBS -m bea

cd \$PBS_O_WORKDIR

$cmd

EOF

done < "$1"

Is there a way to reduce the memory consumption?

Thank you in advance.

Michal

thegenemyers commented 6 years ago

Hi Michal,

I don't see where or how there is a memory problem. You say that 80 GB of memory is not enough, but how exactly is that manifesting, in a defective or broken process? Please advise.

If your genome is highly repetitive, then without repeat masking (and please see my other response, where I indicate you should repeat mask first), almost all local alignments found are repeat-induced, and even with 80 GB I can see many alignments getting missed or the effective -t cutoff becoming too small (is this what happened?). You must repeat mask first.
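With the companion DAMASKER module, that masking pass might look roughly like the following. This is a minimal sketch: the -g1 -c20 settings are illustrative placeholders that should be tuned to your block count and read coverage, and the exact script names are this editor's choice.

# Tandem-repeat masking: datander + TANmask, producing the "tan" track
HPC.TANmask bananaDB > tan.script
sh tan.script

# Interspersed-repeat masking: grouped daligner runs + REPmask ("rep1" track)
HPC.REPmask -g1 -c20 bananaDB > rep.script
sh rep.script

# Overlap run that soft-masks with the new tracks in addition to dust
HPC.daligner -mdust -mtan -mrep1 -H6973 -T4 -ftest bananaDB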

I would break the DB into somewhat smaller blocks, say -s250, albeit that will not solve your issue per se. I typically do that and run with 16 GB of memory, which, if repeat masking has been performed, is more than enough.
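Concretely, that would be something like the sketch below, assuming a daligner build that supports the -M option (which caps a job's memory use in GB); the -M16 value is an assumption matching the 16 GB figure above.

# Re-split into smaller 250 MB blocks, then generate jobs with a memory cap
DBsplit -x500 -s250 bananaDB
HPC.daligner -M16 -mdust -H6973 -T4 -ftest bananaDB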

Hope that helps,  Gene


mictadlo commented 6 years ago

Hi Gene, Thank you. Repeat masking first and -s250 have reduced the memory consumption to 21 GB.

Best wishes,

Michal