lh3 / fermikit

De novo assembly based variant calling pipeline for Illumina short reads

High RAM usage on simulated samples #9

Open morgantaschuk opened 9 years ago

morgantaschuk commented 9 years ago

Hi,

I have 50x 1000 Genomes data simulated with ART (http://www.niehs.nih.gov/research/resources/software/biostatistics/art/). I'm trying to run fermikit on this data, but our cluster kills the job once it passes 170 GB of RAM. Do you have any suggestions for decreasing memory usage?

fermi.kit/fermi2.pl unitig -s3g -t16 -l126 -p art_50x_fermikit "fermi.kit/seqtk mergepe NA12877_50x_1.fq.gz NA12877_50x_2.fq.gz" > art_50x_fermikit.mak
make -f art_50x_fermikit.mak
fermi.kit/run-calling -t16 reference/hg19_random.fa art_50x_fermikit.mag.gz | sh

The only thing in the log is the following:

bash -c '/u/mtaschuk/git/fermikit/fermi.kit/bfc -s 3g -t 16 <(~/git/fermikit/fermi.kit/seqtk mergepe NA12877_50x_1.fq.gz NA12877_50x_2.fq.gz) <(~/git/fermikit/fermi.kit/seqtk mergepe NA12877_50x_1.fq.gz NA12877_50x_2.fq.gz) 2> art_50x_fermikit.ec.fq.gz.log | gzip -1 > art_50x_fermikit.ec.fq.gz'

I'm using the NA12877 VCF from GiaB, converted to a FASTA reference, and then simulating reads with ART with the following characteristics:

lh3 commented 9 years ago

Simulators usually generate reads with a much higher error rate. The peak memory of the error corrector is sensitive to the error rate. This is an issue with fermikit.
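
One rough way to check whether the simulated reads are noisier than a real run is to compare the error rate implied by the base qualities in each FASTQ file. The following is a minimal sketch, not part of fermikit: the function names are hypothetical, and it assumes Phred+33 qualities, unwrapped four-line FASTQ records, and that the quality strings written by the simulator actually reflect the errors it injected.

```python
import gzip


def mean_error_rate(fastq_lines, max_reads=100000):
    """Mean per-base error probability implied by Phred+33 qualities.

    fastq_lines: any iterable of text lines in 4-line FASTQ layout.
    max_reads: stop after this many records to keep the check quick.
    """
    total_err = 0.0
    total_bases = 0
    for i, line in enumerate(fastq_lines):
        if i // 4 >= max_reads:
            break
        if i % 4 != 3:  # the quality string is the 4th line of each record
            continue
        for ch in line.rstrip("\r\n"):
            q = ord(ch) - 33               # Phred+33 offset (assumption)
            total_err += 10 ** (-q / 10)   # P(error) = 10^(-Q/10)
            total_bases += 1
    return total_err / total_bases if total_bases else 0.0


def mean_error_rate_gz(path, max_reads=100000):
    """Same estimate for a gzip-compressed FASTQ file."""
    with gzip.open(path, "rt") as fh:
        return mean_error_rate(fh, max_reads)
```

Running `mean_error_rate_gz("NA12877_50x_1.fq.gz")` on the simulated reads and on a real Illumina run of similar depth gives two numbers to compare. A clearly higher value for the simulated file would be consistent with the explanation above, though it does not by itself show how much memory the error corrector will need.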