jaclyn-taroni opened this issue 5 years ago
Would it be possible to re-run MashMap with the `/usr/bin/time` utility to report its memory usage? Comparing the peak memory usage with the RAM size would help. My first guess is that it's running out of memory with the parameters `--pi 80 -s 500`.
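For example, using the parameters from this report (GNU time's `--verbose` output includes a `Maximum resident set size` line, which is the peak memory):

```bash
/usr/bin/time --verbose mashmap -r reference.masked.genome.fa \
    -q Homo_sapiens.GRCh38.cdna.all.fa -t 8 --pi 80 -s 500
```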
Hi,
I got the same error when running on a cluster; the job was killed by the scheduler because of memory and showed the same error.
@cjain7, do you know how much memory it needs to run this kind of alignment? It would be the transcriptome against the genome.
I set the memory limit to 200 GB and it wasn't enough.
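For reference, a limit like that is typically set in the job script. Assuming a SLURM cluster (an assumption on my part; the file names below are placeholders too), it would look something like this:

```bash
#!/bin/bash
# Hypothetical SLURM batch script; the scheduler kills the job
# if its memory use exceeds the requested limit.
#SBATCH --mem=200G
#SBATCH --cpus-per-task=8
mashmap -r genome.fa -q transcriptome.fa -t 8 --pi 80 -s 500
```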
Thanks!
I've just finished running it on human GENCODE data and annotation. It took ~80 GB of memory for me to finish.
Thank you for testing that!
I was using Ensembl, and maybe that is the difference. Can I ask for the tool version and the size of the files (maybe the number of characters) so I can compare with my transcriptome? Thanks so much!
No problem @lpantano.
Not to swarm the issue with Salmon-related files, but gentrome.fa for GENCODE human comes out to be around 477 MB, while the Ensembl one is around 431 MB. If you are looking for human Ensembl decoys, we have uploaded them here. You can also follow up or raise a request for creating decoys for a non-model organism here: https://github.com/COMBINE-lab/SalmonTools/issues/5. We would be happy to create those for you.
Thanks all for the replies. I am out of the office today, but I will run this with GNU time when I get back early next week and see if that gives us any additional insight.
@k3yavi, thanks. All good: 100 GB was enough; I messed up the configuration, sorry about that. Good to know about the resources. Thanks so much for your time, I really appreciate the help!
Hi @cjain7,
When I run `/usr/bin/time` with `--verbose`, the output is:
```
Command terminated by signal 11
Command being timed: "mashmap -r reference.masked.genome.fa -q Homo_sapiens.GRCh38.cdna.all.fa -t 8 --pi 80 -s 500"
User time (seconds): 1269.39
System time (seconds): 64.50
Percent of CPU this job got: 273%
Elapsed (wall clock) time (h:mm:ss or m:ss): 8:07.51
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 48309816
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1
Minor (reclaiming a frame) page faults: 56777549
Voluntary context switches: 195530
Involuntary context switches: 332377
Swaps: 0
File system inputs: 106068536
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```
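(For scale: the maximum resident set size of 48,309,816 kB is roughly 46 GiB.)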
Thank you!
You claim that "Mashmap can map a human genome assembly to the human reference genome in about one minute total execution time and < 4 GB memory using just 8 CPU threads." Why, then, does it take > 64 GB of RAM to run a human alignment in the SalmonTools decoy script with MashMap?
The performance is highly dependent on the length (`-s`) and identity (`--pi`) requirements provided to MashMap.
When looking for long approximate matches that are highly similar, the algorithm can compute a sparse LSH sketch to execute the computation. That was the case when comparing two human genome assemblies (`--pi 95 -s 5000`).
When looking for short, divergent matches (`--pi 80 -s 500`, i.e., segment length 500 and a 20% error rate in your application), it needs a dense sketch to identify them. Hence the large memory use and runtime in your specific case. (The MashMap paper is a good reference for a detailed discussion of this.)
One possible suggestion is to check whether relaxing (i.e., increasing) the minimum identity/length requirements makes sense for the application, as in the example below. If that is doable, the algorithm will execute much faster and with much less memory.
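For example (the values here are hypothetical; the right thresholds depend on what the decoy-generation step can tolerate):

```bash
# Longer segments and a higher identity floor allow a sparser sketch,
# which should cut both memory use and runtime.
mashmap -r reference.masked.genome.fa -q Homo_sapiens.GRCh38.cdna.all.fa \
        -t 8 --pi 90 -s 1000
```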
The other way around this problem would be to partition the reference into smaller chunks and run those independently, but this pipeline will require a bit more engineering to aggregate the results; a sketch follows.
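A minimal sketch of that idea, assuming samtools is available, that per-chromosome chunks are an acceptable partition, and that this MashMap build accepts `-o` for the output file (all assumptions on my part):

```bash
# Index the reference, split it by chromosome, map each chunk
# independently, then concatenate the per-chunk mappings.
# Peak memory should track the largest chunk, not the whole genome.
samtools faidx reference.masked.genome.fa
mkdir -p chunks maps
cut -f1 reference.masked.genome.fa.fai | while read -r CHROM; do
  samtools faidx reference.masked.genome.fa "$CHROM" > "chunks/${CHROM}.fa"
  mashmap -r "chunks/${CHROM}.fa" -q Homo_sapiens.GRCh38.cdna.all.fa \
          -t 8 --pi 80 -s 500 -o "maps/${CHROM}.map"
done
cat maps/*.map > mashmap.out
```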
For context: I am attempting to create an augmented FASTA file to add decoy sequence to a Salmon index, as noted in the release notes of the most recent version of Salmon (0.14.0): https://github.com/COMBINE-lab/salmon/releases/tag/v0.14.0
The authors provide a script that makes use of MashMap to do so here: https://github.com/COMBINE-lab/SalmonTools/blob/master/scripts/generateDecoyTranscriptome.sh
I get `Segmentation fault (core dumped)` when the script reaches the MashMap step at this line: https://github.com/COMBINE-lab/SalmonTools/blob/23eac847decf601c345abd8527eed5dc1b382573/scripts/generateDecoyTranscriptome.sh#L105
This can be reproduced from the command line:
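```bash
# Same invocation as shown in the /usr/bin/time output above
mashmap -r reference.masked.genome.fa -q Homo_sapiens.GRCh38.cdna.all.fa -t 8 --pi 80 -s 500
```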
The relevant inputs to `generateDecoyTranscriptome.sh` used to generate `reference.masked.genome.fa`, along with the transcript FASTA, are:

I'm using a Docker image with the v2.0 release of MashMap. (It can be pulled from jtaroni/2019-chi-training, and MashMap is installed like so: https://github.com/AlexsLemonade/RNA-Seq-Exercises/blob/d6e5f8627c75e55e572e9061f0498388ebb7d212/Dockerfile#L91.) This also occurs running on my Ubuntu 18.04 machine with 64 GB of RAM, outside the container.
Any ideas about what may be happening would be appreciated. Thank you!