marbl / MashMap

A fast approximate aligner for long DNA sequences

Segmentation fault (core dumped) #21

Open jaclyn-taroni opened 5 years ago

jaclyn-taroni commented 5 years ago

For context: I am attempting to create an augmented FASTA file to add decoy sequences to a Salmon index, as described in the release notes for the most recent version of Salmon (0.14.0): https://github.com/COMBINE-lab/salmon/releases/tag/v0.14.0

The authors provide a script that makes use of MashMap to do so here: https://github.com/COMBINE-lab/SalmonTools/blob/master/scripts/generateDecoyTranscriptome.sh

I get "Segmentation fault (core dumped)" when the script reaches the MashMap step at this line: https://github.com/COMBINE-lab/SalmonTools/blob/23eac847decf601c345abd8527eed5dc1b382573/scripts/generateDecoyTranscriptome.sh#L105

This can be reproduced from the command line:

mashmap -r reference.masked.genome.fa -q Homo_sapiens.GRCh38.cdna.all.fa -t 8 --pi 80 -s 500
>>>>>>>>>>>>>>>>>>
Reference = [reference.masked.genome.fa]
Query = [Homo_sapiens.GRCh38.cdna.all.fa]
Kmer size = 16
Window size = 5
Segment length = 500 (read split allowed)
Alphabet = DNA
Percentage identity threshold = 80%
Mapping output file = mashmap.out
Filter mode = 1 (1 = map, 2 = one-to-one, 3 = none)
Execution threads  = 8
>>>>>>>>>>>>>>>>>>
INFO, skch::Sketch::build, minimizers picked from reference = 985533927
Segmentation fault (core dumped)

The inputs to generateDecoyTranscriptome.sh (which generates reference.masked.genome.fa), including the transcript FASTA, are:

Input File | Download
GTF | ftp://ftp.ensembl.org/pub/release-96/gtf/homo_sapiens/Homo_sapiens.GRCh38.96.gtf.gz
Genome FASTA | ftp://ftp.ensembl.org/pub/release-96/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz
Transcript FASTA | ftp://ftp.ensembl.org/pub/release-96/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz

I'm using a Docker image with the v2.0 release of MashMap. (It can be pulled from jtaroni/2019-chi-training and MashMap is installed like so: https://github.com/AlexsLemonade/RNA-Seq-Exercises/blob/d6e5f8627c75e55e572e9061f0498388ebb7d212/Dockerfile#L91).

This also occurs running on my Ubuntu 18.04 machine w/ 64GB RAM outside the container.

Any ideas about what may be happening would be appreciated. Thank you!

cjain7 commented 5 years ago

Would it be possible to re-run mashmap with the /usr/bin/time utility to report its memory usage? Comparing the peak memory usage with the RAM size would help. My first guess is that it's running out of memory with the parameters --pi 80 -s 500.
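
For anyone who prefers to script this check rather than read the /usr/bin/time report, here is a minimal Python sketch of the same measurement. The helper name peak_rss_kb is hypothetical. It relies on resource.getrusage, which is Unix-only and reports ru_maxrss in kilobytes on Linux (macOS reports bytes), so treat the units as an assumption about your platform.

```python
import resource
import subprocess
import sys


def peak_rss_kb(cmd):
    """Run cmd to completion and return the peak resident set size.

    RUSAGE_CHILDREN reports the largest ru_maxrss among all child
    processes that have been waited for so far, so call this from a
    fresh interpreter if you need a number for one command only.
    """
    # check=False: a crash (for example signal 11) must not raise here,
    # because the usage numbers are still wanted after a failure.
    subprocess.run(cmd, check=False)
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss


# Example call, using the command line from this issue:
# peak_rss_kb(["mashmap", "-r", "reference.masked.genome.fa",
#              "-q", "Homo_sapiens.GRCh38.cdna.all.fa",
#              "-t", "8", "--pi", "80", "-s", "500"])
```

Dividing the returned value by 1024**2 gives GiB on Linux, which can then be compared directly with the RAM size of the machine.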

lpantano commented 5 years ago

Hi,

I got the same error when running on a cluster: the job was killed by the scheduler because of memory, and it showed the same error.

@cjain7, do you know how much memory it needs to run this kind of alignment? In this case it would be the transcriptome against the genome.

I set the limit to 200GB and it wasn't enough.

Thanks!

k3yavi commented 5 years ago

I've just finished running it on human gencode data and annotation. It took ~80G of memory for me to finish.

lpantano commented 5 years ago

Thank you for testing that!

I was using Ensembl, and maybe that is the difference. Can I ask for the tool version and the size of the files (maybe the number of characters) so I can compare with my transcriptome? Thanks so much!


k3yavi commented 5 years ago

No problem, @lpantano. Not to swamp the issue with Salmon-related files, but gentrome.fa for GENCODE human comes out to around 477 MB, while the Ensembl one is around 431 MB. If you are looking for human Ensembl decoys, we have uploaded them here. You can also follow up or raise a request for creating decoys for non-model organisms here: https://github.com/COMBINE-lab/SalmonTools/issues/5. We would be happy to create them for you.

jaclyn-taroni commented 5 years ago

Thanks all for the replies. I am out of the office today, but I will run this with GNU time when I get back in early next week and see if that gives us any additional insight.

lpantano commented 5 years ago

@k3yavi, thanks. All good: 100GB was enough. I had messed up the configuration, sorry about that, but it is good to know about the resources. Thanks so much for your time, I really appreciate the help!

jaclyn-taroni commented 5 years ago

Hi @cjain7,

When I run /usr/bin/time with --verbose, the output is:

Command terminated by signal 11
    Command being timed: "mashmap -r reference.masked.genome.fa -q Homo_sapiens.GRCh38.cdna.all.fa -t 8 --pi 80 -s 500"
    User time (seconds): 1269.39
    System time (seconds): 64.50
    Percent of CPU this job got: 273%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 8:07.51
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 48309816
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 1
    Minor (reclaiming a frame) page faults: 56777549
    Voluntary context switches: 195530
    Involuntary context switches: 332377
    Swaps: 0
    File system inputs: 106068536
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
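
To put the report above in the same units as the RAM size, here is a small Python sketch that pulls the "Maximum resident set size" line out of GNU time's verbose output and converts it to GiB. The helper name max_rss_gib is hypothetical, and the sample text is trimmed to the relevant lines from the report above. The result is a peak of about 46 GiB, below the 64GB of RAM mentioned earlier. On its own this number does not show whether a later, larger allocation failed.

```python
import re

# Trimmed copy of the /usr/bin/time --verbose report shown above.
TIME_OUTPUT = """\
Command terminated by signal 11
    Maximum resident set size (kbytes): 48309816
    Exit status: 0
"""


def max_rss_gib(time_verbose_text):
    """Return the peak resident set size in GiB from GNU time's verbose output."""
    match = re.search(
        r"Maximum resident set size \(kbytes\):\s*(\d+)", time_verbose_text
    )
    if match is None:
        raise ValueError("no 'Maximum resident set size' line found")
    # GNU time reports kilobytes; 1024**2 kB make one GiB.
    return int(match.group(1)) / 1024**2


peak = max_rss_gib(TIME_OUTPUT)
print(f"peak RSS = {peak:.1f} GiB")  # prints "peak RSS = 46.1 GiB"
```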

Thank you!

antonkulaga commented 5 years ago

Guys, you claim "Mashmap can map a human genome assembly to the human reference genome in about one minute total execution time and < 4 GB memory using just 8 CPU threads". Why, then, should it take >64GB RAM to align human sequences in the SalmonTools decoy script with MashMap?

cjain7 commented 5 years ago

The performance is highly dependent on the length [-s] and identity [--pi] requirements provided to Mashmap. When looking for long approximate matches that are highly similar, the algorithm computes a sparse LSH sketch to execute the computation. This was the case when comparing two human genome assemblies (--pi 95 -s 5000).

When looking for short divergent matches (--pi 80 -s 500, i.e., segment length 500 and 20% error rate here in your application), it needs a dense sketch to identify them. Hence the large memory use and runtime in your specific case. (The Mashmap paper is a good reference for a detailed discussion of this.)
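
As a rough illustration of what "dense" means here, the following Python sketch is back-of-the-envelope arithmetic, not Mashmap's actual accounting. It assumes the commonly quoted expected minimizer density of 2/(w+1) for window size w on random sequence, and that Mashmap's reported window size follows the same convention. The window size (5) and minimizer count (985533927) come from the log in the original report. The larger window sizes are hypothetical values chosen only to show the trend.

```python
def expected_minimizer_density(window_size):
    """Expected fraction of positions picked as minimizers (assumed 2/(w+1) model)."""
    return 2.0 / (window_size + 1)


# Values taken from the mashmap log in the original report.
reported_minimizers = 985_533_927
reported_window = 5

# Sequence length that would yield this many minimizers under the model.
implied_bases = reported_minimizers / expected_minimizer_density(reported_window)
print(f"implied sequence length: {implied_bases / 1e9:.2f} Gbp")

# Hypothetical larger windows, for comparison only.
for window in (5, 50, 500):
    count = implied_bases * expected_minimizer_density(window)
    print(f"w={window:>3}: ~{count / 1e6:,.0f} million minimizers")
```

Under this model the sketch size falls roughly in proportion to 1/(w+1), which fits the point that stricter identity and length requirements allow a much sparser sketch.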

One possible suggestion is to see whether relaxing (i.e., increasing) the minimum identity/length requirements makes sense for the application. If that is doable, the algorithm will execute much faster, with much less memory.

The other way around this problem would be to partition the reference into smaller chunks and run those independently, but this pipeline would require a bit more engineering to aggregate the results.
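
A minimal Python sketch of the partitioning idea, under stated assumptions: the helper names partition_by_length and mashmap_command are hypothetical, the sequence names and lengths are toy values, and the command uses only the flags that appear in this thread. Since the log shows the default output file is mashmap.out, each chunk would need to run in its own working directory to avoid overwriting results. Merging the per-chunk outputs (and checking that per-chunk filtering matches what a whole-reference run would keep) is the extra engineering mentioned above and is not covered here.

```python
def partition_by_length(records, max_bases):
    """Greedily group (name, length) records into chunks of at most max_bases.

    A single record longer than max_bases still gets a chunk of its own.
    """
    chunks, current, current_bases = [], [], 0
    for name, length in records:
        # Start a new chunk when adding this record would exceed the budget.
        if current and current_bases + length > max_bases:
            chunks.append(current)
            current, current_bases = [], 0
        current.append(name)
        current_bases += length
    if current:
        chunks.append(current)
    return chunks


def mashmap_command(reference_chunk, query, threads=8):
    """Build the command from this issue for one reference chunk."""
    return ["mashmap", "-r", reference_chunk, "-q", query,
            "-t", str(threads), "--pi", "80", "-s", "500"]


# Toy sequence names and lengths, for illustration only.
records = [("chr1", 248), ("chr2", 242), ("chr3", 198), ("chr4", 190), ("chrM", 1)]
chunks = partition_by_length(records, max_bases=500)
print(chunks)  # [['chr1', 'chr2'], ['chr3', 'chr4', 'chrM']]
```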