lbcb-sci / graphmap2

GraphMap - A highly sensitive and accurate mapper for long, error-prone reads http://www.nature.com/ncomms/2016/160415/ncomms11307/full/ncomms11307.html https://www.biorxiv.org/content/10.1101/720458v1
MIT License

Segmentation fault #4

Open ydLiu-HIT opened 5 years ago

ydLiu-HIT commented 5 years ago

Hi!

Recently, I was benchmarking aligners for long spliced reads (minimap2, GMAP, GraphMap2 and deSALT). When I ran graphmap2 (v0.6.01) on mouse PacBio SMRT reads using 24 threads, I got a segmentation fault after 2.66% of the reads had been processed, as shown in the screenshot:

image

The reads I used can be found at https://www.ncbi.nlm.nih.gov/sra/?term=SRR6238555

How can I prevent segmentation faults?

thanks Michael

mjoppich commented 5 years ago

I can report the very same for the yeast genome. Reads used: SRR5989373.

Have you been able to resolve the issue?

If I call it with --ambiguity 0.5 --secondary --min-bin-perc 0.01 --bin-step 0.99 --max-regions 20 --mapq -1 --spliced --chain-min-cov 40 (which according to the help is equivalent), no reads align ...

jmaricb commented 4 years ago

Hi @ydLiu-HIT,

Which reference were you using? Could you send the link or share it?

Thanks

mjoppich commented 4 years ago

Hi @jmaricb

since I got a very similar problem, maybe you could use my case instead:

I used the reads from SRR5989373 together with Ensembl release 94:

ftp://ftp.ensembl.org/pub/release-94/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.94.gtf.gz

ftp://ftp.ensembl.org/pub/release-94/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna_sm.toplevel.fa.gz

Thanks for looking into this!

ydLiu-HIT commented 4 years ago

> Hi @ydLiu-HIT,
>
> which reference were you using? Could you send the link or share it?
>
> Thanks

I used the GRCm38 genome from Ensembl release 92. Link: ftp://ftp.ensembl.org/pub/release-92/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz

jmaricb commented 4 years ago

@mjoppich @ydLiu-HIT I have located this segmentation fault. It was happening in the ksw2 aligner, which would crash for very big references or very big queries. I have added an exception for this case and tried it with the dataset that you linked (https://www.ncbi.nlm.nih.gov/sra/?term=SRR6238555), and it didn't crash. Can you try the newest commit and let me know if it still crashes?
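To illustrate the kind of guard described above, here is a minimal Python sketch of a size check placed in front of a dynamic-programming aligner. It is not the actual GraphMap2 change, which is not shown in this thread. The names `MAX_DP_CELLS`, `AlignmentTooLarge`, `check_alignment_size` and `guarded_align` are hypothetical, and the cell budget is an arbitrary assumption.

```python
from typing import Callable, Optional

# Hypothetical budget for the DP matrix; the limit GraphMap2 actually uses
# is not stated in this thread.
MAX_DP_CELLS = 10**9


class AlignmentTooLarge(Exception):
    """Raised when a query/target pair exceeds the assumed DP budget."""


def check_alignment_size(query: str, target: str,
                         max_cells: int = MAX_DP_CELLS) -> None:
    """Raise AlignmentTooLarge if len(query) * len(target) exceeds max_cells."""
    cells = len(query) * len(target)
    if cells > max_cells:
        raise AlignmentTooLarge(
            f"{len(query)} x {len(target)} = {cells} cells exceeds {max_cells}"
        )


def guarded_align(query: str, target: str,
                  aligner: Callable[[str, str], object],
                  max_cells: int = MAX_DP_CELLS) -> Optional[object]:
    """Call aligner(query, target), or return None if the pair is too large."""
    try:
        check_alignment_size(query, target, max_cells)
    except AlignmentTooLarge:
        # Skip the oversized pair instead of letting the aligner crash.
        return None
    return aligner(query, target)
```

With a guard like this, an oversized read would be reported as unaligned rather than taking down the whole run.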

ydLiu-HIT commented 4 years ago

Hi jmaricb:

I just re-ran GraphMap2 (v0.6.3) with the same reads and reference as before, but it still gets a segmentation fault, as follows:

[11:54:03 BuildIndexes] Loading reference sequences.
[11:55:35 SetupIndex] Building the index for shape: '11110111101111'.
[11:55:55 Create] Allocated memory for a list of 1362768835 seeds (128 bits each) (0.00003 sec, diff: 19.92923 sec).
[11:55:55 Create] Memory consumption: [currentRSS = 7801 MB, peakRSS = 7889 MB]
[11:55:55 Create] Collecting seeds.
[11:55:55 Create] Minimizer seeds will be used. Minimizer window is 5.
[12:03:44 Create] [currentRSS = 37193 MB, peakRSS = 49390 MB] Sequence: 44/44, len: 91744698, name: 'chrY'''
[12:03:50 Create] Final memory allocation after collecting seeds: [currentRSS = 37694 MB, peakRSS = 49390 MB]
[12:03:50 Create] Sorting the seeds using 24 threads.
[12:06:33 Create] Generating the hash table.
[12:07:01 Create] Calculating the distribution statistics for key counts.
[12:07:02 Create] Index statistics: average key count = 132.646856, max key count = 3457358.000000, std dev = 1632.888478, percentil (99.00%) (count cutoff) = 1181.000000
[12:07:31 Create] Memory consumption: [currentRSS = 38466 MB, peakRSS = 49390 MB]
[12:07:31 SetupIndex] Finished building index.
[12:07:31 SetupIndex_] Storing the index to file: '/data/ydliu/Reference/mouse_GRCm38.fa.gmidx'.
[12:13:57 Index] Memory consumption: [currentRSS = 35868 MB, peakRSS = 49390 MB]
[12:13:57 Run] Hits will be thresholded at the percentil value (percentil: 99.000000%, frequency: 1181).
[12:13:57 Run] Minimizers will be used. Minimizer window length: 5
[12:13:57 Run] Reference genome is assumed to be linear.
[12:13:57 Run] One or more similarly good alignments will be output per mapped read. Will be marked secondary.
[12:13:57 ProcessReads] All reads will be loaded in memory.
[12:14:47 ProcessReads] All reads loaded in 49.92 sec (size around 3144 MB). (3213849871 bases)
[12:14:47 ProcessReads] Memory consumption: [currentRSS = 39749 MB, peakRSS = 49390 MB]

[1]+ Segmentation fault (core dumped) ~/software/graphmap2/bin/Linux-x64/graphmap2 align -x rnaseq -r /data/ydliu/Reference/mouse_GRCm38.fa -d /data2/ydliu/ONT_reads/SMRT/mouse/SRR6238555.fasta -o mouse_graphmap2.sam -t 24

ydLiu-HIT commented 4 years ago

The reference I was using: https://drive.google.com/file/d/1OwfUcsJ8iqvuKOR0UJpb1PYRYCl1pLbN/view?usp=sharing

jmaricb commented 4 years ago

I am looking into this again right now. As far as I can see from your comment, the tool crashed right after loading the reads into memory, before aligning a single read, right? Right now that doesn't happen for me. It aligns reads slowly, but it hasn't crashed yet. I will try to see what happens and will let you know.
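Since the crash is hard to reproduce on another machine, one way to narrow it down is to bisect the read set until a single offending read remains. The Python sketch below is only an illustration under stated assumptions: the crash is deterministic, one read is enough to trigger it, and the platform is POSIX. The helper names (`read_fasta_records`, `crashes_graphmap2`, `find_crashing_read`) and the `subset.fasta` / `subset.sam` file names are hypothetical. The `graphmap2 align -x rnaseq -r ... -d ... -o ...` invocation mirrors the command posted earlier in this thread.

```python
import os
import subprocess
from typing import Callable, List, Optional


def read_fasta_records(path: str) -> List[str]:
    """Return each FASTA record (header plus sequence lines) as one string."""
    records: List[str] = []
    current: List[str] = []
    with open(path) as handle:
        for line in handle:
            if not line.strip():
                continue  # ignore blank lines
            if line.startswith(">") and current:
                records.append("".join(current))
                current = []
            current.append(line)
    if current:
        records.append("".join(current))
    return records


def crashes_graphmap2(records: List[str], reference: str,
                      binary: str = "graphmap2", workdir: str = ".") -> bool:
    """Run graphmap2 on a subset of reads; True if it was killed by a signal."""
    reads_path = os.path.join(workdir, "subset.fasta")  # hypothetical name
    sam_path = os.path.join(workdir, "subset.sam")      # hypothetical name
    with open(reads_path, "w") as handle:
        handle.writelines(records)
    result = subprocess.run(
        [binary, "align", "-x", "rnaseq", "-r", reference,
         "-d", reads_path, "-o", sam_path]
    )
    # On POSIX, a negative return code means termination by a signal
    # (for example -11 for SIGSEGV).
    return result.returncode < 0


def find_crashing_read(records: List[str],
                       crashes: Callable[[List[str]], bool]) -> Optional[str]:
    """Bisect records down to one that triggers the crash, or return None."""
    if not records or not crashes(records):
        return None
    while len(records) > 1:
        half = len(records) // 2
        first = records[:half]
        # Assumption: if the first half is clean, the culprit is in the second.
        records = first if crashes(first) else records[half:]
    return records[0]
```

For example, `find_crashing_read(read_fasta_records("SRR6238555.fasta"), lambda subset: crashes_graphmap2(subset, "mouse_GRCm38.fa"))` would return the record to attach to a bug report. Every probe runs the aligner in full, so on a mouse-sized reference this is slow. If the crash depends on thread scheduling rather than on a specific read, the bisection will not converge on a meaningful result.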