lbcb-sci / graphmap2

GraphMap - A highly sensitive and accurate mapper for long, error-prone reads http://www.nature.com/ncomms/2016/160415/ncomms11307/full/ncomms11307.html https://www.biorxiv.org/content/10.1101/720458v1
MIT License
67 stars 6 forks source link

Segmentation Fault on Yeast Transcriptome #11

Open mjoppich opened 4 years ago

mjoppich commented 4 years ago

Hi, I am running graphmap2 on the following data:

Reads: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR598/003/SRR5989373/SRR5989373_1.fastq.gz

Reference GTF: ftp://ftp.ensembl.org/pub/release-94/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.94.gtf.gz

Reference: ftp://ftp.ensembl.org/pub/release-94/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna_sm.toplevel.fa.gz

Using v0.6.3 (git pull and built today) I still get Segmentation Faults:

/mnt/d/dev/git/graphmap2/bin/Linux-x64/graphmap2 align --rebuild-index -x rnaseq --threads 8 -r /home/mjoppich/dev/data/genomes/Saccharomyces_cerevisiae.R64-1-1.dna_sm.toplevel.gm2.fa -d fastq/SRR5989373_1.fastq -o graphmap2/SRR5989373_1.unmod.sam
[16:45:22 BuildIndexes] Loading reference sequences.
[16:45:22 SetupIndex_] Building the index for shape: '11110111101111'.
[16:45:22 Create] Allocated memory for a list of 6078553 seeds (128 bits each) (0.00000 sec, diff: 0.03125 sec).
[16:45:22 Create] Memory consumption: [currentRSS = 38 MB, peakRSS = 38 MB]
[16:45:22 Create] Collecting seeds.
[16:45:22 Create] Minimizer seeds will be used. Minimizer window is 5.
[16:45:23 Create] [currentRSS = 170 MB, peakRSS = 223 MB] Sequence: 34/34, len: 1090940, name: 'VII'
[16:45:23 Create] Final memory allocation after collecting seeds: [currentRSS = 176 MB, peakRSS = 223 MB]
[16:45:23 Create] Sorting the seeds using 8 threads.
[16:45:23 Create] Generating the hash table.
[16:45:24 Create] Calculating the distribution statistics for key counts.
[16:45:24 Create] Index statistics: average key count = 2.130736, max key count = 3204.000000, std dev = 3.374890, percentil (99.00%) (count cutoff) = 13.000000
[16:45:25 Create] Memory consumption: [currentRSS = 559 MB, peakRSS = 754 MB]
[16:45:25 SetupIndex_] Finished building index.
[16:45:25 SetupIndex_] Storing the index to file: '/home/mjoppich/dev/data/genomes/Saccharomyces_cerevisiae.R64-1-1.dna_sm.toplevel.gm2.fa.gmidx'.
[16:45:26 Index] Memory consumption: [currentRSS = 548 MB, peakRSS = 754 MB]
[16:45:26 Run] Hits will be thresholded at the percentil value (percentil: 99.000000%, frequency: 13).
[16:45:26 Run] Minimizers will be used. Minimizer window length: 5
[16:45:26 Run] Reference genome is assumed to be linear.
[16:45:26 Run] One or more similarly good alignments will be output per mapped read. Will be marked secondary.
[16:45:26 ProcessReads] All reads will be loaded in memory.
[16:45:28 ProcessReads] All reads loaded in 2.56 sec (size around 469 MB). (239685513 bases)
[16:45:28 ProcessReads] Memory consumption: [currentRSS = 1101 MB, peakRSS = 1101 MB]
[16:52:26 ProcessReads] [CPU time: 2825.36 sec, RSS: 5221 MB] Read: 76803/241446 (31.81%) [m: 75674, u: 1122], length = 691, qname: SRR5989373.76804 5859c147-15...runAlignment.sh: line 11:  2125 Segmentation fault      (core dumped) /mnt/d/dev/git/graphmap2/bin/Linux-x64/graphmap2 align --rebuild-index -x rnaseq --threads 8 -r $GENOMEGM2 -d $FASTQ -o graphmap2/$BASENAME.unmod.sam

I can also call graphmap2 in a different fashion:

/mnt/d/dev/git/graphmap2/bin/Linux-x64/graphmap2 align --rebuild-index -x rnaseq --gtf /home/mjoppich/dev/data/genomes/Saccharomyces_cerevisiae.R64-1-1.94.gtf --threads 8 -r /home/mjoppich/dev/data/genomes/Saccharomyces_cerevisiae.R64-1-1.dna_sm.toplevel.gm2.fa -d fastq/SRR5989373_1.fastq -o graphmap2/SRR5989373_1.unmod.sam

Then the segmentation fault occurs earlier:

[16:57:43 BuildIndexes] Loading GTF annotations.
[16:57:43 BuildIndexes] Loading genomic sequences.
[16:57:43 BuildIndexes] Generating the transcriptome.
[16:57:43 GenerateTranscriptomeSeqs] Constructing the transcriptome sequences.
[16:57:43 GenerateTranscriptomeSeqs] In total, there are 7036 transcripts.
[16:57:43 SetupIndex_] Building the index for shape: '11110111101111'.
[16:57:43 Create] Allocated memory for a list of 4428323 seeds (128 bits each) (0.00000 sec, diff: 0.03125 sec).
[16:57:43 Create] Memory consumption: [currentRSS = 33 MB, peakRSS = 33 MB]
[16:57:43 Create] Collecting seeds.
[16:57:43 Create] Minimizer seeds will be used. Minimizer window is 5.
[16:57:47 Create] [currentRSS = 133 MB, peakRSS = 169 MB] Sequence: 14072/14072, len: 483, name: 'YGR296C-B_mRNA_VII'''
[16:57:47 Create] Final memory allocation after collecting seeds: [currentRSS = 133 MB, peakRSS = 169 MB]
[16:57:47 Create] Sorting the seeds using 8 threads.
[16:57:48 Create] Generating the hash table.
[16:57:48 Create] Calculating the distribution statistics for key counts.
[16:57:48 Create] Index statistics: average key count = 1.829742, max key count = 273.000000, std dev = 1.723991, percentil (99.00%) (count cutoff) = 9.000000
[16:57:49 Create] Memory consumption: [currentRSS = 325 MB, peakRSS = 424 MB]
[16:57:49 SetupIndex_] Finished building index.
[16:57:49 SetupIndex_] Storing the index to file: '/home/mjoppich/dev/data/genomes/Saccharomyces_cerevisiae.R64-1-1.dna_sm.toplevel.gm2.fa.gmidx'.
[16:57:49 Index] Memory consumption: [currentRSS = 325 MB, peakRSS = 424 MB]
[16:57:49 Run] Hits will be thresholded at the percentil value (percentil: 99.000000%, frequency: 9).
[16:57:49 Run] Minimizers will be used. Minimizer window length: 5
[16:57:49 Run] Reference genome is assumed to be linear.
[16:57:49 Run] One or more similarly good alignments will be output per mapped read. Will be marked secondary.
[16:57:49 ProcessReads] All reads will be loaded in memory.
[16:57:52 ProcessReads] All reads loaded in 2.92 sec (size around 469 MB). (239685513 bases)
[16:57:52 ProcessReads] Memory consumption: [currentRSS = 868 MB, peakRSS = 868 MB]
[16:57:53 ProcessReads] [CPU time: 8.52 sec, RSS: 960 MB] Read: 180/241446 (0.07%) [m: 93, u: 80], length = 1743, qname: SRR5989373.181 6b89fcff-b792-4f6a-a00f-...Segmentation fault (core dumped)

I'd be more than happy if you could look into this issue. Given the few introns, the yeast genome might be useful for debugging :)

jmaricb commented 4 years ago

I'm looking into this now. Will let you know when I find the error.

jmaricb commented 4 years ago

@mjoppich I have tried running the same command you did: .graphmap2/bin/Linux-x64/graphmap2 align --rebuild-index -x rnaseq --threads 8 -r Saccharomyces_cerevisiae.R64-1-1.dna_sm.toplevel.gm2.fa -d SRR5989373_1.fastq -o SRR5989373_1.unmod.sam

and I didn't get segmentation fault.

This is the .sam file it generated: https://www.dropbox.com/s/q6vsdbukzfanrue/SRR5989373_1.unmod.sam.gz You should be able to download it.

Now, it could be something nondeterministic, I will try to see what could be wrong. If you have anything more useful for me let me know.

mjoppich commented 4 years ago

Thanks for your help!

I built graphmap2 with debug flags make -j4 debug and run the whole thing in gdb.

/mnt/d/dev/git/graphmap2/bin/graphmap-debug align --rebuild-index -x rnaseq --threads 8 -r /home/mjoppich/dev/data/genomes/Saccharomyces_cerevisiae.R64-1-1.dna_sm.toplevel.gm2.fa --gtf /home/mjoppich/dev/data/genomes/Saccharomyces_cerevisiae.R64-1-1.94.gtf -d fastq/SRR5989373_1.fastq -o graphmap2/SRR5989373_1.unmod.sam

[07:58:20 ProcessReads] [CPU time: 15.72 sec, RSS: 968 MB] Read: 202/241446 (0.08%) [m: 100, u: 95], length = 1990, qname: SRR5989373.203 88f29c93-2f23-49b6-8ae...
Thread 3 "graphmap-debug" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffffcad0700 (LWP 8620)]
0x000000000809b520 in std::operator<< <std::char_traits<char> > (__c=<optimized out>, __out=...) at /usr/include/c++/7/ostream:509
509         { return __ostream_insert(__out, &__c, 1); }

(gdb) where
#0  0x000000000809b520 in std::operator<< <std::char_traits<char> > (__c=<optimized out>, __out=...) at /usr/include/c++/7/ostream:509
#1  std::operator<< <std::char_traits<char> > (__c=<optimized out>, __out=...) at /usr/include/c++/7/ostream:515
#2  AlignmentToMD[abi:cxx11](std::vector<unsigned char, std::allocator<unsigned char> >&, signed char const*, long) (alignment=...,
    ref_data=0x7ffffd2f0010 "ATGCTACGTATATACCACTCTCAACTTACCCTACTCTCACATTCCACTCCATGGCCCAGTCTCACTAAATCAGTACGATGCACTCACATCATTATTCACGGCACTTGCCTCAGCGGTTTATACCCTGTGCAATTTACCCATAAAACCCACGATTATCCACATTTTAATATCTATATCTCATTCAGCGGCTCCAAATATTG"...,  
    alignment_position_start=-119) at src/alignment/cigargen.cc:674
#3  0x00000000081125a7 in HackIntermediateMapping (mapping_data=mapping_data@entry=0x7fffe40b4d20, index=..., read=read@entry=0x9025bb0, parameters=parameters@entry=0x7ffffffed5e0, abs_ref_id=abs_ref_id@entry=1529, aln_result=...,      
    score=0x7ffffcacf1c8) at src/graphmap/process_read.cc:802
#4  0x0000000008119293 in GraphMap::RNAGenerateAlignments_ (this=this@entry=0x7ffffffed430, order_number=order_number@entry=187, mapping_data=mapping_data@entry=0x7fffe40b4d20, index=..., transcriptome=..., read=read@entry=0x9025bb0,   
    parameters=0x7ffffffed5e0, evalue_params=0x8a95450, realignment_structures=0x7ffffffecec0) at src/graphmap/process_read.cc:1047
#5  0x000000000811c185 in GraphMap::ProcessRead (this=this@entry=0x7ffffffed430, order_number=order_number@entry=187, mapping_data=mapping_data@entry=0x7fffe40b4d20, read=0x9025bb0, parameters=parameters@entry=0x7ffffffed5e0,
    evalue_params=evalue_params@entry=0x8a95450, realignment_structures=<optimized out>) at src/graphmap/process_read.cc:252
#6  0x0000000008136f07 in GraphMap::ProcessSequenceFileInParallel (this=<optimized out>, parameters=<optimized out>, reads=<optimized out>, last_time=<optimized out>, fp_out=<optimized out>, ret_num_mapped=<optimized out>,
    ret_num_unmapped=0x0) at src/graphmap/graphmap.cc:1333
#7  0x00007fffff1e695e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#8  0x00007ffffe3e76db in start_thread (arg=0x7ffffcad0700) at pthread_create.c:463
#9  0x00007ffffe93188f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

(gdb) frame 2
#2  AlignmentToMD[abi:cxx11](std::vector<unsigned char, std::allocator<unsigned char> >&, signed char const*, long) (alignment=..., 
    ref_data=0x7ffffd2f0010 "ATGCTACGTATATACCACTCTCAACTTACCCTACTCTCACATTCCACTCCATGGCCCAGTCTCACTAAATCAGTACGATGCACTCACATCATTATTCACGGCACTTGCCTCAGCGGTTTATACCCTGTGCAATTTACCCATAAAACCCACGATTATCCACATTTTAATATCTATATCTCATTCAGCGGCTCCAAATATTG"...,  
    alignment_position_start=-119) at src/alignment/cigargen.cc:674
674             md << ref_data[ref_position + j];
(gdb) list
669             }
670           }
671         } else if (cigar_array[i].op == 'D' || cigar_array[i].op == 'N') {
672           md << '^';
673           for (int32_t j=0; j<cigar_array[i].count; j++) {
674             md << ref_data[ref_position + j];
675           }
676         }
677
678         if ((i + 1) < cigar_array.size() && cigar_array[i].op != '=' && cigar_array[i+1].op != '=') {
gcc --version
gcc (Ubuntu 7.4.0-1ubuntu1~18.04) 7.4.0
Copyright (C) 2017 Free Software Foundation, Inc.

Does that help you?

mjoppich commented 4 years ago

Other question: which branch are you on? I am using the master branch (or whatever else is the default branch).

jmaricb commented 4 years ago

I am using master branch, the default one. Thank you for the help, I can see that, when generating MD string in the alignment generation, the program tries to access the negative index in the array.

Someone already reported similar issue, but on different place and it happened because his gtf file was not correct.

Now, I also ran the program in gcd and I don't get segmentation fault, but some undefined behaviour obviously is happening.

I will look into this MS string generation and I will make sure that it doesn't break. But, can you show me the stack trace you get in gdb when running without gtf file? Also are you sure that the gtf file is correct? I tried to evaluate the produced .sam file with https://github.com/lbcb-sci/RNAseqEval with this command: "python ../../RNAseqEval-newesrt0/RNAseqEval/RNAseqEval.py eval-mapping ../../yeastBug/Saccharomyces_cerevisiae.R64-1-1.dna_sm.toplevel.fa ../../yeastBug/SRR5989373_1.unmod.sam -a ../../yeastBug/Saccharomyces_cerevisiae.R64-1-1.94.gtf --no_check_strand"

and i get the error: ERROR: Duplicate chromosome name: chrchromosome

mjoppich commented 4 years ago

The GTF file is downloaded from ensembl. I would guess they know how to make a GTF file ...

The stacktrace for running without the gtf file:

[17:55:30 ProcessReads] [CPU time: 2478.91 sec, RSS: 5167 MB] Read: 76839/241446 (31.82%) [m: 75709, u: 1123], length = 943, qname: SRR5989373.76840 36d1f278-b0...
Thread 5 "graphmap-debug" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffffa9d0700 (LWP 11235)]
0x000000000809b520 in std::operator<< <std::char_traits<char> > (__c=<optimized out>, __out=...) at /usr/include/c++/7/ostream:509
509         { return __ostream_insert(__out, &__c, 1); }
(gdb) where
#0  0x000000000809b520 in std::operator<< <std::char_traits<char> > (__c=<optimized out>, __out=...) at /usr/include/c++/7/ostream:509
#1  std::operator<< <std::char_traits<char> > (__c=<optimized out>, __out=...) at /usr/include/c++/7/ostream:515
#2  AlignmentToMD[abi:cxx11](std::vector<unsigned char, std::allocator<unsigned char> >&, signed char const*, long) (alignment=..., 
    ref_data=0x7ffffc210010 "CACCACACCCACACACCACACCCACACACACACCACACCCACACACCACACCCACACACCACACCCACTACTCTAACCCTATTCTAATCCAACCCTGATCAACCTGTCTCCAAACCTACCCTCACATTACCCTACCTCTCCACTCGTTACCCTGCCCCACTCAACCATACCACTCCCACCCACCATCCATCTCTCTACTG"...,  
    alignment_position_start=-21396) at src/alignment/cigargen.cc:674
#3  0x00000000081125a7 in HackIntermediateMapping (mapping_data=mapping_data@entry=0x7ffe3112b9a0, index=..., read=read@entry=0x13326be0, parameters=parameters@entry=0x7ffffffed640, abs_ref_id=abs_ref_id@entry=31, aln_result=...,       
    score=0x7ffffa9cf1c8) at src/graphmap/process_read.cc:802
#4  0x0000000008119293 in GraphMap::RNAGenerateAlignments_ (this=this@entry=0x7ffffffed490, order_number=order_number@entry=76802, mapping_data=mapping_data@entry=0x7ffe3112b9a0, index=..., transcriptome=...,
    read=read@entry=0x13326be0, parameters=0x7ffffffed640, evalue_params=0x86a1600, realignment_structures=0x7ffffffecf20) at src/graphmap/process_read.cc:1047
#5  0x000000000811c185 in GraphMap::ProcessRead (this=this@entry=0x7ffffffed490, order_number=order_number@entry=76802, mapping_data=mapping_data@entry=0x7ffe3112b9a0, read=0x13326be0, parameters=parameters@entry=0x7ffffffed640, 
    evalue_params=evalue_params@entry=0x86a1600, realignment_structures=<optimized out>) at src/graphmap/process_read.cc:252
#6  0x0000000008136f07 in GraphMap::ProcessSequenceFileInParallel (this=<optimized out>, parameters=<optimized out>, reads=<optimized out>, last_time=<optimized out>, fp_out=<optimized out>, ret_num_mapped=<optimized out>,
    ret_num_unmapped=0x0) at src/graphmap/graphmap.cc:1333
#7  0x00007fffff1e695e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#8  0x00007ffffe3e76db in start_thread (arg=0x7ffffa9d0700) at pthread_create.c:463
#9  0x00007ffffe93188f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Not having a GTF file and still getting the error in this very same location sounds like a more general problem instead of a GTF problem ;)

The outfile remains empty btw ;)

jmaricb commented 4 years ago

The generation of transcriptome from GTF file is not working. I have to fix that. But if you run the program without --gtf and with --rebuild-index (to overwrite the index generated with gtf file) it should work, ti does for me.

mjoppich commented 4 years ago

Hi,

as mentioned in my last post:

The stacktrace for running without the gtf file:

And indeed, I always run with --rebuild-index.

The following call results in the above segfault (the one from 17:55:30):

/mnt/d/dev/git/graphmap2/bin/graphmap-debug align --rebuild-index -x rnaseq --threads 8 -r /home/mjoppich/dev/data/genomes/Saccharomyces_cerevisiae.R64-1-1.dna_sm.toplevel.gm2.fa -d fastq/SRR5989373_1.fastq -o graphmap2/SRR5989373_1.unmod.sam

Actually my feeling is that the problem lies in the -x rnaseq setting - do you also run with this setting?

jmaricb commented 4 years ago

Hi,

yes I do, I run with -x rnaseq.

I understand everything you are saying, but I don't know why you are getting segmentation fault when running without gtf and with --rebuild-index. Because, as I said, i don't get the segmentation fault in that case, both in debug and release.

I used the reference and the dataset you provided in your first post mode and I the program finishes with the results I posted above.

PS: It is true that you don't have to (or even shoudn't) run the program with -x rnaseq if you use gtf file because the generated index will not contain introns and then there is no need to use -x rnaseq.