soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

Segmentation fault when running convertalis with -qaln -- alignment is walking off the end #863

Open ifiddes opened 3 months ago

ifiddes commented 3 months ago

GDB shows that the segmentation fault occurs here:

    seq=0x7ffff789709c "TATTTTATTTTGTGTAGAGATGGGGTCTCACTAGGTTGCC\n",
    offset=39, bt=..., reverse=false, isReverseStrand=true,
    translateSequence=<optimized out>, translateNucl=...)

With offset = 39, seqPos = 40, and isReverseStrand = true, the line of code walks off the start of this 40 bp sequence.

This seems to be because the backtrace has a length of 41:

    (gdb) print bt
    $6 = (const std::__1::string &) @0x7fffffff2c70: {
      static __endian_factor = 2,
      __r_ = {<std::__1::__compressed_pair_elem<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::__rep, 0, false>> = {__value_ = {{__s = {{__is_long_ = 1 '\001',
                  __size_ = 24 '\030'}, __padding_ = 0x7fffffff2c71 "",
                __data_ = "\000\000\000\000\000\000\000)\000\000\000\000\000\000\000\340E\350VUU\000"}, __l = {{__is_long_ = 1, __cap_ = 24},
                __size_ = 41,
                __data_ = 0x555556e845e0 'M' <repeats 27 times>, "I", 'M' <repeats 12 times>, "D"}, __r = {__words = {49, 41,
                  93825018643936}}}}}, <std::__1::__compressed_pair_elem<std::__1::allocator<char>, 1, true>> = {<std::__1::allocator<char>> = {<std::__1::__non_trivial_if<true, std::__1::allocator<char> >> = {<No data fields>}, <No data fields>}, <No data fields>}, <No data fields>},
      static npos = 18446744073709551615}
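
The values GDB reports can be replayed in a few lines. The sketch below is not the MMseqs2 code. It assumes that `M` and `I` columns each advance the query position by one, that `D` columns do not, and that on the reverse strand the index read for a column is `offset - seqPos`. Under those assumptions it reports the first backtrace column whose index falls outside the sequence. The function name is hypothetical.

```python
def first_out_of_range_column(bt, offset, seq_len, is_reverse_strand):
    """Return (column, index) of the first out-of-range read, or None."""
    seq_pos = 0
    for column, op in enumerate(bt):
        # Assumption: the index is computed for every column, gaps included.
        index = offset - seq_pos if is_reverse_strand else offset + seq_pos
        if not 0 <= index < seq_len:
            return column, index
        # Assumption: M and I consume a query base, D is a gap in the query.
        if op in "MI":
            seq_pos += 1
    return None


# Backtrace as printed by GDB: 27 M, 1 I, 12 M, 1 D (41 columns in total).
bt = "M" * 27 + "I" + "M" * 12 + "D"
print(len(bt))
print(first_out_of_range_column(bt, offset=39, seq_len=40, is_reverse_strand=True))
```

With the values from the session above, the only out-of-range read is at the final `D` column (column 40, counting from 0), where the query position has reached 40 and the index is -1.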

I have not yet been able to identify the target sequence, so I cannot provide a minimal reproducible example, but I wanted to ask whether you have any idea what could cause this walk-off-the-edge behavior.

ifiddes commented 3 months ago

The offending alignment is this:

    113676 45 0.829 6.410E-05 39 0 40 527 566 585 39 0 0 584 27M1I12M1D

To me, this does appear to walk off the end.
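
To check the record against the sequence lengths, the CIGAR string can be tallied per operation. This is a minimal sketch with hypothetical helper names. It assumes `M` consumes a base in both sequences, `I` consumes only a query base, and `D` consumes only a target base. It also reads fields 5 to 10 of the record as query start, end, length and target start, end, length, which is an interpretation rather than something stated in this thread.

```python
import re


def tally_cigar(cigar):
    """Count alignment columns and the bases consumed on each sequence."""
    columns = query = target = 0
    for count, op in re.findall(r"(\d+)([MID])", cigar):
        n = int(count)
        columns += n
        if op in "MI":   # assumption: M and I consume a query base
            query += n
        if op in "MD":   # assumption: M and D consume a target base
            target += n
    return columns, query, target


record = "113676 45 0.829 6.410E-05 39 0 40 527 566 585 39 0 0 584 27M1I12M1D"
fields = record.split()
# Assumed layout: fields[4:7] = query start/end/length, fields[7:10] = target.
q_start, q_end, q_len, t_start, t_end, t_len = map(int, fields[4:10])
columns, query_used, target_used = tally_cigar(fields[-1])

print(columns, query_used, target_used)
print(abs(q_start - q_end) + 1, abs(t_end - t_start) + 1)
```

Under these assumptions the alignment consumes exactly the 40 query bases and 40 target positions spanned by its coordinates, yet it has 41 columns. The extra column is the trailing `D`, for which no query base is left to read.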

milot-mirdita commented 3 months ago

Could you please post the mmseqs command line and terminal output too? Ideally, also post the sequences needed to reproduce the crash.

ifiddes commented 3 months ago

I am having a hard time creating a minimal reference sequence to reproduce the crash. If I reduce the target database down to only the aligned sequence, it doesn't happen.

The command line in question is:

    mmseqs convertalis querydb targetdb --format-output query,target,qstart,qend,tstart,tend,raw,cigar,qaln,taln,qlen --search-type 3

I will continue trying to make a minimal reproducible example. I did notice that adding an N to the start of my query sequence avoids the crash.

ifiddes commented 3 months ago

I was unable to make a minimal reference, so I uploaded the full reference to Box. It is a human and mouse transcriptome. I had to split it into three parts; just concatenate them.

Here is the query:

    >GRCh38_ENSG00000103042.3491.40
    TATTTTATTTTGTGTAGAGATGGGGTCTCACTAGGTTGCC

You should be able to reproduce the crash with:

    mmseqs easy-search tmp.fasta full_ref.fa aln.out $TMPDIR --format-output query,target,qstart,qend,tstart,tend,raw,qaln,taln,qlen --search-type 3

https://app.box.com/s/bx5y7s5gpa7ybyc6xera4hujwojagphe
https://app.box.com/s/w86ynfly4gi2zt09wb0adqc3g05ox7ok
https://app.box.com/s/g50mq3skkaimb8ggunwlqwgbdz5psb6t