torognes / swipe

Smith-Waterman database searches with inter-sequence SIMD parallelisation
GNU Affero General Public License v3.0

swipe crashes with uniclust30_2018_08 and a SCOP sequence #34

Closed konstin closed 2 years ago

konstin commented 3 years ago

I tried to use swipe to annotate uniclust30 with SCOP2 sequences, and a specific, innocuous-looking sequence reliably causes an error.

I'm using uniclust30_2018_08 from http://gwdu111.gwdg.de/~compbiol/uniclust/2018_08/uniclust30_2018_08.tar.gz and the following minimized FASTA file:

>8072807 FA=4003632 FA-PDBID=3KCN_A FA-UNIID=Q7UJS6
NERILLVDDDYSLLNTLKRNLSFDFEVTTCESGPEALACIKKSDPFSVIMVDMRMPGMEGTEVIQKARLISPNSVYLMLTGNQDLTTAMEAVNEGQVFRFLNKPCQMSDIKAAINAGIKQYDLVTSKEELLKKT
>8022849 FA=4000237 FA-PDBID=1Z6N_A FA-UNIID=Q9I4A4
MASYAELFDIGEDFAAFVGHGLATEQGAVARFRQKLESNGLPSALTERLQRIERRYRLLVAGEMWCPDCQINLAALDFAQRLQPNIELAIISKGRAEDDLRQRLALERIAIPLVLVLDEEFNLLGRFVERPQAVLDGGPQALAAYKAGDYLEHAIGDVLAIIEGAA

Using only the first sequence works, but with both sequences I get the following error. The second sequence looks innocuous to me, and longer sequences among those I tested worked fine.

Error: illegal string length (83).
Error parsing binary ASN.1 in database sequence definition.

swipe invocation:

swipe -d /path/to/uniclust30_2018_08_consensus.fasta -i /path/to/scop_minized.fa

log excerpt, with the hits cut out:

SWIPE 2.1.0 [May  4 2018 17:34:12]

Reference: T. Rognes (2011) Faster Smith-Waterman database searches
with inter-sequence SIMD parallelisation, BMC Bioinformatics, 12:221.

Database file:     /path/to/uniclust30_2018_08_consensus.fasta 
Database title:    uniclust30_2018_08_consensus.fasta
Database time:     May 11, 2021  2:15 PM
Database size:     3845312150 residues in 15161832 sequences
Longest db seq:    14000 residues
Query file name:   /path/to/scop_minized.fa
Query length:      134 residues
Query description: 8072807 FA=4003632 FA-PDBID=3KCN_A FA-UNIID=Q7UJS6
Score matrix:      BLOSUM62
Gap penalty:       11+1k
Max expect shown:  10
Min score shown:   1
Max matches shown: 250
Alignments shown:  100
Show gi's:         0
Show taxid's:      0
Threads:           1
Symbol type:       Amino acid

Searching..................................................done

Search started:    Thu, 13 May 2021 20:10:58 UTC
Search completed:  Thu, 13 May 2021 20:12:12 UTC
Elapsed:           74.22s
Speed:             6.942 GCUPS

<many hits...>

Searching..................................................done

Search started:    Thu, 13 May 2021 20:12:12 UTC
Search completed:  Thu, 13 May 2021 20:13:17 UTC
Elapsed:           64.55s
Speed:             9.889 GCUPS

                                                                 Score    E
Sequences producing significant alignments:                      (bits) Value

gnl|BL_ORD_ID|3272035 uc30-1808-90734638|Representative=A0A2T6HJ...   206   7e-52
gnl|BL_ORD_ID|1117575 uc30-1808-72228713|Representative=A0A2S4RD...    98   3e-19
<many more hits...>
gnl|BL_ORD_ID|791832 uc30-1808-5586998|Representative=A0A2P7AV04...    34   4.6  
gnl|BL_ORD_ID|3253374 uc30-1808-124719137|Representative=A0A1E4P...    34   6.1  
Error: illegal string length (83).
Error parsing binary ASN.1 in database sequence definition.


torognes commented 3 years ago

Thanks for reporting this issue. There is clearly a problem with reading the database and parsing the sequence description in the binary ASN.1 format created by formatdb. I have previously seen this format change a bit over time, and something similar might be going on here. I will look into the issue as soon as I have time.
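
For context on the error message: binary ASN.1 as written by the NCBI tools follows BER, where every string is preceded by a length field. Lengths up to 127 fit in a single byte; longer ones use a first byte of 0x80 plus the number of length bytes that follow. One possible reading of "illegal string length (83)", assuming the number is that first byte printed in hexadecimal (this is not confirmed anywhere in this thread), is a length spread over three bytes, i.e. a string of at least 65536 bytes. The sketch below only illustrates the encoding; `ber_length` is a made-up helper and is not taken from swipe.

```shell
# Hypothetical helper illustrating BER definite-length decoding.
# Usage: ber_length HEXBYTE [HEXBYTE...]   (bytes without the 0x prefix)
ber_length() {
    first=$((0x$1))
    shift
    if [ "$first" -lt 128 ]; then
        # Short form: the byte itself is the length.
        echo "$first"
        return
    fi
    # Long form: the low 7 bits give the number of length bytes that follow.
    # (0x80 alone means "indefinite length" in BER; it yields 0 here.)
    n=$((first & 0x7f))
    len=0
    while [ "$n" -gt 0 ]; do
        len=$(( (len << 8) | 0x$1 ))
        shift
        n=$((n - 1))
    done
    echo "$len"
}
```

For example, `ber_length 82 ff ff` gives 65535, the largest length that fits in two length bytes, while `ber_length 83 01 00 00` gives 65536.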

torognes commented 3 years ago

I think this is fixed now in commit 314123b8ce5339dd3952e541678e5f848c6b2487.

The problem was caused by very long headers in some of the database entries. I tested with the database you used, and it seems to work now.
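
To check whether a FASTA file contains unusually long headers before formatting it, something along these lines could be used. This is only a sketch: `longest_headers` is a hypothetical helper name, and no particular length threshold is implied by this thread.

```shell
# Hypothetical helper, not part of swipe or the NCBI tools.
# Usage: longest_headers FASTA [COUNT]
# Prints "<header length><TAB><line number>" for the COUNT longest
# header lines (default 5), longest first. The length excludes the '>'.
longest_headers() {
    awk '/^>/ { printf "%d\t%d\n", length($0) - 1, NR }' "$1" |
        sort -k1,1nr |
        head -n "${2:-5}"
}
```

Running `longest_headers /path/to/uniclust30_2018_08_consensus.fasta 10` would list the ten longest headers with their line numbers, which can then be inspected with `sed -n '<line>p'`.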

torognes commented 3 years ago

Fixed in release 2.1.1, which is now available.

torognes commented 3 years ago

Please note that one can also use the modern makeblastdb utility from NCBI, even the recent 2.11.0+ version, as long as the -blastdb_version 4 option is specified. The resulting databases should be compatible with those created by the old formatdb.
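
As a rough sketch, such a call could look like the following. The wrapper name `build_v4_db` is made up for illustration, a protein database is assumed (as in this issue), and any further options a given setup might need are left out.

```shell
# Hypothetical wrapper around NCBI makeblastdb.
# Usage: build_v4_db FASTA
# Builds a version 4 protein database next to the input FASTA file.
build_v4_db() {
    makeblastdb -in "$1" -dbtype prot -blastdb_version 4
}
```

After `build_v4_db /path/to/uniclust30_2018_08_consensus.fasta`, the same path would be passed to swipe with `-d`, as in the invocation at the top of this issue.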

torognes commented 3 years ago

I hope this solves the issue, @konstin.

konstin commented 2 years ago

Yes, it worked, thank you!