steineggerlab / foldseek

Foldseek enables fast and sensitive comparisons of large structure sets.
https://foldseek.com
GNU General Public License v3.0

Error std::out_of_range when creating database with specific cif files #268

Closed by QuanEvans 1 month ago

QuanEvans commented 1 month ago

Description: When creating a database from certain CIF files with foldseek createdb, the program terminates with a std::out_of_range error. The error occurs consistently with the following CIF files: 4opj, 3l2q, 3fpy, 2l2b, 6zqc, 7a2h, 4a2i, 2bde, 7ux2.

Steps to Reproduce:

  1. Download the CIF files for the specified PDB IDs.
  2. Execute the foldseek createdb command for each CIF file individually, e.g., foldseek createdb 4opj.cif 4opj_DB.
  3. Observe that the program aborts with the following error message:

       terminate called after throwing an instance of 'std::out_of_range'
         what():  vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)
       Aborted (core dumped)

Expected Behavior: The foldseek createdb command should successfully create a database for each CIF file without crashing.

Actual Behavior: The program crashes with a std::out_of_range exception when processing the mentioned CIF files.

Environment:
OS: Linux 6.5.0-28-generic #29-Ubuntu
Foldseek Version: Downloaded from this release

Additional Information: This issue may relate to the handling of certain data within the CIF files. It occurs consistently, and only with the PDB IDs listed above, suggesting a possible bug in how data from these files is processed or accessed.

milot-mirdita commented 1 month ago

I can't reproduce the crash:

wget https://files.rcsb.org/download/4OPJ.cif
wget https://github.com/steineggerlab/foldseek/releases/download/8-ef4e960/foldseek-linux-avx2.tar.gz
tar xzvf foldseek-linux-avx2.tar.gz
./foldseek/bin/foldseek createdb 4OPJ.cif out_tmp

Seems to work correctly, same on my mac with release 8 and the latest git.

QuanEvans commented 1 month ago

Sorry, I didn't initially check the files I had downloaded using the script from https://github.com/google-deepmind/alphafold/blob/dbe2a438ebfc6289f960292f15dbf421a05e563d/scripts/download_pdb_mmcif.sh. It seems these mmCIF files were corrupted during download. I've now tested with files downloaded directly from https://files.rcsb.org/download/4OPJ.cif, using both release 8 and the latest git version of Foldseek on my desktop, and everything works correctly. I will close this issue. Thank you for the support!
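
For anyone hitting the same symptom, a rough pre-check on the downloaded files might have caught the corruption before running createdb. The following is only a heuristic sketch, not a real mmCIF validator: the function name check_cif is made up for illustration, and it assumes that a usable coordinate file is non-empty, contains a data_ block header, and has an _atom_site category. It will not detect every kind of damage (for example, a file truncated partway through the atom records).

```shell
#!/usr/bin/env bash
# Heuristic sanity check for a downloaded mmCIF file (hypothetical helper,
# not part of Foldseek). Returns 0 if the file looks plausible, 1 otherwise,
# and prints a one-line reason that includes the file name.
check_cif() {
    local f="$1"
    # Missing or zero-length files are the most obvious download failures.
    if [ ! -s "$f" ]; then
        echo "$f: missing or empty"
        return 1
    fi
    # An mmCIF file contains a data block header such as "data_4OPJ".
    if ! grep -q '^data_' "$f"; then
        echo "$f: no data_ block header found"
        return 1
    fi
    # Coordinate files carry an _atom_site category; without it there are
    # no atoms to build a structure database from.
    if ! grep -q '^_atom_site\.' "$f"; then
        echo "$f: no _atom_site records found"
        return 1
    fi
    echo "$f: looks OK"
    return 0
}

# Example (not run here): for f in *.cif; do check_cif "$f"; done
```

Running the check over the whole download directory and inspecting only the files it flags keeps the comparison against a fresh copy from files.rcsb.org manageable.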

QuanEvans commented 1 month ago

BTW, as a suggestion: could Foldseek display the names of any problematic files and give clearer error messages? This would make it easier to identify and troubleshoot issues like this in the future.
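
Until something like that exists, one workaround is to run createdb on one file at a time and print the name of each file for which the command exits with a non-zero status (a crash such as the abort above counts). This is a sketch on the user side, not Foldseek functionality: report_failures and createdb_one are hypothetical helper names, and the _DB output naming simply follows the example in the issue description.

```shell
#!/usr/bin/env bash
# Run a command once per input file and print the names of the files for
# which it fails. Usage: report_failures CMD FILE...
# CMD is called with the file as its only argument.
# Returns 1 if at least one file failed, 0 otherwise.
report_failures() {
    local cmd="$1"; shift
    local failed=0 f
    for f in "$@"; do
        # Any non-zero exit status (including an abort) counts as a failure.
        if ! "$cmd" "$f" >/dev/null 2>&1; then
            echo "FAILED: $f"
            failed=$((failed + 1))
        fi
    done
    return "$((failed > 0))"
}

# Hypothetical helper: build one database per CIF file, named after the
# input file (4opj.cif -> 4opj_DB), as in the issue description.
createdb_one() {
    foldseek createdb "$1" "${1%.cif}_DB"
}

# Example (not run here): report_failures createdb_one *.cif
```

The command's own output is discarded here to keep the report short; dropping the redirection shows the original error next to each file name.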

milot-mirdita commented 1 month ago

Could you upload one of the broken files?