steineggerlab / foldseek

Foldseek enables fast and sensitive comparisons of large structure sets.
https://foldseek.com
GNU General Public License v3.0
693 stars 91 forks

precomputed pdb lookup and sequence don't line up #258

Open tn-7 opened 3 months ago

tn-7 commented 3 months ago

I ran: foldseek databases PDB pdb tmp

The first line of the resulting pdb file is: MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMIQQKRWDEWAVNMAKSRWYNQTPNRAKRVITTFRTGTWDAYK

However, this doesn't correspond to the first line of pdb.lookup, which is 200l_A.

Instead, a BLAST search shows that this sequence belongs to 145l_A, which is on line 159 of the lookup:

grep -i -a -n 145L_A pdb.lookup
159:158 145l_A 121

How is the ordering done so that the IDs match?

milot-mirdita commented 3 months ago

The database entries are not stored in order. They are stored in our internal MMseqs2 database format: https://github.com/soedinglab/MMseqs2/wiki#mmseqs2-database-format

The lookup file points to a database key (first column of the .lookup file), which points to the .index (again first column). In the index you can look up the byte offset (second column) that points into the data file.
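The chain described above (name in .lookup, then key, then byte offset in .index, then data file) can be sketched in Python. This is only an illustration, not part of Foldseek: the function names `read_lookup`, `read_index`, and `fetch_entry` are hypothetical, and the sketch assumes a single unsplit data file, whitespace-separated columns, an entry length in the third .index column, and entries terminated by a null byte, as described in the MMseqs2 database format documentation.

```python
def read_lookup(lookup_path):
    """Map entry name (second .lookup column) to database key (first column)."""
    name_to_key = {}
    with open(lookup_path) as fh:
        for line in fh:
            cols = line.split()
            if len(cols) >= 2:
                name_to_key[cols[1]] = cols[0]
    return name_to_key


def read_index(index_path):
    """Map database key (first .index column) to (byte offset, length)."""
    key_to_span = {}
    with open(index_path) as fh:
        for line in fh:
            cols = line.split()
            if len(cols) >= 3:
                # Assumption: third column is the entry length in bytes.
                key_to_span[cols[0]] = (int(cols[1]), int(cols[2]))
    return key_to_span


def fetch_entry(db_prefix, name):
    """Return the data-file entry for `name`, e.g. '145l_A'.

    Assumes the files are db_prefix (data), db_prefix + '.index',
    and db_prefix + '.lookup', with one unsplit data file.
    """
    key = read_lookup(db_prefix + ".lookup")[name]
    offset, length = read_index(db_prefix + ".index")[key]
    with open(db_prefix, "rb") as fh:
        fh.seek(offset)          # jump to the byte offset from the index
        raw = fh.read(length)
    # Assumption: the entry ends at the first null byte; drop the trailing newline.
    return raw.split(b"\0", 1)[0].rstrip(b"\n").decode()
```

Because the lookup goes through the key and offset, the order of lines in the .lookup file does not have to match the order of entries in the data file, which is why reading the data file line by line gives mismatched IDs.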

The data file is a special case for the PDB, since we ship it as a clustered database. The full PDB data is split across two separate files, pdb_seq.0 and pdb_seq.1: the former contains only the cluster representatives and the latter all others.

I would recommend doing database manipulations with the various Foldseek/MMseqs2 commands.