steineggerlab / foldseek

Foldseek enables fast and sensitive comparisons of large structure sets.
https://foldseek.com
GNU General Public License v3.0

Missing taxonomic information in PDB entries #142

Open drumyerscough opened 1 year ago

drumyerscough commented 1 year ago

Hello,

I've noticed that for a particular query a small percentage (~4%) of hits from the PDB100 do not include taxonomic identifiers. The last two fields of these lines in the m8 files are "0 unclassified" even though taxonomic identifiers do seem to be present when I view the structures on the PDB website. This isn't a major problem, but it is annoying given that the server uses the PDB100 and I'm using taxonomic identifiers to remove different structures of the same protein, point mutants, etc.
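For anyone wanting to quantify the affected hits, here is a minimal sketch that separates m8 lines with and without taxonomy. It assumes the file is tab-separated and that the taxonomic identifier and name are the last two fields, as described above; the function name `split_by_taxonomy` and the example rows are hypothetical and not part of Foldseek.

```python
def split_by_taxonomy(m8_lines):
    """Partition m8 hit lines into (classified, unclassified).

    Assumption: fields are tab-separated and the last two fields are
    the taxonomic identifier and the taxon name.
    """
    classified, unclassified = [], []
    for line in m8_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # not enough columns to hold taxonomy fields
        taxid, taxname = fields[-2], fields[-1]
        # Hits without taxonomy are reported as "0" / "unclassified".
        if taxid == "0" or taxname == "unclassified":
            unclassified.append(line)
        else:
            classified.append(line)
    return classified, unclassified


# Made-up example rows (columns shortened for illustration).
example = [
    "query\t1abc_A\t0.95\t9606\tHomo sapiens\n",
    "query\t2xyz_B\t0.80\t0\tunclassified\n",
]
with_tax, without_tax = split_by_taxonomy(example)
print(len(with_tax), "classified,", len(without_tax), "unclassified")
```

The same function can be applied to a real result file by passing an open file handle, since it only iterates over lines.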

I can upload the query structure and the m8 file if needed.

Thanks!

milot-mirdita commented 1 year ago

I think these were primarily cases where the PDB mmCIF file contained weird taxonomy entries like "Species1 & Species2". I just dropped these instead of coming up with some ad-hoc solution. I am not sure if there is "cleaner" input elsewhere.
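To illustrate the kind of ad-hoc handling mentioned above, here is a rough sketch that splits a combined organism string such as "Species1 & Species2" into separate names. This is not what Foldseek does (such entries are dropped); the helper `split_organism_names` is hypothetical, and it assumes "&" is the only separator, which may not hold for every mmCIF file. Each resulting name would still have to be mapped to a taxonomic identifier separately.

```python
import re


def split_organism_names(raw):
    """Split a combined organism string on '&' into individual names.

    Assumption: '&' never occurs inside a legitimate organism name.
    """
    parts = re.split(r"\s*&\s*", raw.strip())
    # Drop empty pieces left by leading or trailing separators.
    return [part for part in parts if part]


print(split_organism_names("Species1 & Species2"))
```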