steineggerlab / foldcomp

Compressing protein structures effectively with torsion angles
GNU General Public License v3.0

Errors with compressing PDB/CIF #44

Closed yunwilliamyu closed 7 months ago

yunwilliamyu commented 7 months ago

We have been benchmarking Foldcomp on a number of examples from the PDB, especially discontinuous or multichain proteins. When Foldcomp works, it is substantially better than other tools, but very often, it fails in mysterious ways.

On 2b9v, 2fug, 2ign, 2j28, 2ja7, 2ja8, 2jbp, 2jd8, 3j7q, 3j9m, 4ug0, 4wq1, 4wro, 6gaw, and 6hif, Foldcomp appears to get stuck in a non-terminating state and keeps spinning at 100% CPU usage for days on end (we've been running each of those processes for 5 days now).

On 2ja9, 5t2a, and 6fxc, it segfaults immediately. On 2ja9, it segfaults after creating an empty output file, but for the other two examples, it segfaults even before creating the empty output file.

On 16pk, Foldcomp throws another error:

Compressing 16PK.pdb to 16PK.fcz
terminate called after throwing an instance of 'std::out_of_range'
  what():  map::at
Aborted

This behavior happens on a Debian GNU/Linux 11 machine with an Intel Xeon CPU on Google Cloud. We are currently running the latest GitHub build as of 2023-11-01, but the infinite-loop behavior also happened with the statically compiled binary in prior testing.
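For anyone trying to reproduce the three failure modes above (hang, segfault, abort) without leaving processes running for days, a timeout wrapper helps. The following is a minimal sketch, not the harness used for this report: `classify_run` is a hypothetical helper, and the timeout value is an arbitrary choice.

```python
import subprocess
import sys


def classify_run(cmd, timeout_s):
    """Run cmd and return 'ok', 'timeout', 'signal:<N>', or 'error:<code>'."""
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so nothing keeps spinning.
        return "timeout"
    if proc.returncode < 0:
        # On POSIX, a negative return code means the child died from that signal
        # (for example 11 for SIGSEGV, 6 for SIGABRT).
        return f"signal:{-proc.returncode}"
    if proc.returncode != 0:
        return f"error:{proc.returncode}"
    return "ok"


if __name__ == "__main__":
    # Stand-in command so the sketch runs without Foldcomp installed.
    print(classify_run([sys.executable, "-c", "pass"], timeout_s=10))  # prints "ok"
```

For Foldcomp itself, the command would presumably be something like `["foldcomp", "compress", "16PK.pdb", "16PK.fcz"]`, matching the log line above; treat that invocation as an assumption and check it against the installed version's help output.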

milot-mirdita commented 7 months ago

We have a few unresolved issues with experimentally derived PDB files. They mostly revolve around residue index breaks/jumps, which are relatively common in the PDB. We also have a crash if we try to compress only a single amino acid (this can happen with residue index jumps and the current splitting of discontinuous proteins into multiple outputs). You can use --skip-discontinuous to skip most of these; however, there are still other issues that might lead to errors.
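To check ahead of time whether a given PDB file has residue index jumps or would yield single-residue fragments, something like the sketch below can be used. It is an illustration only and not Foldcomp's actual splitting logic: `continuous_segments` and `ca_line` are hypothetical helpers, the parsing relies on the fixed-column PDB ATOM layout (chain ID in column 22, residue number in columns 23-26), and insertion codes and HETATM records are ignored for simplicity.

```python
def continuous_segments(pdb_lines):
    """Return (chain, first_resnum, last_resnum, n_residues) for each unbroken run."""
    segments = []
    prev_chain, prev_num = None, None
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        chain = line[21]           # column 22: chain identifier
        num = int(line[22:26])     # columns 23-26: residue sequence number
        if chain == prev_chain and num == prev_num:
            continue               # another atom of the same residue
        if chain == prev_chain and num == prev_num + 1:
            segments[-1][2] = num  # extend the current run by one residue
            segments[-1][3] += 1
        else:
            segments.append([chain, num, num, 1])  # chain change or index jump
        prev_chain, prev_num = chain, num
    return [tuple(s) for s in segments]


def ca_line(serial, chain, resnum):
    """Build a minimal ATOM line (coordinates omitted) for demonstration."""
    return f"ATOM  {serial:5d}  CA  ALA {chain}{resnum:4d}"


if __name__ == "__main__":
    # Residues 1-3, an isolated residue 10, then residues 20-21 on chain A.
    demo = [ca_line(i, "A", n) for i, n in enumerate([1, 2, 3, 10, 20, 21], start=1)]
    segs = continuous_segments(demo)
    print(segs)                              # [('A', 1, 3, 3), ('A', 10, 10, 1), ('A', 20, 21, 2)]
    print([s for s in segs if s[3] == 1])    # [('A', 10, 10, 1)]
```

A file that yields more than one segment per chain is discontinuous in this simplified sense, and any segment with a single residue corresponds to the single-amino-acid case described above.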

We have primarily focused on predicted structures from AF2 and ESM, as these don't suffer from all of the idiosyncrasies found within the PDB. We are planning to make Foldcomp eventually work on the whole PDB; for now, however, we are focusing on predicted structures.

yunwilliamyu commented 7 months ago

Ah. OK. Thanks so much for the quick response; I just wanted to let y'all know, and I'll look forward to future iterations of Foldcomp!