steineggerlab / foldcomp

Compressing protein structures effectively with torsion angles
GNU General Public License v3.0

Retrieving structures from `highquality_clust30` #37

Closed by PawelSzczerbiak 1 year ago

PawelSzczerbiak commented 1 year ago

I'm trying to extract structures from `highquality_clust30`, but only representative structures are returned (as I understand it, the first column of the lookup table is the cluster ID). Do you know how to get all of the compressed structures in that database?

khb7840 commented 1 year ago

According to the ESMatlas documentation, highquality_clust30 is a set of representative structures. I'm currently not aware of any representative-to-member mapping for the clusters in ESMatlas.

PawelSzczerbiak commented 1 year ago

Thanks for your answer! The problem is that when I try to decompress the 37M structures in highquality_clust30 (identified by the MGnify ID in the second column of the lookup table), I get only ~1M cluster representatives (identified by the raw ID number in the first column of the lookup table). That seems like too few: the DB weighs ~112 GB, whereas afdb_uniprot_v4, for example, contains ~214M predictions and weighs 1 TB. So, are the remaining ~36M predictions in highquality_clust30 accessible in any way?
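
The lookup-table layout described in this comment can be checked with a short script. This is only a minimal sketch, assuming the lookup file is tab-separated with the raw numeric ID in the first column and the entry name (here, the MGnify ID) in the second. The function name `summarize_lookup` and the sample rows (`entry_a` and so on) are placeholders for illustration, not real database content.

```python
import io


def summarize_lookup(handle):
    """Count rows, distinct first-column IDs, and distinct second-column names."""
    rows = 0
    keys = set()
    names = set()
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        rows += 1
        keys.add(int(fields[0]))  # assumed: first column is a numeric ID
        names.add(fields[1])      # assumed: second column is the entry name
    return {"rows": rows, "distinct_keys": len(keys), "distinct_names": len(names)}


# Placeholder data standing in for a real lookup file (assumed layout).
sample_lookup = io.StringIO("0\tentry_a\t0\n1\tentry_b\t0\n2\tentry_c\t0\n")
summary = summarize_lookup(sample_lookup)
print(summary)
```

If the number of distinct first-column IDs equals the number of rows, the first column identifies entries one-to-one rather than grouping several entries under a shared ID.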

khb7840 commented 1 year ago

Thank you for reporting this. I found that there was an index error in the database that prevented access to 36M entries. A fixed version of the database has been uploaded. Getting only 1M was an error, and I apologize for the confusion. I'm still not sure about this, but these 36M predictions seem to be representatives of clusters.
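
An index problem of this kind can be spotted by comparing the IDs in the lookup table against the IDs in the index. The sketch below assumes both files are tab-separated, that the index holds one `id, offset, length` row per accessible entry, and that the lookup holds `id, name, ...` rows. That layout is an assumption here, and `unreachable_names` and the sample rows are hypothetical.

```python
import io


def unreachable_names(lookup_handle, index_handle):
    """Return names listed in the lookup whose ID has no row in the index."""
    indexed = set()
    for line in index_handle:
        fields = line.split("\t")
        if fields[0].strip():
            indexed.add(int(fields[0]))  # assumed: first column is the entry ID

    missing = []
    for line in lookup_handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        if int(fields[0]) not in indexed:
            missing.append(fields[1])
    return missing


# Placeholder data: three names in the lookup, only one ID present in the index.
lookup = io.StringIO("0\tentry_a\t0\n1\tentry_b\t0\n2\tentry_c\t0\n")
index = io.StringIO("0\t0\t100\n")
missing = unreachable_names(lookup, index)
print(len(missing), "entries are listed in the lookup but absent from the index")
```

A large `missing` list on a real database would suggest that the index, rather than the data, is incomplete.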

PawelSzczerbiak commented 1 year ago

Hi, thanks for the clarification. Is there anything special about the 1M IDs that were accessible in the erroneous version, or were they just a random pool?

khb7840 commented 1 year ago

The index was written incorrectly with multi-threading, so the 1M accessible IDs were just the ones written by a single thread.

PawelSzczerbiak commented 1 year ago

Thanks!