Closed PawelSzczerbiak closed 1 year ago
According to the ESMatlas documentation, highquality_clust30 is a set of representative structures. I'm not currently aware of any representative-to-member mapping of clusters in ESMatlas.
Thanks for your answer! The problem is that when I try to uncompress the 37M structures in highquality_clust30 (identified by MGnify ID in the second column of the lookup table), I get only ~1M cluster representatives (identified by raw ID number in the first column of the lookup table). This seems too few: that DB weighs ~112 GB, whereas e.g. afdb_uniprot_v4, containing ~214M predictions, weighs 1 TB. So, are the remaining ~36M predictions in highquality_clust30 accessible in any way?
Thank you for reporting this. I found that there was an index error in the database which prevented access to 36M entries. A fixed version of the database has been uploaded. Getting only 1M was an error, and I apologize for the confusion. I'm still not sure about this, but these 36M predictions seem to be cluster representatives.
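For anyone who wants to sanity-check their copy of the lookup table, here is a minimal sketch. It assumes the lookup table is a plain tab-separated text file with the raw ID number in the first column and the MGnify ID in the second, as described above. The function name `summarize_lookup` and the file path in the usage comment are hypothetical, not part of any tool.

```python
from typing import Iterable, Tuple


def summarize_lookup(lines: Iterable[str]) -> Tuple[int, int, int]:
    """Return (rows, distinct first-column IDs, distinct second-column IDs).

    Assumption: each line is tab-separated, with the raw ID number in
    column 1 and the MGnify ID in column 2. Extra columns are ignored.
    """
    rows = 0
    raw_ids = set()
    mgnify_ids = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            # Skip blank or malformed lines instead of failing.
            continue
        rows += 1
        raw_ids.add(fields[0])
        mgnify_ids.add(fields[1])
    return rows, len(raw_ids), len(mgnify_ids)


# Usage (the path is a placeholder for your local lookup table):
# with open("highquality_clust30.lookup") as handle:
#     print(summarize_lookup(handle))
```

If the second number is much smaller than the first, several rows share the same first-column ID. Note that holding tens of millions of IDs in Python sets needs several GB of memory.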
Hi, thanks for the clarification. Is there anything special about the 1M IDs that were accessible in the erroneous version, or were they just a random pool?
The index was written incorrectly under multi-threading, so the 1M IDs were just the ones written by a single thread.
Thanks!
I'm trying to extract structures from highquality_clust30, but only representative structures are returned (as I understand it, the first column in the lookup table controls the cluster ID). Do you know how to get all compressed structures in that database?