steineggerlab / foldcomp

Compressing protein structures effectively with torsion angles
GNU General Public License v3.0

Will Foldcomp database for ESM atlas be released? #17

Closed by yakomaxa 1 year ago

yakomaxa commented 1 year ago

Dear foldcomp developer team,

Thanks to foldcomp and the pre-compiled foldcomp database distributions, I have finished walking through the entire AlphaFold database for my bioinformatics project. Your software and databases are very helpful for comprehensive structural bioinformatics analysis, and I'd like to thank you very much.

I'm going to extend my research to the ESM atlas and am now considering whether I should compile its foldcomp database myself.

Before I start compiling an ESM foldcomp database myself, I would like to ask: do you have any plans to distribute a foldcomp database for the whole ESM atlas?

khb7840 commented 1 year ago

First, thank you for using foldcomp. Your comments help us improve the tool. And yes, we are planning to compress the ESM atlas too. We are currently working on AlphaFold UniProt v4, and we will compress the ESM atlas after that.

yakomaxa commented 1 year ago

Thank you very much for the quick response (to the many issues I've opened).

I'm glad to hear that you are going to distribute the ESM atlas as well as AFDB v4. I had not noticed that AFDB was updated to v4. Thank you for sharing the information.

I'm looking forward to seeing future releases and walking through both DBs with foldcomp.

milot-mirdita commented 1 year ago

We uploaded the esmatlas high-quality set. You can download it by calling foldcomp.setup("highquality_clust30").
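For readers following along, the download call above can be wrapped in a small Python helper. This is a minimal sketch, assuming the `foldcomp` package is installed and that `foldcomp.setup` writes the database files into the current working directory; the helper name `fetch_database` is made up for illustration and is not part of foldcomp.

```python
def fetch_database(name: str = "highquality_clust30") -> str:
    """Download a prebuilt Foldcomp database and return its name.

    Hypothetical helper: only the foldcomp.setup() call comes from the thread.
    """
    # Imported lazily so that defining the helper does not require foldcomp.
    import foldcomp

    # Assumption: the files are written to the current working directory.
    foldcomp.setup(name)
    return name


# Uncomment to start the download (it is a large transfer):
# fetch_database("highquality_clust30")
```

The call is left commented out so that running the snippet does not trigger the download by accident.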

yakomaxa commented 1 year ago

Thank you for the announcement. I've downloaded the high-quality set and found that it takes about 110GB.

I'm preparing storage for these DBs. Do you have size estimates for the full ESM atlas foldcomp DB and the AlphaFold v4 foldcomp DB? I guess the full ESM atlas DB could be about 15x larger than the high-quality DB, so it would be ~1.7TB or so, but I'm not sure what compression ratio foldcomp can actually achieve for that DB. As for the full AlphaFold DB, I know v3 took about 950GB, but I'm not sure what the difference between v3 and v4 is.

milot-mirdita commented 1 year ago

15x sounds about right. Currently we include the 37M highquality_clust30 structures (except ~100k that had some issues); the full database is about 600M structures.

The AlphaFold DB v4 should be about the same size as v3; it just contains fixes for the broken structures.
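The back-of-the-envelope estimate discussed here can be written out as a short calculation. This is only a sketch using the figures quoted in the thread; it assumes the compressed size scales linearly with the number of structures, which the thread does not confirm (average protein length may differ between the high-quality set and the full atlas).

```python
# Figures quoted in this thread.
HIGH_QUALITY_SIZE_GB = 110        # downloaded highquality_clust30 set
HIGH_QUALITY_STRUCTURES = 37e6    # ~37M structures
FULL_ATLAS_STRUCTURES = 600e6     # ~600M structures
AFDB_V3_SIZE_GB = 950             # v4 is expected to be about the same size


def estimate_full_size_gb(subset_gb, subset_count, full_count):
    """Scale a subset's size linearly by structure count (an assumption)."""
    return subset_gb * full_count / subset_count


ratio = FULL_ATLAS_STRUCTURES / HIGH_QUALITY_STRUCTURES
esm_full_gb = estimate_full_size_gb(
    HIGH_QUALITY_SIZE_GB, HIGH_QUALITY_STRUCTURES, FULL_ATLAS_STRUCTURES
)
total_gb = esm_full_gb + AFDB_V3_SIZE_GB

print(f"structure ratio: {ratio:.1f}x")                              # 16.2x
print(f"estimated full ESM atlas DB: {esm_full_gb / 1000:.2f} TB")   # 1.78 TB
print(f"ESM atlas + AFDB: {total_gb / 1000:.2f} TB")                 # 2.73 TB
```

Under these assumptions the count-based ratio comes out slightly above 15x, so the ~1.7TB guess is in the right range.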

yakomaxa commented 1 year ago

Thank you for sharing the estimates and the information on the difference between v3 and v4.

So 4TB of disk space seems enough just to store them for now, although some extra space would be needed to actually work with them.
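A quick way to check whether a given disk has room is sketched below with Python's standard `shutil.disk_usage`. The required size uses the rough figures from this thread (~950GB for AFDB plus ~1.7TB for the full ESM atlas); the 20% headroom factor and the helper names are arbitrary assumptions for illustration.

```python
import shutil

# Rough thread estimates: AFDB (~950 GB) + full ESM atlas (~1.7 TB).
REQUIRED_GB = 950 + 1700


def free_space_gb(path="."):
    """Free space on the filesystem containing `path`, in decimal GB."""
    return shutil.disk_usage(path).free / 1e9


def has_room(required_gb, path=".", headroom=1.2):
    """True if free space covers the requirement plus some working headroom.

    The default headroom of 1.2 is an arbitrary illustrative choice.
    """
    return free_space_gb(path) >= required_gb * headroom


print(f"free: {free_space_gb():.0f} GB, need about {REQUIRED_GB} GB")
print("enough space" if has_room(REQUIRED_GB) else "not enough space")
```

Pass the mount point of the target disk as `path` to check a volume other than the current one.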

Anyway, Foldcomp makes such massive databases manageable for an averagely equipped researcher like me. Great software.

khb7840 commented 1 year ago

The Foldcomp DB of ESM atlas v2023_02 has been uploaded as esmatlas_v2023_02. You can download it with foldcomp.setup('esmatlas_v2023_02') in the Python API.
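Once a database has been downloaded, a few entries can be inspected without decompressing everything. The sketch below assumes that `foldcomp.open` accepts the database path and yields `(name, pdb)` pairs when iterated, as in the usage examples in the foldcomp README, and that the database sits in the current working directory under its download name. `preview_entries` is a hypothetical helper, not part of foldcomp.

```python
def preview_entries(db_path: str = "esmatlas_v2023_02", limit: int = 3):
    """Return the names of the first `limit` entries in a Foldcomp database.

    Hypothetical helper; assumes the database was already downloaded.
    """
    # Imported lazily so that defining the helper does not require foldcomp.
    import foldcomp

    names = []
    with foldcomp.open(db_path) as db:
        # Assumption: each item is a (name, pdb_text) pair.
        for name, pdb in db:
            names.append(name)
            if len(names) >= limit:
                break
    return names


# Example call, to be run after the database has been downloaded:
# print(preview_entries("esmatlas_v2023_02", limit=3))
```

Stopping after a few entries keeps the check fast even on a database with hundreds of millions of structures.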

yakomaxa commented 1 year ago

I appreciate the notification. I'll try that. Thank you again for developing such wonderful software.