Databases used in the MMseqs2 search, local version #20

sokrypton / ColabFold


Open guilhemfaure opened 3 years ago

guilhemfaure commented 3 years ago

Hello, I would like to run the MSA-building step of the Colab notebook locally, using the exact same set of databases, so that I can compare against other databases. Is it possible to get access to the set of databases the MMseqs2 server is using, as well as the MMseqs2 version and the specific command lines executed on the server? In the slides you presented (awesome presentation!), you mentioned you are using a DB clustered at 30% sequence identity, built from SMAG, MGnify, BFD, and MetaEuk. Do you provide a downloadable version of this master 30% sequence identity DB anywhere?

Thanks a lot!

milot-mirdita commented 3 years ago

We are working on preparing the preprint and will make the databases available then. This should hopefully happen very soon.

guilhemfaure commented 3 years ago

Thanks a lot! Looking forward to reading your paper!

avilella commented 3 years ago

I am also interested in running ColabFold (MMseqs2 works great for me) on a local installation, or in some way that allows us to call it programmatically for 10^4 to 10^5 molecules. Looking forward to a solution one way or another, and to reading the details behind it in a preprint.

fstrozzi commented 3 years ago

Hello, the preprint came out (https://www.biorxiv.org/content/10.1101/2021.08.15.456425v1.full.pdf), but it doesn't seem to mention any direct way to download the clustered database used for the MMseqs2 search. Do you think it would be possible to provide a direct link for that?

Thanks for all this work, ColabFold is just great!

martin-steinegger commented 3 years ago

We are so sorry for the delay. We have the database ready, but our FTP storage space is limited. We have asked our IT department to increase the quota. Once that is approved, we will upload the database along with scripts showing how to build and run it.

fstrozzi commented 3 years ago

@martin-steinegger nothing to be sorry about, you are doing a fantastic job with this project!

And thanks for the quick answer. Have you also thought about storing these datasets and the database in the cloud, e.g. in the AWS Open Dataset repository (and/or the equivalent on Google Cloud)?

martin-steinegger commented 3 years ago

@fstrozzi thank you! We would be happy to host our databases on the open dataset repository, but our applications to Google and AWS have never been successful.

milot-mirdita commented 3 years ago

We have uploaded the ColabFold databases to https://colabfold.mmseqs.com. You can find instructions on how to create MMseqs2 databases from these archives in the MMseqs2 wiki.
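For readers who want to script this step, here is a minimal sketch of what turning one downloaded archive into an MMseqs2 database might look like. It only builds and prints the commands, it does not run them. The archive name `uniref30_2103.tar.gz`, the `_db` suffix, and the helper `build_db_commands` are illustrative assumptions rather than anything stated in this thread, and the `tsv2exprofiledb` call is what the wiki instructions are understood to use, so please check the MMseqs2 wiki for the authoritative steps.

```python
import shlex
from pathlib import Path


def build_db_commands(archive: str, workdir: str = ".") -> list[list[str]]:
    """Return the commands (as argv lists) to unpack one archive and
    convert its TSV files into an MMseqs2 expandable profile database.

    Assumption: the archive unpacks to files sharing the archive's base
    name as a prefix, e.g. uniref30_2103.tar.gz -> uniref30_2103*.tsv.
    """
    archive_path = Path(archive)
    # "uniref30_2103.tar.gz" -> "uniref30_2103"
    prefix = archive_path.name.removesuffix(".tar.gz")
    tsv_prefix = str(Path(workdir) / prefix)
    db_name = tsv_prefix + "_db"  # hypothetical naming convention
    return [
        # Unpack the archive into the working directory.
        ["tar", "-xzvf", str(archive_path), "-C", workdir],
        # Convert the unpacked TSV files into an MMseqs2 database.
        ["mmseqs", "tsv2exprofiledb", tsv_prefix, db_name],
    ]


if __name__ == "__main__":
    # Print the commands instead of executing them, so they can be reviewed first.
    for cmd in build_db_commands("uniref30_2103.tar.gz"):
        print(shlex.join(cmd))
```

Once the printed commands look right for your setup, each argv list can be passed to `subprocess.run(cmd, check=True)`.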

We have also finished merging all the MMseqs2 changes back into the main repository (it should work starting from commit https://github.com/soedinglab/MMseqs2/commit/f65187996c3a73b5a9f3f32d08f5de2313ca719b).
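If you build MMseqs2 from source, one way to confirm that a local checkout already contains that commit is `git merge-base --is-ancestor`, which exits with status 0 when the given commit is an ancestor of `HEAD`. The sketch below assumes a git clone of the MMseqs2 repository; the helper names and the `./MMseqs2` path are hypothetical.

```python
import subprocess

# Commit hash taken from the comment above.
REQUIRED_COMMIT = "f65187996c3a73b5a9f3f32d08f5de2313ca719b"


def ancestor_check_command(repo_dir: str, commit: str = REQUIRED_COMMIT) -> list[str]:
    """Build the git command that tests whether `commit` is an ancestor of HEAD."""
    return ["git", "-C", repo_dir, "merge-base", "--is-ancestor", commit, "HEAD"]


def includes_commit(repo_dir: str, commit: str = REQUIRED_COMMIT) -> bool:
    """Return True if the checkout in `repo_dir` contains `commit`.

    git exits with 0 when the commit is an ancestor of HEAD and with a
    non-zero status otherwise (including when the commit is unknown).
    """
    result = subprocess.run(ancestor_check_command(repo_dir, commit), check=False)
    return result.returncode == 0


if __name__ == "__main__":
    # "./MMseqs2" is a placeholder for wherever the repository was cloned.
    print(includes_commit("./MMseqs2"))
```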

We will make running everything easier as soon as possible; in the meantime, you should already be able to get a local ColabFold installation running. We haven't finished the setup procedures for the template search databases yet. Hopefully we will manage that in the next few days too.