soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
MIT License

Metaclust getting smaller and smaller #54

Closed kad-ecoli closed 6 years ago

kad-ecoli commented 6 years ago

Metaclust, a database clustered with the Linclust protocol in MMseqs2, is getting smaller with each release: Metaclust95 2017_01 is 97G, Metaclust 2017_05 is 60G, and Metaclust 2018_01 is only 28G. Shouldn't the number of Metaclust entries increase over time?

kad-ecoli commented 6 years ago

It seems the file size has been updated.

martin-steinegger commented 6 years ago

Thank you for your report. I copied the wrong file to the Metaclust 2018_01 release; it should be fixed now. Information on the current release can be found in the latest version of the preprint: https://www.biorxiv.org/content/early/2018/01/05/104034.full.pdf+html.

The input set of Metaclust has not grown since the first release. The data should be seen as a proof of concept for Linclust. We cannot commit to such a data-intensive procedure at this point: it took weeks to download the full datasets used in this study.

We believe that a sequence database built from metagenomic sequences should instead be offered by institutions that have direct access to huge amounts of metagenomic data (e.g. EMBL, NCBI, JGI, Argonne National Lab, ...).