soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
MIT License

clusterupdate output not consistent with cluster/linclust output #362

Open nick-youngblut opened 4 years ago

nick-youngblut commented 4 years ago

clusterupdate generates only one main database file (e.g., clusters_db), regardless of --threads, while cluster and linclust generate one file per thread (e.g., --threads=4 generates clusters_db.0, clusters_db.1, clusters_db.2, clusters_db.3). This complicates pipelines, because downstream processing of clusters_db may have to handle either multiple inputs (clusters_db.*) or a single input (clusters_db). It would help if clusterupdate and cluster/linclust were consistent. It would be best if cluster/linclust just produced one database file, regardless of the number of threads.

I'm running MMseqs2 11.e1a1c.

milot-mirdita commented 4 years ago

I think MMseqs2 12 should now deal with this better. Could you please update? We resolved many issues with cluster updating in the last release.

nick-youngblut commented 4 years ago

I've been using v12, and clusterupdate just generates 1 main database file, regardless of the number of threads.

nick-youngblut commented 4 years ago

Just to check: is there a command for merging all of the clusters_db.* files generated by linclust and cluster? I didn't see one in the list of subcommands, but it's a long list.

milot-mirdita commented 4 years ago

Currently still no. However, I just added an environment variable that prevents MMseqs2 from creating split databases. If you export MMSEQS_FORCE_MERGE=1, split databases will no longer be produced. This might slow down some intermediate steps somewhat, though. I might also build a module to merge databases when I have time.
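
For pipelines that have to cope with both layouts in the meantime, here is a minimal Python sketch. It is not part of MMseqs2. The helper names find_db_parts and merge_env are hypothetical, and the sketch assumes the file naming described in this thread (either a single clusters_db, or clusters_db.0, clusters_db.1, ... with purely numeric suffixes). It also assumes MMSEQS_FORCE_MERGE is read from the environment of the mmseqs process, as the comment above suggests.

```python
import glob
import os
import re


def find_db_parts(db_path):
    """Return the data file(s) of a database, whether it is split or not.

    Assumption (from this thread): a split database is stored as
    db_path.0, db_path.1, ...; an unsplit one as db_path itself.
    """
    if os.path.isfile(db_path):
        return [db_path]
    # Keep only purely numeric suffixes, so companion files such as
    # db_path.index or db_path.dbtype are skipped.
    candidates = glob.glob(glob.escape(db_path) + ".*")
    parts = [p for p in candidates
             if re.fullmatch(r"\d+", p[len(db_path) + 1:])]
    # Sort numerically so that .10 comes after .2.
    return sorted(parts, key=lambda p: int(p.rsplit(".", 1)[1]))


def merge_env(base_env=None):
    """Return a copy of the environment with MMSEQS_FORCE_MERGE=1 set."""
    env = dict(os.environ if base_env is None else base_env)
    env["MMSEQS_FORCE_MERGE"] = "1"
    return env
```

A downstream step can then iterate over find_db_parts("clusters_db") without caring which module produced the database, and a wrapper can pass env=merge_env() to subprocess.run when it launches mmseqs, so the variable does not have to be exported globally.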