Open nick-youngblut opened 4 years ago
I think MMseqs2 12 should now deal with this better. Could you please update? We resolved many issues with cluster updating in the last release.
I've been using v12, and clusterupdate just generates 1 main database file, regardless of the number of threads.
Just to check: is there a command for merging all of the cluster_db.* files generated by linclust and cluster? I didn't see one in the list of subcommands, but it's a long list.
Currently still no. However, I just added an environment variable that prevents MMseqs2 from creating split databases.
If you export MMSEQS_FORCE_MERGE=1, split databases will no longer be produced. This might slow down some intermediate steps somewhat, though. I might also build a module to merge databases when I have time.
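For pipelines that launch MMseqs2 from a script, the variable can also be set only for the child process. The following is a minimal sketch, not an official recipe: `force_merge_env` and `run_cluster` are hypothetical helper names, and it assumes an `mmseqs` binary on the PATH that is recent enough to honor `MMSEQS_FORCE_MERGE`.

```python
import os
import subprocess


def force_merge_env(base_env=None):
    """Return a copy of the environment with MMSEQS_FORCE_MERGE=1 set."""
    env = dict(os.environ if base_env is None else base_env)
    env["MMSEQS_FORCE_MERGE"] = "1"
    return env


def run_cluster(seq_db, out_db, tmp_dir, threads=4):
    """Hypothetical wrapper: run `mmseqs cluster` with split outputs disabled."""
    cmd = ["mmseqs", "cluster", seq_db, out_db, tmp_dir, "--threads", str(threads)]
    # The variable is passed only to this child process; the calling
    # shell's environment is left untouched.
    subprocess.run(cmd, env=force_merge_env(), check=True)
```

Setting the variable per call keeps the possible slowdown limited to the steps whose output has to be a single file.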
clusterupdate only generates one main database file output (e.g., clusters_db), regardless of --threads, while cluster and linclust generate one file per thread (e.g., --threads=4 generates clusters_db.0, clusters_db.1, clusters_db.2, clusters_db.3). This leads to pipeline complications, given that downstream processing of the clusters_db may require multiple inputs (clusters_db.*) or just one input (clusters_db). It would help if clusterupdate and cluster/linclust were consistent. It would be best if cluster/linclust just produced one database file per thread.

I'm running mmseqs2 11.e1a1c
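Until the outputs are consistent, a downstream step can accept either layout. This is a rough sketch under assumptions: `find_db_parts` is a hypothetical helper, split files are assumed to carry purely numeric suffixes as in the `--threads=4` example above, and any other companion files sharing the prefix (for example an index file) are assumed to be skippable.

```python
import glob
import os
import re


def find_db_parts(prefix):
    """Return the data file(s) for a database prefix.

    Yields [prefix] when a single merged file exists, otherwise the
    numbered splits (prefix.0, prefix.1, ...) in numeric order.
    """
    if os.path.isfile(prefix):
        return [prefix]
    pattern = re.compile(re.escape(prefix) + r"\.(\d+)$")
    parts = []
    for path in glob.glob(glob.escape(prefix) + ".*"):
        match = pattern.match(path)
        if match:  # skip companions with non-numeric suffixes
            parts.append((int(match.group(1)), path))
    # Sort by the integer suffix so that .10 comes after .2
    return [path for _, path in sorted(parts)]
```

A pipeline rule can then iterate over `find_db_parts("clusters_db")` without caring whether the database came from clusterupdate or from cluster/linclust.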