soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

Error when using clusterupdate #17

Closed dcpastor closed 7 years ago

dcpastor commented 7 years ago

I was trying to use clusterupdate to update a clustering (DB_trimmed_clu) built from DB_trimmed (a library of proteins) into DB_clusterupdate, using an extended version of the library (DB_new) that overlaps with it by 2 sequences.

However, the program is not able to finish and I get the error:

mv: rename tmp/aln* to tmp/search/aln: No such file or directory
mv: rename tmp/clu_ to tmp/search/clu*: No such file or directory
mv: rename tmp/input to tmp/search/input_: No such file or directory

Despite these errors, the program continues through the merging of the updated clusterings (see log below).

I also get a varying number of warnings (it changes from run to run) suggesting that my input is DNA, which it is not. For instance:

WARNING: Sequence (dbKey=17) contains only ATGC. It might be a nucleotide sequence.

I attach the log of the cluster update call:

mmseqs clusterupdate DB_trimmed DB_new DB_trimmed_clu DB_clusterupdate tmp &> update_log.txt

Program call: DB_trimmed DB_new DB_trimmed_clu DB_clusterupdate tmp

MMseqs Version: 5ba68d8799901889e4b760c30e98fdc31ef8d572 Sub Matrix blosum62.out Add backtrace false Alignment mode 0 E-value threshold 0.001 Seq. Id Threshold 0 Coverage threshold 0 Target Coverage threshold 0 Max. sequence length 32000 Compositional bias 1 Profile false Realign hit false Max Reject 2147483647 Max Accept 2147483647 Include identical Seq. Id. false Threads 4 Verbosity 3 Sensitivity 4 K-mer size 0 K-score 2147483647 Alphabet size 21 Offset result 0 Split DB 0 Split mode 2 Diagonal Scoring 1 Mask Residues 1 Minimum Diagonal score 15 Spaced Kmer 1 Profile e-value threshold 0.001 Use global sequence weighting false Maximum sequence identity threshold 0.9 Minimum seq. id. 0 Minimum score per column -20 Minimum coverage 0 Select n most diverse seqs 100 Pseudo count a 1 Pseudo count b 1.5 Number search iterations 1 Start sensitivity 4 sensitivity step size 1 Sets the MPI runner
Cluster mode 0 Max depth connected component 1000 Similarity type 2 Cascaded clustering false Cluster fragments false Remove Temporary Files false Match sequences by their ID false

Program call: DB_trimmed DB_new tmp/removedSeqs tmp/mappingSeqs tmp/newSeqs --threads 4 -v 3

MMseqs Version: 5ba68d8799901889e4b760c30e98fdc31ef8d572 Match sequences by their ID false Threads 4 Verbosity 3

=================================================== ====== Filter out the new from old sequences ======

Program call: tmp/newSeqs DB_new tmp/NEWDB.newSeqs

MMseqs Version: 5ba68d8799901889e4b760c30e98fdc31ef8d572 Verbosity 3

Start writing to file tmp/NEWDB.newSeqs Time for merging files: 0 h 0 m 0 s

=== Update the old clustering with the new keys ===

Program call: DB_trimmed_clu tmp/OLDCLUST.mapped --mapping-file tmp/mappingSeqs

MMseqs Version: 5ba68d8799901889e4b760c30e98fdc31ef8d572 Filter column 1 Filter regex ^.*$ Positive filter true Filter file
Mapping file tmp/mappingSeqs Threads 4 Verbosity 3 trim the results to one column false Extract n lines 0 Numerical comparison operator
Numerical comparison value 0 Sort (increasing:1, decreasing: 2, shuffle: 3) the entries by numerical value 0

Mapping keys by file tmp/mappingSeqs Time for merging files: 0 h 0 m 0 s

======= Extract representative sequences ==========

Program call: DB_new DB_new tmp/OLDCLUST.mapped tmp/OLDDB.mapped.repSeq --only-rep-seq

MMseqs Version: 5ba68d8799901889e4b760c30e98fdc31ef8d572 Sub Matrix blosum62.out Profile false Profile e-value threshold 0.001 Allow Deletion false Add internal id false Compositional bias 1 Maximum sequence identity threshold 0.9 Minimum seq. id. 0 Minimum score per column -20 Minimum coverage 0 Select n most diverse seqs 100 Threads 4 Verbosity 3 Compress MSA false Summarize headers false Summary prefix cl Representative sequence true

Start computing representative sequences. Time for merging files: 0 h 0 m 0 s

Done. Time for processing: 0 h 0 m 0s

======= Search the new sequences against ========== ========= previous (rep seq of) clusters ==========

Program call: tmp/NEWDB.newSeqs tmp/OLDDB.mapped.repSeq tmp/newSeqsHits tmp --max-seqs 1 --sub-mat blosum62.out --alignment-mode 0 -e 0.001 --min-seq-id 0 -c 0 --max-seq-len 32000 --comp-bias-corr 1 --max-rejected 2147483647 --max-accept 2147483647 --threads 4 -v 3 -s 4 -k 0 --k-score 2147483647 --alph-size 21 --offset-result 0 --split 0 --split-mode 2 --diag-score 1 --mask 1 --min-ungapped-score 15 --spaced-kmer-mode 1 --e-profile 0.001 --max-seq-id 0.9 --qid 0 --qsc -20 --cov 0 --diff 100 --pca 1 --pcb 1.5 --num-iterations 1 --start-sens 4 --sens-step-size 1

MMseqs Version: 5ba68d8799901889e4b760c30e98fdc31ef8d572 Sub Matrix blosum62.out Add backtrace false Alignment mode 0 E-value threshold 0.001 Seq. Id Threshold 0 Coverage threshold 0 Target Coverage threshold 0 Max. sequence length 32000 Max. results per query 1 Compositional bias 1 Profile false Realign hit false Max Reject 2147483647 Max Accept 2147483647 Include identical Seq. Id. false Threads 4 Verbosity 3 Sensitivity 4 K-mer size 0 K-score 2147483647 Alphabet size 21 Offset result 0 Split DB 0 Split mode 2 Diagonal Scoring 1 Mask Residues 1 Minimum Diagonal score 15 Spaced Kmer 1 Profile e-value threshold 0.001 Use global sequence weighting false Maximum sequence identity threshold 0.9 Minimum seq. id. 0 Minimum score per column -20 Minimum coverage 0 Select n most diverse seqs 100 Pseudo count a 1 Pseudo count b 1.5 Number search iterations 1 Start sensitivity 4 sensitivity step size 1 Sets the MPI runner

/Users/delia/Documents/Clustering/Conversions_updates/trial_rmtmp /Users/delia/Documents/Clustering/Conversions_updates/trial_rmtmp Program call: tmp/NEWDB.newSeqs tmp/OLDDB.mapped.repSeq /Users/delia/Documents/Clustering/Conversions_updates/trial_rmtmp/tmp/pref_4 --sub-mat blosum62.out -k 0 --k-score 2147483647 --alph-size 21 --max-seq-len 32000 --max-seqs 1 --offset-result 0 --split 0 --split-mode 2 -c 0 --comp-bias-corr 1 --diag-score 1 --mask 1 --min-ungapped-score 15 --spaced-kmer-mode 1 --threads 4 -v 3 -s 4

MMseqs Version: 5ba68d8799901889e4b760c30e98fdc31ef8d572 Sub Matrix blosum62.out Sensitivity 4 K-mer size 0 K-score 2147483647 Alphabet size 21 Max. sequence length 32000 Profile false Max. results per query 1 Offset result 0 Split DB 0 Split mode 2 Coverage threshold 0 Compositional bias 1 Diagonal Scoring 1 Mask Residues 1 Minimum Diagonal score 15 Include identical Seq. Id. false Spaced Kmer 1 Threads 4 Verbosity 3

Initialising data structures... Using 4 threads.

Cound not find precomputed index. Compute index. Query database: tmp/NEWDB.newSeqs(size=182) Target database: tmp/OLDDB.mapped.repSeq(size=3) Use kmer size 6 and split 1 using split mode 0 Needed memory (1381015863 byte) of total memory (25769803776 byte) Substitution matrices... Time for init: 0 h 0 m 1s

Process prefiltering step 0 of 1

Index table: counting k-mers... WARNING: Sequence (dbKey=17) contains only ATGC. It might be a nucleotide sequence. WARNING: Sequence (dbKey=21) contains only ATGC. It might be a nucleotide sequence.

Index table: Masked residues: 0 Index table: fill... Index table: removing duplicate entries... Index table init done.

DB statistic Entries: 181 DB Size: 686130054 (byte) Avg Kmer Size: 2.11039e-06 Top 10 Kmers LRIDDA 1 YSLDDA 1 RRLGEA 1 IGEREA 1 RDKPGA 1 GFTIIA 1 WIRAKA 1 LRRDPA 1 HKRERA 1 KTEKRA 1 Min Kmer Size: 0 Empty list: 85765939

Time for index table init: 0 h 0 m 1s

k-mer similarity threshold: 103 k-mer match probability: 0

Starting prefiltering scores calculation (step 0 of 1) Query db start 0 to 182 Target db start 0 to 3

67 k-mers per position. 0 DB matches per sequence. 0 Overflows . 0 sequences passed prefiltering per query sequence. Median result list size: 0 176 sequences with 0 size result lists.

Time for prefiltering scores calculation: 0 h 0 m 0s Time for merging files: 0 h 0 m 0 s

Overall time for prefiltering run: 0 h 0 m 1s Program call: tmp/NEWDB.newSeqs tmp/OLDDB.mapped.repSeq /Users/delia/Documents/Clustering/Conversions_updates/trial_rmtmp/tmp/pref_4 /Users/delia/Documents/Clustering/Conversions_updates/trial_rmtmp/tmp/aln_4 --sub-mat blosum62.out --alignment-mode 0 -e 0.001 --min-seq-id 0 -c 0 --max-seq-len 32000 --max-seqs 1 --comp-bias-corr 1 --max-rejected 2147483647 --max-accept 2147483647 --threads 4 -v 3

MMseqs Version: 5ba68d8799901889e4b760c30e98fdc31ef8d572 Sub Matrix blosum62.out Add backtrace false Alignment mode 0 E-value threshold 0.001 Seq. Id Threshold 0 Coverage threshold 0 Target Coverage threshold 0 Max. sequence length 32000 Max. results per query 1 Compositional bias 1 Profile false Realign hit false Max Reject 2147483647 Max Accept 2147483647 Include identical Seq. Id. false Threads 4 Verbosity 3

Init data structures... Compute score only. Using 4 threads. Calculation of Smith-Waterman alignments. Time for merging files: 0 h 0 m 0 s

All sequences processed.

6 alignments calculated. 6 sequence pairs passed the thresholds (1 of overall calculated). 0.032967 hits per query sequence. Time for alignments calculation: 0 h 0 m 0s Program call: tmp/NEWDB.newSeqs tmp/OLDDB.mapped.repSeq tmp/newSeqsHits tmp/newSeqsHits.swapped.all

MMseqs Version: 5ba68d8799901889e4b760c30e98fdc31ef8d572 Threads 4 Verbosity 3

Time for merging files: 0 h 0 m 0 s Program call: tmp/newSeqsHits.swapped.all tmp/newSeqsHits.swapped --trim-to-one-column

MMseqs Version: 5ba68d8799901889e4b760c30e98fdc31ef8d572 Filter column 1 Filter regex ^.*$ Positive filter true Filter file
Mapping file
Threads 4 Verbosity 3 trim the results to one column true Extract n lines 0 Numerical comparison operator
Numerical comparison value 0 Sort (increasing:1, decreasing: 2, shuffle: 3) the entries by numerical value 0

Time for merging files: 0 h 0 m 0 s

= Merge found sequences with previous clustering =

Program call: tmp/OLDCLUST.mapped tmp/updatedClust tmp/newSeqsHits.swapped tmp/OLDCLUST.mapped

MMseqs Version: 5ba68d8799901889e4b760c30e98fdc31ef8d572 Merge prefixes
Verbosity 3

Merging the results to tmp/updatedClust Done Time for merging files: 0 h 0 m 0 s Time for merging: 0 h 0 m 0s

=========== Extract unmapped sequences ============

Program call: tmp/noHitSeqList DB_new tmp/toBeClusteredSeparately

MMseqs Version: 5ba68d8799901889e4b760c30e98fdc31ef8d572 Verbosity 3

Start writing to file tmp/toBeClusteredSeparately Time for merging files: 0 h 0 m 0 s

===== Cluster separately the alone sequences ======

mv: rename tmp/aln* to tmp/search/aln: No such file or directory
mv: rename tmp/clu_ to tmp/search/clu*: No such file or directory
mv: rename tmp/input to tmp/search/input_: No such file or directory
Program call: tmp/toBeClusteredSeparately tmp/newClusters tmp --sub-mat blosum62.out -s 4 -k 0 --k-score 2147483647 --alph-size 21 --max-seq-len 32000 --offset-result 0 --split 0 --split-mode 2 -c 0 --comp-bias-corr 1 --diag-score 1 --mask 1 --min-ungapped-score 15 --spaced-kmer-mode 1 --threads 4 -v 3 --alignment-mode 0 -e 0.001 --min-seq-id 0 --max-rejected 2147483647 --max-accept 2147483647 --cluster-mode 0 --max-iterations 1000 --similarity-type 2

MMseqs Version: 5ba68d8799901889e4b760c30e98fdc31ef8d572 Sub Matrix blosum62.out Sensitivity 4 K-mer size 0 K-score 2147483647 Alphabet size 21 Max. sequence length 32000 Profile false Max. results per query 300 Offset result 0 Split DB 0 Split mode 2 Coverage threshold 0 Compositional bias 1 Diagonal Scoring 1 Mask Residues 1 Minimum Diagonal score 15 Include identical Seq. Id. false Spaced Kmer 1 Threads 4 Verbosity 3 Add backtrace false Alignment mode 0 E-value threshold 0.001 Seq. Id Threshold 0 Target Coverage threshold 0 Realign hit false Max Reject 2147483647 Max Accept 2147483647 Cluster mode 0 Max depth connected component 1000 Similarity type 2 Cascaded clustering false Cluster fragments false Remove Temporary Files false Sets the MPI runner

Program call: tmp/toBeClusteredSeparately tmp/aln_redundancy

MMseqs Version: 5ba68d8799901889e4b760c30e98fdc31ef8d572 Sub Matrix blosum62.out Alphabet size 3 Seq. Id Threshold 0 Max. sequence length 32000 Threads 4 Verbosity 3

Y -> F V -> I M -> L Q -> E T -> S R -> K S -> A N -> D L -> I H -> E K -> E P -> C E -> D C -> A I -> F G -> A D -> A A -> A Reduced amino acid alphabet: F W X Hashing sequences ... Done. Compute 169 unique hashes. Time for merging files: 0 h 0 m 0 s Program call: tmp/toBeClusteredSeparately tmp/aln_redundancy tmp/clu_redundancy --cluster-mode 0 --max-seqs 300 -v 3 --max-iterations 1000 --similarity-type 2 --threads 4

MMseqs Version: 5ba68d8799901889e4b760c30e98fdc31ef8d572 Cluster mode 0 Max. results per query 300 Verbosity 3 Max depth connected component 1000 Similarity type 2 Threads 4

Init... Opening sequence database... Opening alignment database... done. Clustering mode: Set Cover

Sort entries.

Find missing connections.

Found 7 new connections.

Reconstruct initial order.

Add missing connections.

Time for Read in: 0 m 0s

Writing results... ...done. Time for clustering: 0 m 0s Time for merging files: 0 h 0 m 0 s Total time: 0 m 0s

Size of the sequence database: 176 Size of the alignment database: 176 Number of clusters: 169 Program call: tmp/order_redundancy tmp/toBeClusteredSeparately tmp/input_step_redundancy

MMseqs Version: 5ba68d8799901889e4b760c30e98fdc31ef8d572 Verbosity 3

Start writing to file tmp/input_step_redundancy Time for merging files: 0 h 0 m 0 s Program call: tmp/input_step_redundancy tmp/input_step_redundancy tmp/pref --sub-mat blosum62.out -s 4 -k 0 --k-score 2147483647 --alph-size 21 --max-seq-len 32000 --max-seqs 300 --offset-result 0 --split 0 --split-mode 2 -c 0 --comp-bias-corr 1 --diag-score 1 --mask 1 --min-ungapped-score 15 --spaced-kmer-mode 1 --threads 4 -v 3

MMseqs Version: 5ba68d8799901889e4b760c30e98fdc31ef8d572 Sub Matrix blosum62.out Sensitivity 4 K-mer size 0 K-score 2147483647 Alphabet size 21 Max. sequence length 32000 Profile false Max. results per query 300 Offset result 0 Split DB 0 Split mode 2 Coverage threshold 0 Compositional bias 1 Diagonal Scoring 1 Mask Residues 1 Minimum Diagonal score 15 Include identical Seq. Id. false Spaced Kmer 1 Threads 4 Verbosity 3

Initialising data structures... Using 4 threads.

Cound not find precomputed index. Compute index. Query database: tmp/input_step_redundancy(size=169) Target database: tmp/input_step_redundancy(size=169) Use kmer size 6 and split 1 using split mode 0 Needed memory (1381292076 byte) of total memory (25769803776 byte) Substitution matrices... Time for init: 0 h 0 m 0s

Process prefiltering step 0 of 1

Index table: counting k-mers...

Index table: Masked residues: 166 Index table: fill... Index table: removing duplicate entries... Index table init done.

DB statistic Entries: 30623 DB Size: 686312706 (byte) Avg Kmer Size: 0.000357052 Top 10 Kmers GTKRRA 13 NTLRYA 13 RLRRLR 13 RIRRLR 12 GRRANL 11 TWYINL 11 SITLMR 11 GVITGR 10 FSWYAT 10 AELQFV 9 Min Kmer Size: 0 Empty list: 85740284

Time for index table init: 0 h 0 m 1s

k-mer similarity threshold: 103 k-mer match probability: 0

Starting prefiltering scores calculation (step 0 of 1) Query db start 0 to 169 Target db start 0 to 169

68 k-mers per position. 375 DB matches per sequence. 0 Overflows . 25 sequences passed prefiltering per query sequence. Median result list size: 21 0 sequences with 0 size result lists.

Time for prefiltering scores calculation: 0 h 0 m 0s Time for merging files: 0 h 0 m 0 s

Overall time for prefiltering run: 0 h 0 m 2s Program call: tmp/input_step_redundancy tmp/input_step_redundancy tmp/pref tmp/aln --sub-mat blosum62.out --alignment-mode 0 -e 0.001 --min-seq-id 0 -c 0 --max-seq-len 32000 --max-seqs 300 --comp-bias-corr 1 --max-rejected 2147483647 --max-accept 2147483647 --threads 4 -v 3

MMseqs Version: 5ba68d8799901889e4b760c30e98fdc31ef8d572 Sub Matrix blosum62.out Add backtrace false Alignment mode 0 E-value threshold 0.001 Seq. Id Threshold 0 Coverage threshold 0 Target Coverage threshold 0 Max. sequence length 32000 Max. results per query 300 Compositional bias 1 Profile false Realign hit false Max Reject 2147483647 Max Accept 2147483647 Include identical Seq. Id. false Threads 4 Verbosity 3

Init data structures... Compute score only. Using 4 threads. Calculation of Smith-Waterman alignments. Time for merging files: 0 h 0 m 0 s

All sequences processed.

4237 alignments calculated. 4235 sequence pairs passed the thresholds (0.999528 of overall calculated). 25.0592 hits per query sequence. Time for alignments calculation: 0 h 0 m 0s Program call: tmp/input_step_redundancy tmp/aln tmp/clu_step0 --cluster-mode 0 --max-seqs 300 -v 3 --max-iterations 1000 --similarity-type 2 --threads 4

MMseqs Version: 5ba68d8799901889e4b760c30e98fdc31ef8d572 Cluster mode 0 Max. results per query 300 Verbosity 3 Max depth connected component 1000 Similarity type 2 Threads 4

Init... Opening sequence database... Opening alignment database... done. Clustering mode: Set Cover

Sort entries.

Find missing connections.

Found 656 new connections.

Reconstruct initial order.

Add missing connections.

Time for Read in: 0 m 0s

Writing results... ...done. Time for clustering: 0 m 0s Time for merging files: 0 h 0 m 0 s Total time: 0 m 0s

Size of the sequence database: 169 Size of the alignment database: 169 Number of clusters: 17 Program call: tmp/toBeClusteredSeparately tmp/clu tmp/clu_redundancy tmp/clu_step0

MMseqs Version: 5ba68d8799901889e4b760c30e98fdc31ef8d572 Verbosity 3

List amount 176 Clustering step 1... Clustering step 2... Writing the results... Time for merging files: 0 h 0 m 0 s ...done.

==== Merge the updated clustering together with === ===== the new clusters ======

Program call: tmp/updatedClust tmp/newClusters DB_clusterupdate

MMseqs Version: 5ba68d8799901889e4b760c30e98fdc31ef8d572 Verbosity 3

Time for merging files: 0 h 0 m 0 s Time for concatenating DBs: 0 h 0 m 0s

ClovisG commented 7 years ago

Hi, Thanks for reporting this. The 3 error messages you get are not critical. They are due to a recent change in the default parameters of mmseqs, which means the aln*, clu and input_ files are no longer generated. This shouldn't affect the workflow. From your call, you should actually get a DB_clusterupdate database. Could you check whether it conforms to what you expected?
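As an illustration of why a missing set of intermediate files can be harmless, here is a minimal Python sketch of a move step that skips a pattern when nothing matches it, instead of failing the way a bare `mv` does. It is not the MMseqs2 workflow script; the function name `move_matching` and the file patterns used in the demo are hypothetical.

```python
from __future__ import annotations

import glob
import shutil
import tempfile
from pathlib import Path


def move_matching(pattern: str, dest_dir: str) -> list[str]:
    """Move every file matching `pattern` into `dest_dir`.

    Returns the names of the files that were moved. When nothing
    matches, nothing is moved and no error is raised.
    """
    matches = sorted(glob.glob(pattern))
    if not matches:
        # A bare `mv` would report "No such file or directory" here.
        return []
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    moved = []
    for src in matches:
        shutil.move(src, str(Path(dest_dir) / Path(src).name))
        moved.append(Path(src).name)
    return moved


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical intermediate file; only one of the two patterns matches.
        Path(tmp, "clu_step0").write_text("")
        search_dir = str(Path(tmp, "search"))
        print(move_matching(str(Path(tmp, "aln_*")), search_dir))
        print(move_matching(str(Path(tmp, "clu_*")), search_dir))
```

The guard on an empty match list is the whole point: the absence of optional intermediate files becomes a no-op rather than an error message.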

Concerning the ATGC warning: it is just a notice to the user that some sequences contain only the residues A, T, G and C. If you sometimes do not get the warning, it may be because the tmp folder was not empty and mmseqs did not perform the search again.
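The letters A, T, G and C are also valid amino acid codes (alanine, threonine, glycine, cysteine), so a short or low-complexity protein can trip this kind of check. A minimal sketch of such a test, assuming it only looks at the set of letters in the sequence (the function name `looks_like_nucleotides` is hypothetical and this is not the MMseqs2 implementation):

```python
def looks_like_nucleotides(seq: str) -> bool:
    """Return True when a non-empty sequence uses only the letters A, T, G, C."""
    residues = set(seq.upper())
    return bool(residues) and residues <= set("ATGC")


if __name__ == "__main__":
    # A short peptide made only of Gly, Ala, Thr and Cys is flagged,
    # even though it is a perfectly valid protein sequence.
    for seq in ("GATTACA", "GCTAGC", "MKTAYIAKQR"):
        print(seq, looks_like_nucleotides(seq))
```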

Best, Clovis.

dcpastor commented 7 years ago

Thanks for your reply. You are right, I get a DB_clusterupdate and DB_clusterupdate.index when the process finishes. The reason I thought clusterupdate didn't finish is that I get an error when I try to extract the fasta file from DB_clusterupdate:

mmseqs createseqfiledb DB_new DB_clusterupdate clu_rm

The output is:

MMseqs Version: 5ba68d8799901889e4b760c30e98fdc31ef8d572 Min Sequences 1 Max Sequences 2147483647 HH format false Threads 4 Verbosity 3

Invalid entry in line 1! Invalid entry in line 2! Invalid entry in line 3! Invalid entry in line 4! Invalid entry in line 5! Invalid entry in line 6!

I thought that maybe DB_new didn't contain the sequences that DB_clusterupdate refers to, and that the database merging DB_trimmed and DB_new is missing, probably because clusterupdate stopped before generating it due to the problems I described in my first post.

But if this is not the case, how can I generate the fasta file from DB_clusterupdate? Which database should I use?

Thanks in advance.

ClovisG commented 7 years ago

Hi, Thank you for your report. If you pull the latest commit, it should solve your problem.

Best, Clovis.

dcpastor commented 7 years ago

Thanks, now it is able to generate the fasta file, although 6 sequences are missing. Those 6 are lost in the createseqfiledb step.

mmseqs createseqfiledb DB_new DB_clusterupdate clu_rm

Program call: DB_new DB_clusterupdate clu_rm

MMseqs Version: d410b871f195076386dc3d7382db26e976cb6db5 Min Sequences 1 Max Sequences 2147483647 HH format false Threads 4 Verbosity 3

Invalid entry in cluster 12, line 1! Invalid entry in cluster 12, line 2! Invalid entry in cluster 12, line 3! Invalid entry in cluster 12, line 4! Invalid entry in cluster 12, line 5! Invalid entry in cluster 12, line 6! Time for merging files: 0 h 0 m 0 s

ClovisG commented 7 years ago

Hi, Could you upload your files and call script somewhere so that I can take a closer look at them?

Best, Clovis.

dcpastor commented 7 years ago

Hi, Sorry, I should have started from scratch with the new commit. I did so and I no longer get the "Invalid entry in cluster..." error. However, the representative of each cluster is not among the members of that cluster (see the .tsv and .fa in the results folder). The code and datasets that I used are here: https://www.dropbox.com/sh/pd9qdkrq084lu29/AACL60pabYsgUhqnCEnsASgea?dl=0

My results are: https://www.dropbox.com/sh/umndghqhmibogdu/AADulxnmCH66-rtEAkjtBD9ga?dl=0
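A quick way to spot this symptom is to scan the TSV for representatives that never appear among their own members. The sketch below assumes a two-column, tab-separated layout (representative in the first column, member in the second); the function name and the sample keys are hypothetical.

```python
import csv
import io
from collections import defaultdict


def clusters_missing_representative(tsv_text: str) -> list:
    """Return the first-column entries that never appear as a member
    of their own cluster."""
    members = defaultdict(set)
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        members[row[0]].add(row[1])
    return sorted(rep for rep, mem in members.items() if rep not in mem)


if __name__ == "__main__":
    # Made-up keys: cluster "A" lists itself, cluster "C" does not.
    sample = "A\tA\nA\tB\nC\tD\nC\tE\n"
    print(clusters_missing_representative(sample))
```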

ClovisG commented 7 years ago

Hi, I'm glad that the cluster update works for you now.

Concerning the generated TSV file:
(i) When you have two different databases DB_old and DB_new, the ffindex keys do not necessarily match.
(ii) In a freshly clustered database, each cluster key is the ID of the cluster's representative sequence (the first sequence appearing in the cluster).
(iii) clusterupdate tries to keep cluster keys stable (i.e. if a cluster is preserved between the old clustering and the updated one, its key remains the same).

(i) + (ii) + (iii) => in an updated clustering database, the ffindex keys of the clusters no longer point to the ffindex keys of their representative sequences.
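The consequence can be reproduced with a toy model in Python. It is only a sketch of the reasoning above, not of the clusterupdate code: the keys, the old-to-new mapping and both function names are made up for illustration.

```python
def remap_members(old_clusters: dict, old_to_new: dict) -> dict:
    """Translate member keys into the new database's key space while
    leaving every cluster key unchanged (stable cluster keys)."""
    return {
        cluster_key: [old_to_new[m] for m in members]
        for cluster_key, members in old_clusters.items()
    }


def key_points_to_member(clusters: dict) -> dict:
    """For each cluster, report whether its key is also one of its members."""
    return {key: key in members for key, members in clusters.items()}


if __name__ == "__main__":
    # Fresh clustering: each cluster key equals the key of its first member.
    old_clusters = {0: [0, 1], 2: [2]}
    # In the new database the same sequences received different keys.
    old_to_new = {0: 1, 1: 2, 2: 3}

    updated = remap_members(old_clusters, old_to_new)
    print(updated)
    print(key_points_to_member(old_clusters))
    print(key_points_to_member(updated))
```

In the fresh clustering every cluster key is one of the cluster's members; after remapping with stable cluster keys, none of them is.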

So in your TSV file, the cluster composition should be right, but the representative sequence (first column) is wrong. If you want the first column to actually contain the representative sequence of each cluster, you can pull the latest commit of MMseqs2, redo the cluster update and call createtsv this way:

mmseqs createtsv DB_new DB_new DB_clusterupdate DB_clusterupdate.tsv --first-seq-as-repr
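The idea of taking the first sequence of each cluster as its representative can be sketched as follows. This illustrates the concept only and says nothing about how the flag is implemented in MMseqs2; the function name and the keys are hypothetical.

```python
def rows_with_first_member_as_rep(clusters: dict) -> list:
    """Build (representative, member) rows, taking the first listed member
    of each cluster as its representative and ignoring the cluster key."""
    rows = []
    for members in clusters.values():
        if not members:
            continue  # an empty cluster contributes no rows
        representative = members[0]
        rows.extend((representative, member) for member in members)
    return rows


if __name__ == "__main__":
    # Cluster keys 0 and 2 do not occur among the members,
    # so they cannot serve as representatives.
    updated_clusters = {0: [1, 2], 2: [3]}
    for rep, member in rows_with_first_member_as_rep(updated_clusters):
        print(f"{rep}\t{member}")
```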

Best, Clovis.

dcpastor commented 7 years ago

Thanks so much for the explanation and the new commit! Everything is working perfectly :)

Best, Delia

ClovisG commented 7 years ago

Hi Delia, Thank you for your reports. I'll close the issue, feel free to re-open it if you encounter further problems.

Best, Clovis.