Closed dcpastor closed 7 years ago
Hi, Thanks for reporting this. The 3 error messages you get are not critical. They are caused by a recent change in the default parameters of mmseqs, which no longer generates the aln*, clu and input_ files. This shouldn't affect the workflow. From your call, you should actually get a DB_clusterupdate database. Could you check whether it matches what you expected?
Concerning the ATCG warning, it is only a notice to the user that some sequences contain only the residues A, T, C and G. If you sometimes do not get the warning, it may be because the tmp folder was not empty and mmseqs did not perform the search again.
Best, Clovis.
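To see which sequences could trigger the ATGC warning described above, one can scan the input FASTA for records whose residues are all A, T, G or C. This is only a rough awk approximation and not how mmseqs implements the check; the file name toy.fasta and its contents are made up for illustration.

```shell
# Toy input (hypothetical): three records, two of them made only of A/T/G/C.
cat > toy.fasta <<'EOF'
>seq1
MKVLAAGTC
>seq2
GATTACA
>seq3
ACGT
TTGA
EOF

# Print the header of every record whose sequence contains only A, T, G, C.
# Multi-line records are concatenated before the test is applied.
atgc_only() {
    awk '
        function report() {
            if (name != "" && seq ~ /^[ATGCatgc]+$/) print name
        }
        /^>/ { report(); name = substr($0, 2); seq = ""; next }
        { seq = seq $0 }
        END { report() }
    ' "$1"
}

atgc_only toy.fasta
```

Short proteins consisting only of alanine, threonine, glycine and cysteine are legitimate, so a hit here does not mean the input is DNA.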
Thanks for your reply. You are right, I get a DB_clusterupdate and DB_clusterupdate.index when the process finishes. The reason why I thought clusterupdate didn't finish is that when I try to extract the fasta file from DB_clusterupdate I get the following error:
mmseqs createseqfiledb DB_new DB_clusterupdate clu_rm
I get another error:
MMseqs Version: 5ba68d8799901889e4b760c30e98fdc31ef8d572
Min Sequences 1
Max Sequences 2147483647
HH format false
Threads 4
Verbosity 3
Invalid entry in line 1!
Invalid entry in line 2!
Invalid entry in line 3!
Invalid entry in line 4!
Invalid entry in line 5!
Invalid entry in line 6!
I thought that maybe DB_new didn't contain the sequences that DB_clusterupdate refers to, and that the database that merges DB_trimmed and DB_new is missing, probably because clusterupdate stopped before it was generated due to the problems I described in the first post.
But if this is not the case, how can I generate the fasta file from DB_clusterupdate? Which database should I use?
Thanks in advance.
Hi, Thank you for your report. If you pull the latest commit, it should solve your problem.
Best, Clovis.
Thanks, now it is able to generate the fasta file, although 6 sequences are missing. Those 6 are lost in the createseqfiledb step.
mmseqs createseqfiledb DB_new DB_clusterupdate clu_rm
Program call: DB_new DB_clusterupdate clu_rm
MMseqs Version: d410b871f195076386dc3d7382db26e976cb6db5
Min Sequences 1
Max Sequences 2147483647
HH format false
Threads 4
Verbosity 3
Invalid entry in cluster 12, line 1!
Invalid entry in cluster 12, line 2!
Invalid entry in cluster 12, line 3!
Invalid entry in cluster 12, line 4!
Invalid entry in cluster 12, line 5!
Invalid entry in cluster 12, line 6!
Time for merging files: 0 h 0 m 0 s
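When sequences go missing between the input and the extracted FASTA, as reported above, a quick way to identify them is to compare the header lines of the two files. The sketch below uses only standard tools; the file names and toy contents are hypothetical, and it assumes that headers are preserved unchanged in the extracted file.

```shell
# Toy files (hypothetical names and contents).
printf '>a\nMK\n>b\nLV\n>c\nGH\n' > input_toy.fasta
printf '>a\nMK\n>c\nGH\n' > extracted_toy.fasta

# Print headers that occur in the first FASTA but not in the second.
missing_headers() {
    grep '^>' "$1" | sort -u > "$1.ids"
    grep '^>' "$2" | sort -u > "$2.ids"
    # comm -23 keeps lines unique to the first (sorted) file.
    comm -23 "$1.ids" "$2.ids"
}

missing_headers input_toy.fasta extracted_toy.fasta
```

On the toy files this lists the single header that is absent from the extracted file.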
Hi, Could you upload your files and call script somewhere so that I can take a closer look at them?
Best, Clovis.
Hi, Sorry, I should have started from scratch with the new commit... I did it and I don't get that "Invalid entry in cluster..." error. However, the representatives of each cluster are not part of the elements of the cluster (see the .tsv and .fa in the results folder). The code and datasets that I used are contained here: https://www.dropbox.com/sh/pd9qdkrq084lu29/AACL60pabYsgUhqnCEnsASgea?dl=0
My results are: https://www.dropbox.com/sh/umndghqhmibogdu/AADulxnmCH66-rtEAkjtBD9ga?dl=0
Hi, I'm glad that the cluster update works for you now.
Concerning the generated TSV file:
(i) When you have two different databases DB_old and DB_new, the ffindex keys do not necessarily match.
(ii) In a freshly clustered database, the keys of the clusters get the same IDs as their representative sequences (the first sequence appearing in the cluster).
(iii) clusterupdate tries to keep cluster keys stable (i.e. if a cluster is preserved between the old clustering and the updated one, its key remains the same).
(i) + (ii) + (iii) => the ffindex keys of an updated clustering database no longer point to the ffindex keys of their representative sequences.
So in your TSV file the cluster composition should be right, but the representative sequence (first column) is wrong. If you want the first column to actually contain the representative sequence of each cluster, you can pull the latest commit of MMseqs2, redo the cluster updating procedure and call createtsv this way:
mmseqs createtsv DB_new DB_new DB_clusterupdate DB_clusterupdate.tsv --first-seq-as-repr
Best, Clovis.
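The idea of taking the first sequence of each cluster as its representative can be illustrated on a plain two-column TSV. The awk sketch below does not reproduce what mmseqs does internally; it assumes a tab-separated file with the cluster key in column 1 and one member per row in column 2, with the representative listed first, and all identifiers are made up.

```shell
# Toy TSV (hypothetical identifiers): column 1 = stale cluster label, column 2 = member.
printf 'protQ\tprotX\nprotQ\tprotY\nprotZ\tprotZ\n' > toy_clusters.tsv

# Replace column 1 with the first member listed for each cluster label.
first_member_as_repr() {
    awk 'BEGIN { FS = OFS = "\t" }
         !($1 in repr) { repr[$1] = $2 }   # remember the first member per cluster
         { print repr[$1], $2 }' "$1"
}

first_member_as_repr toy_clusters.tsv
```

In the toy output the label protQ, which is not a member of its own cluster, is replaced by protX, while the cluster composition is unchanged.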
Thanks so much for the explanation and the new commit! Everything is working perfectly :)
Best, Delia
Hi Delia, Thank you for your reports. I'll close the issue, feel free to re-open it if you encounter further problems.
Best, Clovis.
I was trying to use clusterupdate to update a clustering (DB_trimmed_clu) built from DB_trimmed (a library of proteins) to DB_clusterupdate, using an extended version of the library (DB_new) with an overlap of 2 sequences.
However, the program is not able to finish and I get the error:
mv: rename tmp/aln* to tmp/search/aln: No such file or directory
mv: rename tmp/clu_ to tmp/search/clu*: No such file or directory
mv: rename tmp/input to tmp/search/input_: No such file or directory
Nevertheless, the program is able to continue until the merging of the updated clusterings (see log below).
I also get a random number of warnings (depending on the execution) pointing out that I am using DNA, but I am not. For instance:
WARNING: Sequence (dbKey=17) contains only ATGC. It might be a nucleotide sequence.
I attach the log of the cluster update call:
mmseqs clusterupdate DB_trimmed DB_new DB_trimmed_clu DB_clusterupdate tmp &> update_log.txt
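Because leftover files in the tmp folder can make mmseqs skip steps, as noted in the replies above, it can help to rerun the update from an empty tmp directory. A minimal sketch, reusing the call from this report; the helper name fresh_tmp is hypothetical, and the guard only runs mmseqs when both the binary and the input database are present.

```shell
# Remove and recreate the working directory so no stale intermediates are reused.
# Caution: this deletes everything inside the given directory.
fresh_tmp() {
    rm -rf -- "$1"
    mkdir -p -- "$1"
}

fresh_tmp tmp

# Same call as in the report; skipped if mmseqs or the input database is missing.
if command -v mmseqs >/dev/null 2>&1 && [ -e DB_trimmed ]; then
    mmseqs clusterupdate DB_trimmed DB_new DB_trimmed_clu DB_clusterupdate tmp > update_log.txt 2>&1
fi
```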